Blue Coat Labs

Labs Blog

Bad Guys Using Internationalized Domain Names (IDNs)

Chris Larsen, Tim van der Horst

[This is the research we presented at this year's RSA Conference. Apologies for having taken so long to get it written up! Please be aware that this document contains a large number of "exotic" characters, from many different languages and character sets. Depending on your operating system, browser, and installed fonts, not all of the characters may display properly. (BTW, kudos to Microsoft's PowerPoint team: it handled all of these with flying colors when we built our original presentation.) --C.L.]



Where in the World is xn--80atbrbl6f.xn--p1ai?


International Criminals Hiding Out in Internationalized Domain Names



Table of Contents:


Part 1: General Introduction

  -1a- A (Very) Brief History of IDNs and Punycode

  -1b- A (Very) Brief Tutorial on IDNs/Punycode

  -1c- Support for IDNs/Punycode

Part 2: Why Should I Care?

Part 3: IDN Abuse and Deception

  -3a- The Basics

  -3b- Real-world Examples

  -3c- Unicode-only Names that Look Like ASCII

  -3d- More Fun With IDNs

Part 4: Some IDN Statistics

Part 5: Malicious Use of IDNs

  -5a- IDNs and Malware

  -5b- IDNs and Botnets

Part 6: Summary and Recommendations

  -6a- Brand Abuse

  -6b- Security Posture


Part 1a - A (Very) Brief History of IDNs and Punycode

The original Internet was ASCII only, which isn't really surprising, as it was built in the United States, and English is a language that can be written entirely in characters in the ASCII set.

Most of the World's languages, however, require characters that are not found in ASCII. This has led to dozens of various character sets (or "encodings") in other countries and regions. Out of these historical encodings emerged Unicode, which is one of the coolest technologies since Morse Code...

Today, most of the non-English documents on the Web (and many of the English ones, for that matter) are in Unicode. While this made things much more convenient for viewing documents, it raised the question of why Internet domain names were still only available in ASCII letters, digits, and the dash character. And this eventually led to punycode.


A rough timeline of IDNs and punycode would begin with the initial proposal in 1996; by 2003, some registries began allowing IDNs (e.g., .cn, .info, .jp, .tw, etc.). By 2008-2009, the IDN/punycode spec was finalized, and testing had begun on new "IDN ccTLDs", so that not just the domain name, but the Top-level Domain (TLD) could be in a native language encoding.

The history of IDNs within the WebPulse ecosystem matches up very well with this timeline.  (More on this later...)


Part 1b - A (Very) Brief Tutorial on IDNs/Punycode

In a nutshell, an IDN is a domain name in "native" form (actually, any portion, or label, of a host name: a subdomain, the root domain, or the top-level domain); to preserve compatibility with legacy Internet standards, primarily DNS, the native form is converted into an ASCII-only format called punycode. Here are some simple examples, with the punycode version on the left:

  • = bluecȯ
  • = bluecoạ
  • = bluecoaṫ.com


To mentally decode a punycode URL, start from the “xn--” (which signals "this label is in punycode"), and move to the last dash: the interval is the "ASCII part" of the label. In these examples, you can see that we've preserved seven of the eight ASCII characters in the original "bluecoat", and replaced a single non-ASCII character with a chunk of letters and digits.

That chunk sits in the space from the last dash to the dot -- it is the punycode version of whatever non-ASCII characters were in the original name. It's what we would call "deep magic", as the encoding represents both the specific characters and their offsets (within the ASCII portion) -- it isn't UTF-8, or UTF-16, or UTF-32, or "UTF-anything". (If you're curious about learning more, please refer to one of the Wikipedia links above.)
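For readers who would rather experiment than decode by hand, here is a minimal sketch using the "punycode" codec that ships with Python. The helper names (to_punycode_label, to_unicode_label, convert_host) are made up for this illustration, and the sketch deliberately skips the normalization and validity checks that browsers and registries apply on top of raw punycode. The "bücher" name is the usual textbook example, not one from our logs.

```python
def to_punycode_label(label: str) -> str:
    """Encode a single label; pure-ASCII labels pass through unchanged."""
    if label.isascii():
        return label
    return "xn--" + label.encode("punycode").decode("ascii")


def to_unicode_label(label: str) -> str:
    """Decode a single label if it carries the "xn--" prefix."""
    if label.lower().startswith("xn--"):
        return label[4:].encode("ascii").decode("punycode")
    return label


def convert_host(host: str, convert) -> str:
    """Apply a per-label converter to every dot-separated label."""
    return ".".join(convert(part) for part in host.split("."))


print(convert_host("bücher.example", to_punycode_label))  # xn--bcher-kva.example
print(convert_host("xn--p1ai", to_unicode_label))         # рф
```

Note that the conversion works label by label: only the labels that contain non-ASCII characters get the "xn--" prefix.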


Part 1c - Support for IDNs/Punycode

All of the major browsers have long supported IDNs. Typically, they are auto-converted from native form to punycode form in the address bar, and when mousing over a link on the page. (The browser converts any IDNs it finds into the punycode form prior to submitting it to DNS.)

In applications like Microsoft Office and Adobe Acrobat, your mileage may vary. Typically, MS Office documents containing an IDN will display a warning if the link is clicked, and a mouseover displays the alternate form, because people have been aware since the early days of IDNs that they could be used in deceptive attacks to mimic legitimate domain names. In Acrobat, the warning is more generic.

We also did some testing with iOS and Android devices. Both show IDN links "as-is" (natively), and you can do a "press and hold" action on the link to display the punycode. (It's doubtful that many mobile device users are even aware of the "press and hold" action, let alone using it regularly to check links before they click on them...)


Part 2 - Why Should I Care?

In 2003, we added our first IDN URL to the Blue Coat Web Filter database. And that was the Grand Total for IDNs that year: one. (Recall that this is the year that IDNs began to be allowed in some registries.)

In subsequent years, the number of IDNs we added to our database slowly increased:

[Chart: year-over-year growth of IDN-based URLs in our traffic logs]

As you can see, 2008-2009, when the spec was finalized, is when we began seeing an uptick, and then it really took off in 2012 and 2013. Clearly, it's something worth paying attention to. But it hadn't been on our radar screens much, and here's why...

Blue Coat is probably not too dissimilar to IT shops everywhere when it comes to dealing with new technology. There are a couple of basic questions to answer:

  • Does it break anything?
  • Is it dangerous?

And we'd dealt with these years ago: we'd updated our systems so we could process IDNs and punycode URLs without breaking, and we'd surveyed how they were being used. And guess what? There wasn't much, if any, shady use going on -- nearly all of the samples we saw in the logs five years ago were simply European companies who had a traditional ASCII domain name, but who wanted a native version with an accented character or two in it.

And so, since there wasn't much evil going on, we moved on to other, more interesting, areas of malware research.


As we got re-interested in IDNs last year, and realized how much shady stuff was going on, we wondered why IDNs hadn't made it onto our internal radar sooner -- so we did some digging into our database and logs:

  • One of the authors (Chris) had added a grand total of three punycode URLs to the database in the last 8 years.
  • A 3rd party feed of malicious domains that we use wasn't much better: it had provided a total of 24 punycode URLs over the years.
  • Meanwhile, our Malnet Tracker module had quietly added over 41,000 punycoded shady and evil domains.

Yikes! That got our attention, since each of those domains had shown up on certified Bad Guy servers. And so this project was born...


Part 3a -- IDN Abuse and Deception: The Basics

As mentioned previously, people were aware from the early days that IDNs could lead to mischief with domain names. One of the Wikipedia entries we linked above includes two example domain names that appear identical:

  • wіkіреdіа.org

But, if you copy these and paste them into your browser's address bar, you should see that the top one produces a punycode domain name, because most of its characters are not ASCII, but Cyrillic equivalents.

The potential for this sort of "homographic" attack isn't just theoretical. As part of our research, we found a handful of early reports of such attacks "in the wild", with this short post from Symantec being a good example. (We also found a few from Trend and Kaspersky, all from the "early days" of IDN usage.)


If you put yourself in the shoes of a Bad Guy who wants to be sneaky, remember that you have two forms of an IDN to consider: the native form and the punycode version. Either (or both) can be constructed to be deceptive:

Unicode Deceptive (Punycode? Not so much...): bluecoaṫ.com
Punycode Deceptive (Unicode? Not so much...): bluecoat☮.com
Both Deceptive: bluecoat®.com

A deceptive punycode name, where the "brand name" is left visually intact, as in the last two examples, is easy to achieve: simply add a non-ASCII character to the beginning, or end, or anywhere in the middle of, the brand name. The result is a punycode name that, to many users, simply represents the brand name plus some "computer codes".

Also worth noting is that the latter two examples highlight another interesting fact: the Unicode spec includes more than just exotic character sets -- it includes a bunch of symbols as well. Some of them are more believable than others... (By the way, that "registered trademark" symbol isn't what you think it is, either. It's actually a character from a section of Unicode that provides for "circled" versions of the regular letters A-Z -- both upper- and lowercase. Actual use of the Unicode "registered trademark" and "copyright" symbols is supposedly not allowed in IDNs.)
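The "brand name plus some computer codes" effect described above is easy to reproduce. The following is a rough sketch, again using Python's built-in "punycode" codec; punycode_form is a hypothetical helper name, and the peace symbol is simply the character from our own example. It ignores whether a registry would actually accept the resulting name.

```python
def punycode_form(label: str) -> str:
    """Return the "xn--" form of a label that contains non-ASCII characters."""
    return "xn--" + label.encode("punycode").decode("ascii")


brand = "bluecoat"
symbol = "\u262e"  # PEACE SYMBOL

# Put the symbol after, before, and in the middle of the brand name.
for decorated in (brand + symbol, symbol + brand, brand[:4] + symbol + brand[4:]):
    encoded = punycode_form(decorated)
    # The ASCII characters are collected, in order, ahead of the last dash,
    # so the brand stays readable wherever the symbol was inserted.
    print(decorated, "->", encoded, "| brand visible:", brand in encoded)
```

In all three cases the punycode begins with "xn--bluecoat-"; only the trailing chunk changes with the position of the inserted character.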


Part 3b -- IDN Abuse and Deception: Real-world Examples

Here are a few examples of domains that came through our logs last year:

IDN (Punycode) IDN (Unicode)
google-Â
ตลาดนัดไทย
yōukù.com
、
ь
texasholdé
pó
㙐㙗㙖㙚㙍㙇

The taobao example is a favorite one. The Bad Guys created a domain name where the first character is a "kanji comma" -- that is, a "wide" comma used in Chinese and Japanese, where it needs to have the same width as the surrounding characters. This form of the name allows them to drop it into a string of Chinese characters, and the human eye will associate the comma with the characters it follows, not with the name itself. If anyone happens to notice that the HTTP link extends past the "taobao" back to include the comma, it just looks like the web page developer made a simple mistake in the HTML. Sneaky!

Also, note that in the milliyet example, the whole string is the punycode (and it’s obviously just targeting Turkish speakers with the punycode version, as the Chinese version would be meaningless to them)...


And a few more illustrations of "brand name abuse" from our logs, to give a sense of the variety of names that are possible:

IDN (Punycode) IDN (Unicode)
youtubç.com
youtiû
youtubeพ
myspà
hotù
faσ
maił


And here is an interesting family of shady IDNs, using the trick of prepending random Unicode characters to the beginning of the brand names, so that the brand names are there for the victims' eyes to focus on in both Unicode and Punycode form:

IDN (Unicode) IDN (Punycode)

(We'll come back to this batch a little later.)


Because of the way punycode works, domain names that look similar in Unicode form may look very dissimilar in punycode, which can obscure a common pattern. In this group of domains, the punycode versions give no clue that the domains are related, but the Unicode versions make it clear that they are:

IDN (Punycode) IDN (Unicode)
xn----8sbcifesaro0aiadpkhsbau проститутки-белгорода
xn----dtbiamanctnibbojhrbat проститутки-кемерово
xn----jtbhaciqjcjlnnbadm7r проститутки-тюмени
xn----otbahsfhjjbaleo9h проститутки-уфы
xn----8sbfmjdapdc2adefbhrttbavj9e проститутки-новокузнецка

(Since the common pattern in the native Russian Unicode domains is "prostitute", we left off the TLDs...)



Part 3c -- IDN Abuse and Deception: Unicode-only Names that Look Like ASCII

Keep in mind that there is no requirement that IDNs must contain some ASCII characters. If you think about it, why should a domain name in a language like Chinese, Japanese, Korean, Arabic, Thai, or Hebrew contain any ASCII letters? IDNs that are Unicode-only take the form of the standard "xn--" prefix, but because there is no ASCII portion, there is no additional dash character -- only the punycode. And although "normal" punycode has a mix of letters and numbers, it's perfectly valid for the encoding to produce a punycode string composed only of letters.

That little observation opens up another interesting avenue for IDN deception: take the brand name or identity you want to abuse, and use it as the entire punycode. As in these made-up examples, where we simply typed interesting words, names, and phrases into an IDN converter with an "xn--" prefix, and then looked at the Unicode they "decoded" into:

IDN (Punycode) IDN (Unicode) 憽憽憶懀憹憸.com ̗̗̗̇̍̈̒̐̑̕̕.com ´Ό   ‐ .com ƘƐƞƜƇƕƜƟƕƔƙƛƟƑ.com

These four examples illustrate the vastness of Unicode:

  • The "bluecoat" example decodes into essentially random Chinese characters.
  • The "rsaconference" example decodes into a ball of diacritics -- linguistic markers used in some languages for things like accent and stress.
  • The "hughthompson" example (Dr. Thompson is the RSA Conference program chair) yields something that looks a bit like an emoticon.
  • The "sharelearnsecure" example (taken from the 2014 RSA Conference theme) yields something that appears to a casual observer to be a reasonable-looking "foreign language" domain name.
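The trick is easy to try at home. Here is a small sketch (decode_as_punycode is a hypothetical helper name) that treats an ordinary ASCII word as if it were the punycode portion of a label, using Python's built-in "punycode" codec. Only the "bluecoat" case is run here.

```python
def decode_as_punycode(word: str) -> str:
    """Interpret an ASCII word as the part of a label that follows "xn--"."""
    return word.encode("ascii").decode("punycode")


decoded = decode_as_punycode("bluecoat")
print(decoded)                               # 憽憽憶懀憹憸, as in the first table row
print(len(decoded))                          # 6 characters from 8 letters
print(all(ord(ch) > 127 for ch in decoded))  # True: nothing ASCII survives
```

Because the word contains no dash, the decoder sees no ASCII portion at all, so every letter is consumed as encoded data.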


Part 3d -- More Fun With IDNs

As mentioned earlier, Unicode is vast, and contains thousands of symbols and odd characters that can be used in very creative ways... These examples from last year's logs are only scratching the surface:

IDN (Punycode) IDN (Unicode)
¶.net
yes♥
web°.com
ノ.com
í.com
ёё.com
•

While the intent behind the creation of IDNs may have been to allow domain names in various native scripts, symbols are clearly creeping in, and in relatively large numbers. Also of note is the huge volume of short one- and two-character domain names made possible in Unicode...

And for some really creative, unexpected, and just plain wild IDNs (and yes, these are real examples that came through in our traffic):

IDN (Unicode) IDN (Punycode)


And finally, our favorite, which totally choked the HTML editor, so we have to show it as a screenshot: [Image: URL built entirely of symbols]

(Yes, Unicode has smileys. And a poo symbol.)


Part 4 -- Some IDN Statistics

IDNs may actually be used in any part (or "label") of a domain name: subdomain, domain, or TLD. And in many cases, more than one label may be coded in punycode. In our logs for 2013, here's how we saw them:

[deeper subdomain] Subdomain2 Subdomain Domain TLD
0.36% 0.70% 6.05% 94.29% 7.95%

In other words, 7.95% of the time when an IDN occurred in our logs, the TLD was in punycode, 94.29% of the time, it was the domain name that was in punycode, and so on. The numbers add to more than 100%, since many IDNs had multiple punycoded labels:

IDN Labels per hostname
1 label 91.37%
2 labels 7.96%
3 labels 0.61%

(That is, 91.37% of the time when we saw an IDN-based URL, only a single label was in punycode.)
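Counting punycoded labels per hostname needs nothing more than a check for the "xn--" prefix. The sketch below shows one way such a tally could be produced; it is not the tooling we used, the function names are invented for the illustration, and "www.example.com" is a placeholder host.

```python
from collections import Counter


def punycoded_label_count(host: str) -> int:
    """Number of dot-separated labels that start with the "xn--" prefix."""
    return sum(1 for label in host.lower().split(".") if label.startswith("xn--"))


def label_histogram(hosts) -> Counter:
    """Map 'punycoded labels per hostname' to how many hosts have that count."""
    return Counter(punycoded_label_count(host) for host in hosts)


sample = [
    "xn--80atbrbl6f.xn--p1ai",    # domain and TLD both in punycode
    "xn--b1aeckkexe3a.xn--p1ai",
    "www.example.com",            # placeholder with no punycode at all
]
print(label_histogram(sample))
```

A host with two punycoded labels is counted once in each of two per-position columns, which is why the per-position percentages sum to more than 100%.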


Looking at the total number of unique sites we saw in 2013, where at least one label in the domain was in punycode:

Rank TLD Percentage Notes
1 .com 43.30%  
2 .net 10.91%  
3 .de 9.79%  
4 .xn--p1ai 7.76% рф (i.e., "RF" -- Russian Federation -- in Cyrillic)
5 .se 3.28%  
6 .biz 2.94%  
7 .org 1.66%  
8 .ch 1.54%  
9 .dk 1.38%  
10 .info 1.37%  
35 .xn--3e0b707e 0.15% 한국 (Korea)
74 .xn--wgbh1c 0.01% مصر (Egypt)
77 .xn--mgbaam7a8h 0.01% امارات (Arab Emirates)
79 .xn--kpry57d 0.01% 台灣 (Taiwan)

(Showing the Top Ten most common TLDs seen in IDN URLs in our logs in 2013, plus a few others where the TLD itself was in punycode, to give a sense of "where the action is" by country in domain name registration -- which is clearly in Russia...)


If we look at the numbers by traffic, rather than by number of unique sites, the table is a bit different:

Rank TLD Percentage
1 .com 37.40%
2 .xn--p1ai 14.49%
3 .de 13.33%
4 .net 6.78%
5 .se 5.97%
6 .dk 2.23%
7 .ch 2.22%
8 .jp 1.73%
9 .fi 1.49%
10 .no 1.44%

(In other words, in terms of people visiting IDN sites, instead of just registering them, Europe is far and away the busiest.)


Another interesting way to look at IDNs is to look at which "blocks" of Unicode (i.e., roughly speaking, different languages or alphabets) the characters in IDNs come from:

Rank Unicode "block" Percentage Unique Chars Seen
1 Basic Latin 55.85% 38
2 Cyrillic 13.13% 164
3 CJK Unified Ideographs 10.79% 8771
4 Katakana 4.76% 95
5 Latin-1 Supplement 4.67% 128
6 Thai 3.26% 92
7 Hangul 2.96% 1811
8 Arabic 1.50% 163
9 Hebrew 1.05% 91
10 Hiragana 0.91% 88

(For example, 10.79% of the time we saw an IDN URL in 2013, it contained one or more characters from the "CJK Unified Ideographs" block in Unicode -- the "Chinese characters" used in Chinese, Japanese, and Korean -- and we saw a total of 8771 unique characters used in those URLs.)


And that leads us back to the family of shady sites we showed earlier, the phishing network that used domains with a random group of Unicode characters at the beginning of each domain name:

IDN (Unicode) Unicode Blocks Used
äժꑠ֩ Yi Syllables, Armenian, Hebrew
äձ䢈થ Gujarati, CJK Unified Ideographs, Armenian
ἀι̝蒐̲ CJK Unified Ideographs, Greek and Coptic
힠ໍາ矘ৗ Hangul Syllables, Bengali, CJK Unified Ideographs, Lao
醨ָ걸ড় CJK Unified Ideographs, Hangul Syllables, Bengali, Hebrew
칐ϟᕸୟ Hangul Syllables, Unified Canadian Aboriginal Syllabics, Greek and Coptic, Oriya

The Unicode characters are not just "random" -- they're so random that they actually come from different Unicode blocks. Meaning that letters from completely unrelated languages are appearing in the same chunk, which is linguistically unlikely, to say the least.
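A rough version of this check can be written with Python's standard library alone. Python has no built-in lookup for Unicode blocks, so the sketch below uses the first word of each character's official Unicode name ("CYRILLIC", "HANGUL", "CJK", and so on) as a stand-in. That proxy is an assumption of this example rather than a real block or script lookup, and the function names are hypothetical.

```python
import unicodedata

# Name prefixes that say nothing about script (digits, the dash, combining marks).
NEUTRAL = {"DIGIT", "HYPHEN-MINUS", "COMBINING"}


def script_hints(label: str) -> set:
    """Collect the leading word of each character's Unicode name."""
    hints = set()
    for ch in label:
        prefix = unicodedata.name(ch, "UNKNOWN").split(" ")[0]
        if prefix not in NEUTRAL:
            hints.add(prefix)
    return hints


def looks_mixed(label: str) -> bool:
    """Flag labels whose characters come from more than one script hint."""
    return len(script_hints(label)) > 1


print(script_hints("проститутки-уфы"))  # a single hint: CYRILLIC
print(looks_mixed("проститутки-уфы"))   # False
```

A production check would need exceptions for combinations that are perfectly normal, such as Japanese names that mix kanji with hiragana and katakana, or any native-script name that includes a few ASCII letters.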


Part 5a -- IDNs and Malware

Up to now, we've looked at a mix of IDN examples, from whimsical ones used for fun, to deceptive ones used for phishing-style attacks.

Now we're ready to dive into examples of malicious IDNs. Again, these are taken from our 2013 traffic logs.

IDNs (Punycode) IDNs (Unicode) Notes
xn--80atbrbl6f.xn--p1ai пэкспак.рф A BlackHole exploit kit site. (This is the example IDN we used in our title; now you know: it's in Russia!)
mail.yahó Had very little traffic; obviously designed as a phishing attack.
Domains appear to be parked, but traffic is characteristic of a "traffic driver" botnet. (Also, it's unusual to have Japanese, Polish, and French domains parked on the same server.)
Fake-codec sites, in an attack via Facebook. (Note that the Chinese names arguably relate to video playback.)


Next, here's a malvertising network, where a technical term was used as an ASCII-based subdomain, with an IDN as the main domain:

IDN (Punycode) IDN (Unicode) Character Blocks Used
format.สดชื่น.com Thai
office.李毅吧.com CJK Unified Ideographs
scandisc.τουμπα.gr Greek and Coptic
freesoft.朱晓辉.com CJK Unified Ideographs
oracul.bä Latin
freesocks.куписам.com Cyrillic

(In this case, the IDN domain names are internally consistent -- no mixing blocks -- but taken as a group, it's an unusual mix of languages to find in association.)

Next up is a network of exploit kit sites (Neutrino) that likewise jump all over the linguistic map:

IDN (Punycode) IDN (Unicode) Character Blocks Used
กรมบังคับคดี.com Thai
דשאמלאכותי.net Hebrew
تحشيش.com Arabic
mipiñ Latin
бунгала.com Cyrillic
貓咪.com CJK Unified Ideographs


Back when we showed the charts of IDN usage statistics, you may have been wondering, why do the Russians need so many .xn--p1ai domains? Well, here's one answer: to serve malicious mobile apps! As this network was doing:

IDN (Punycode) IDN (Unicode)
xn--72-8kca8deo3b.xn--p1ai 72охрана.рф
xn----etbfovibhfeyw1i.xn--p1ai энерго-профит.рф
xn--80at1au5a.xn--p1ai кушаю.рф
xn--80ahdnnn1c9c5a.xn--p1ai дельфания.рф
xn--72-6kce7dfhb.xn--p1ai прораб
xn--b1abfbatzelscdle0o.xn--p1ai энергоремсервис.рф


And lots of exploit kit activity in the .xn--p1ai space as well -- here are some samples from a network of BlackHole and Cool hosts and redirectors:

IDN (Punycode) IDN (Unicode)
xn--66-6kc8bflgfdz.xn--p1ai уралпром
xn--90abjn3alia1k.xn--p1ai веброссия.рф
xn--74-6kcpfqrd3bj.xn--p1ai лидерпак
xn-----wlcj0b.xn--p1ai т-и-к.рф
xn---55-iddegw2b1j.xn--p1ai деньги
xn--b1aeckkexe3a.xn--p1ai плитковед.рф

(Note that we wondered if we'd spotted an "Easter Egg" type of message in that last example: it looks like "black exe" -- which is appropriate for an exkit host, don't you think? But it was probably just random chance.)


Part 5b -- IDNs and Botnets

For this part of the research, we went looking for evidence of DGA names in the IDN space. (DGAs, or Domain Generation Algorithms, are often used by botnet clients to generate large lists of random/junk domain names, which they try to contact for updates and instructions.)

Broadly speaking, there are two classes of DGAs: random-character and random-word. We began our hunt by focusing on the second type...

We pulled lists of IDN domains in a wide variety of languages, and then tasked members of the WebPulse team who speak those languages to scan the list and tell us if the domain names looked believable, or if they consisted of "random words stuck together". Specifically, we checked the most common languages represented in our IDN logs:

  • Chinese
  • Japanese
  • Korean
  • Russian (and some other Cyrillic languages)
  • Arabic
  • Hebrew
  • A mix of the major Western European languages

In every case, the reviewers reported that the domain names did not seem strange or random. So much for DGA Type 2. How about DGA Type 1?

Well, here we fared better. Check out this batch of domains:

꿨ര潘ഫ 隨প䋀ত 脠ତ堀ܴ 차ӎ彐ӌ 纸ம嫰ய
棸չ畘θ 邐क欸ҭ 雘ર⪸મ ⤀ం죰ఁ 뱀ඖ訰ऊ
äӹ厰ӥ 燸ҝ䓠۠ 䏈Ꮨ鈐ഒ 苘ʃ鰸դ 䠐ቑ㻀ଉ
稈ֶ튘ү 毨ा奘ा 屈ፑ蚰ፆ ♨ѹ䈸ѻ 邐ദ焀վ
퐀֊㭰պ 革ಭ漘ಬ äп뫨ޮ 攘գ⍈գ 紨ե髰ե

(And so on...)

This family had over 750 different domain names show up in our logs. The volume of names, and the randomness of the Unicode characters, sure made it look like the results of a botnet DGA. Likewise, the fact that these were rarely observed resolving to an IP address is consistent with a DGA scenario.

The randomness of the Unicode is illustrated by one of our favorite examples: äહἀιᏭ, which combines characters from the following blocks: Latin, Gujarati, Greek and Coptic, and Cherokee. (There really aren't too many Cherokee domains or web pages on the Internet these days...)


However, the traffic volume was very low, and this runs contrary to expectations for a botnet. So we began considering another alternative: that this could be part of a clever command-and-control architecture for a more advanced or targeted threat. The low traffic volume supports this hypothesis, but the exotic nature of the character sets works against it: a stealthy C&C scheme probably wouldn't want to call attention to itself with such weird domain names. (Unless the controllers were counting on people simply skipping over the "xn--" domain names in their logs....)

A second major consideration against this being a targeted attack is that we saw a wider spread of user licenses than we would expect to see in a targeted scenario. So we kept puzzling...


Eventually, we decided to ask for help, and circulated our findings among other researchers who belong to a couple of mailing lists focusing on security and malware issues. We got back a few suggestions, but none of them panned out. Meaning we're still stuck. So, if anyone out there has traffic to domain names like these in your logs, we'd be very interested in working with you to see if we can find an investigative thread to follow!


Part 6a -- Summary and Recommendations: Brand Abuse

First of all, we expect the overall growth in the use of IDNs to continue, and probably to continue accelerating. (Our numbers for the first couple of months of 2014 bear this out.)

Second, we expect to see growth in the use of IDNs in the subdomain space. (Recall that over 90% of the encounters with punycoded domain labels are at the level of the main domain.) We are already seeing IDN subdomains show up on Tumblr, for example, and we expect that the free host domains of the world have even looser rules and oversight for the Unicode characters their users can put in punycoded subdomains than the registrars have for parent domains.

Third, keep in mind the ongoing surge in the number of new TLDs available to people -- all of these spaces will likely allow IDNs.

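Fourth, don't forget "bitsquatting" attacks (registering domains that differ from a legitimate name by a single flipped bit, in order to catch traffic misdirected by memory errors).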

These factors combined mean an exponential explosion in the possible "brand abuse" scenarios available to the Bad Guys. In the past, a Major Brand Name who was concerned about "typo squatter" attacks could reasonably consider the approach of simply registering all (or most) of the "look-alike" domain names out there. (And in .com, .net, .info, .biz, and .org varieties; and also in the major country codes...)

Okay, so maybe that never was such a reasonable approach.


The key point is to realize that in a Unicode world, with lots of new TLDs, and typo-squatting and bitsquatting techniques, you simply can't register your way out of trouble.

You have to focus on an "early detection" model, using a broad pool of traffic, to spot domain variants of your Brand Name as soon as they come into use, so that you can take action.


Part 6b -- Summary and Recommendations: Security Posture

First, it's time to double-check your infrastructure for IDN support, in both Unicode and Punycode forms. (Punycode, by design, shouldn't be breaking anything, as its form is entirely ASCII.)

Second, begin investigating options for displaying IDNs in both forms, Unicode as well as Punycode, in all of your major security logs. In many languages, the punycode form of the IDN will be completely meaningless to someone who's reviewing the logs, and it's not workable to have them constantly cutting-and-pasting those "xn--" names into a punycode converter to see what the actual domain name looked like. (You may need to apply some judicious pressure on your security vendors...)

Third, look for mis-matched Unicode character blocks. As we showed in several of the examples, this is an easy way to flag suspicious IDNs.

Finally, if you're the sort of high-security organization whose lack of business in Russia has allowed you to entirely block the .SU and (maybe) the .RU domains, you will also want to consider blocking all .XN--P1AI domains, which are at least as risky as the first two. There are certainly plenty of normal/safe domains in there, but a lot of evil/shady ones as well.
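As an illustration of the second recommendation (showing both forms in logs), here is a minimal sketch of a log post-processor that appends the Unicode form of any punycoded hostname it finds on a line. The log line format, the "[idn: ...]" suffix, the function names, and the host "example.xn--p1ai" are all assumptions made for the example; a real deployment would hook into whatever your logging pipeline provides.

```python
import re

# Any hostname-like run of characters that contains an "xn--" label.
HOST_WITH_PUNYCODE = re.compile(r"[a-z0-9.-]*xn--[a-z0-9.-]+", re.IGNORECASE)


def decode_label(label: str) -> str:
    """Decode one "xn--" label; leave anything else (or anything malformed) alone."""
    if not label.lower().startswith("xn--"):
        return label
    try:
        return label[4:].encode("ascii").decode("punycode")
    except UnicodeError:
        return label


def annotate(line: str) -> str:
    """Append the Unicode form of every punycoded host found in a log line."""
    hosts = HOST_WITH_PUNYCODE.findall(line)
    if not hosts:
        return line
    decoded = [".".join(decode_label(p) for p in host.split(".")) for host in hosts]
    return f"{line}  [idn: {', '.join(decoded)}]"


print(annotate("GET http://example.xn--p1ai/ 200"))
```

Appending the native form, rather than replacing the punycode, keeps the original log text intact for searching and correlation.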


--C.L. and TvdH
