A homograph attack is based on standards of modern Internet that allow to create (and display in web browsers) URLs with characters from various language sets (with non-ASCII letters). Different languages may contain different but very similar characters. Attackers can register their own domain names that are similar to the existing web addresses. Then they can create their own websites that are, again, the same or very similar to the existing original sites (that usually belong to banks, corporations, email or news services). The phony websites are used for stealing data from users who happened to visit them.
Simple homograph attacks
In the simplest version of such attacks, a fake URL may consist only of simple ASCII alphanumeric characters. The intruder uses symbols that are similar to each other. Often the letter q may be confused with g, or o with 0.
Such URLs may fool some less experienced users:
The ability to use non-English characters in ULR addresses was added in 2003, due to the increasing number of non-English-speaking people that were using Internet. The change allowed to register and use domain names that could have been understood by a much larger number of interested people. Thus it became possible to create web addresses that were combinations of ASCII and non-ASCII characters, or addresses that consisted only of national symbols:
All non-Latin addresses need to be encoded in a special way to be handled by DNS servers. This format is known as Punycode and all browsers translate non-ASCII URLs into Punycode in the background before performing a DNS lookup. A Punycode domain name always starts from xn-- and then contains ASCII characters of the original address followed by encoded Unicode data. For instance, the latter address from the example above will be encoded in the following form:
Such domain names that contain letters from different alphabets are called Internationalized Domain Names (IDNs). They are handled in various ways by different web browsers. Usually every producer implements his own algorithms for determining the display format of requested URLs and usually one of two solutions (with some minor modifications) is preferred:
- Display all URL characters using Unicode, or
- Display all URL characters using Unicode if and only if all the characters belong to the same language that is chosen by user settings; display Punycode URL otherwise.
Homograph attacks using non-ASCII characters
Different languages with characters encoded in a different way, may contain some letters that look the same or at least very similar. Therefore it is possible to create URLs that consist of different characters but are indistinguishable to the human eye.
For example, Latin and Cyrillic alphabets contain a couple of letters that look the same but have completely different meaning and are encoded in a different way:
- (in Latin) a: U+0061, (in Cyrillic) а: U+0430
- (in Latin) c: U+0063, (in Cyrillic) с: U+0441
- (in Latin) p: U+0070, (in Cyrillic) р: U+0440
Given 100 000 characters supported by Unicode (many of which look alike), the intruder has great potential for creating various fake URLs and even the most careful users may be confused. At present neither DNS registrars, nor web browser vendors managed to prevent such attacks from happening.
Finally, it may be shown that an attacker can use a character that happens to look like the actual ASCII slash / (U+002F) - the mathematical division operator ∕ (U+2215). It allows him to set up a subdomain that looks like another real domain, using his own name server and a top-level domain. The fake URL address could look like the one below:
The character located after .com is a mathematical division operator. In a web browser's address bar the character would look like a common slash and the whole URL could be easily confused with the directory a-top-level-domain.com located in the root directory under the domain example.com:
The best protection against homograph attacks seems to be provided by warning or proper handlings such phony addresses by web browsers. Unfortunately this is not always the case and, what is more, the behaviour may differ depending on browser vendor.