Chris Weber: Unraveling Unicode
About: The complex landscape of Unicode provides many angles for exploiting software and end users. We've known about some of these for years: we've seen buffer overflows exploited because of faulty Unicode handling, and we've seen homograph attacks in URLs. However, the real mysteries remain latent, unapparent to most software developers and even to the security community. I'm going to raise awareness of the interesting attack vectors and new areas of research into Unicode, and open people's eyes to the Visual Spoofing attacks of today.
This talk will include demonstrations of several uncommon vulnerabilities and attack vectors, along with the release of a tool to assist in finding these issues. A separate spoof-detection component will also be released to demonstrate how we can defend users against Visual Spoofing attacks. We'll take a close technical look at many issues in Unicode software that are not well known even in the security research community:
* How Unicode characters can be mishandled to take on powerful formatting properties such as white space.
* When unexpected UTF-8 sequences can lead to over-consumption and character deletion, which enable attacks such as cross-site scripting and file system manipulation.
* What happened to non-shortest form UTF-8 and UTF-7?
* Why best-fit mappings lurking in common frameworks and APIs enable drastic misbehavior and attacks within your applications, allowing for control over file systems and interpreters/parsers such as HTML.
* When casing operations enable a special character to be converted into something useful for cross-site scripting and other attacks.
* Why normalization operations can enable a Latin Modifier character to be converted into an exploitable HTML greater-than sign.
* How normalization and casing operations can expand a single character by up to 18x, leading to buffer overflows.
* Why the BOM and Mongolian Vowel Separator are great inputs to use in test cases.
* How Internationalized Domain Names work and why they're still vulnerable to Visual Spoofing attacks today.
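Several of the behaviors in the list above can be reproduced with nothing more than Python's standard library. The sketch below is illustrative only: the specific code points (U+017F, U+212A, U+FE64, U+FF1E, U+FDFA) and the helper name decodes_strictly are choices made for this example, not necessarily the inputs or tooling used in the talk.

```python
import unicodedata

# 1. Non-shortest-form UTF-8: 0xC0 0xAF is an overlong encoding of "/".
#    A strict decoder must reject it instead of quietly producing "/".
def decodes_strictly(raw: bytes) -> bool:
    """Hypothetical helper: True if raw is valid UTF-8 under strict decoding."""
    try:
        raw.decode("utf-8")  # errors="strict" is the default
        return True
    except UnicodeDecodeError:
        return False

overlong_slash_rejected = not decodes_strictly(b"..\xc0\xaf..")

# 2. Over-consumption: a lead byte (0xE0) followed by plain ASCII.
#    A well-behaved decoder replaces only the bad byte with U+FFFD and
#    keeps the quote and angle bracket that follow it.
recovered = b'\xe0">'.decode("utf-8", errors="replace")

# 3. Casing: non-ASCII letters whose case mappings land in ASCII.
upper_long_s = "\u017fcript".upper()  # LATIN SMALL LETTER LONG S
lower_kelvin = "\u212a".lower()       # KELVIN SIGN

# 4. Normalization: compatibility forms that NFKC folds to ASCII brackets.
nfkc_small_lt = unicodedata.normalize("NFKC", "\ufe64")  # SMALL LESS-THAN SIGN
nfkc_full_gt = unicodedata.normalize("NFKC", "\uff1e")   # FULLWIDTH GREATER-THAN SIGN

# 5. Expansion: U+FDFA is one code point before NFKC and 18 after.
expanded = unicodedata.normalize("NFKC", "\ufdfa")
expansion_factor = len(expanded) // len("\ufdfa")
```

The practical takeaway is ordering: a filter that checks for "<" or "script" before casing or normalization runs is checking a different string from the one the interpreter eventually sees, and a buffer sized from the pre-normalization length may be too small afterwards.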
The intention of this presentation is to educate the audience on categorized security issues around Unicode and internationalized software in a clear and structured way, while giving real-world test cases, inputs, and practices for finding and avoiding vulnerabilities. I'll also cover the visual security issues relating to script spoofing and the 'confusables'. Internationalized Domain Names have been with us since 2003, yet they remain poorly understood in the security community. Internationalized top-level domains are on the way, as are internationalized email addresses. I'll demonstrate how I can fool end users with lookalikes and homograph attacks in modern browsers using common .COM and .ORG domains.
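To make the homograph idea concrete, here is a minimal sketch of a mixed-script check on a hostname. It is not the spoof-detection component mentioned above. The function names scripts_in_label and looks_mixed_script are hypothetical, and guessing a script from the first word of a character's Unicode name is a rough heuristic chosen for brevity, not a substitute for real script property data. The example hostname swaps the Latin "e" for CYRILLIC SMALL LETTER IE (U+0435).

```python
import unicodedata

def scripts_in_label(label: str) -> set:
    """Rough script guess: first word of each letter's Unicode character name."""
    scripts = set()
    for ch in label:
        if ch.isalpha():
            scripts.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return scripts

def looks_mixed_script(hostname: str) -> bool:
    """Flag a hostname if any single label mixes letters from several scripts."""
    return any(len(scripts_in_label(label)) > 1 for label in hostname.split("."))

genuine = "example.com"
spoofed = "\u0435xample.com"  # first letter is Cyrillic, the rest is Latin

# Python's built-in "idna" codec produces the ASCII form sent over the wire.
spoofed_ascii = spoofed.encode("idna").decode("ascii")

genuine_flagged = looks_mixed_script(genuine)
spoofed_flagged = looks_mixed_script(spoofed)
```

The two hostnames render almost identically in many fonts, yet the ASCII form of the spoofed one begins with the "xn--" prefix. A check like this only catches labels that mix scripts; a label written entirely in one lookalike script passes it, which is part of why visual spoofing is still a live problem.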
Unicode is a universal character encoding providing the basis for processing, storage, and interchange of text data in any language in all modern software. Unicode replaces the myriad of historical character sets and encodings which have proven cumbersome and difficult for interoperability. With Unicode we get a single unified model for representing characters in almost any language past, present, and even future.