[personal profile] mdlbear

This post in Krebs on Security describes an unusual and potentially very dangerous attack technique that can be used to sneak evil code past code reviews and into the supply chain. Briefly, it allows evildoers to write code that looks one way to a human reviewer and quite another way to a compiler. It should probably come as no surprise that it involves Unicode, the same character encoding standard that lets you make blog posts that include inline emoji, or mix text in English and Arabic.

In particular, it's the latter ability that the vulnerability targets, specifically Unicode's "Bidi" algorithm for presenting a mix of left-to-right and right-to-left text. (Read the Bidi article for details and examples -- I'm not going to try plopping random text in languages I don't know into the middle of a blog post.)

Now go read the "Trojan Source Attacks" website, and the associated paper [PDF] and GitHub repo. Observe, in particular, the Warning about bidirectional Unicode text that GitHub now attaches to files like this one in C++. Observe also that GitHub does not flag files that, for example, mix homoglyphs like "H" (the usual ASCII version) and "Н" (the similar-looking Cyrillic letter that sounds like "N"; how similar it looks depends on what font your browser is using). If you're unlucky, you might have clicked on a URL containing one or more of these, that took you someplace unexpected and almost certainly malicious.
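Since GitHub doesn't flag homoglyphs, here is a rough sketch of how a tool might catch the "H" versus "Н" case. It assumes that the first word of a character's Unicode name ("LATIN", "CYRILLIC", ...) is a good enough stand-in for its script, which is only a heuristic; the function names are made up for illustration, and real confusable detection is considerably more involved.

```python
import unicodedata


def scripts_in(word):
    """Collect a crude script label for every letter in the word.

    Assumption: the first word of the Unicode character name
    (e.g. "LATIN", "CYRILLIC", "GREEK") approximates the script.
    """
    scripts = set()
    for ch in word:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name:
                scripts.add(name.split()[0])
    return scripts


def is_mixed_script(word):
    """True if the letters of the word come from more than one script."""
    return len(scripts_in(word)) > 1


# "\u041d" is CYRILLIC CAPITAL LETTER EN, which looks like a Latin "H".
spoofed = "\u041dello"
genuine = "Hello"
print(is_mixed_script(spoofed), is_mixed_script(genuine))
```

A check like this would flag an identifier or hostname that mixes scripts, but it says nothing about a word written entirely in look-alike letters from a single script.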

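The Trojan Source attack works by making use of Unicode's invisible Bidi control characters, such as U+202B RIGHT-TO-LEFT EMBEDDING (RLE) and U+202A LEFT-TO-RIGHT EMBEDDING (LRE), which explicitly change the direction in which the text that follows them is displayed. The related override characters (U+202D and U+202E) and isolate characters (U+2066 through U+2069) can be abused the same way. The compiler processes the characters in the order they are stored in the file; the reviewer sees them in the reordered display order.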
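A scanner for these characters is easy to sketch. The version below is a minimal illustration, not what GitHub actually runs; the function name is hypothetical, and the set of code points is my assumption about what is worth flagging (the embedding, override, and isolate controls plus their terminators).

```python
import unicodedata

# Bidi control characters, written as escapes so this file stays clean.
BIDI_CONTROLS = {
    "\u202a",  # LEFT-TO-RIGHT EMBEDDING
    "\u202b",  # RIGHT-TO-LEFT EMBEDDING
    "\u202c",  # POP DIRECTIONAL FORMATTING
    "\u202d",  # LEFT-TO-RIGHT OVERRIDE
    "\u202e",  # RIGHT-TO-LEFT OVERRIDE
    "\u2066",  # LEFT-TO-RIGHT ISOLATE
    "\u2067",  # RIGHT-TO-LEFT ISOLATE
    "\u2068",  # FIRST STRONG ISOLATE
    "\u2069",  # POP DIRECTIONAL ISOLATE
}


def find_bidi_controls(source):
    """Return (line, column, code point, name) for each Bidi control found."""
    hits = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ch in BIDI_CONTROLS:
                hits.append((lineno, col, f"U+{ord(ch):04X}", unicodedata.name(ch)))
    return hits


# A made-up snippet in the style of the attack: the controls hide inside a string.
sample = 'x = 1\nif access != "user\u202e \u2066// check\u2069 \u2066":\n'
for hit in find_bidi_controls(sample):
    print(hit)
```

Flagging every occurrence is deliberately blunt: legitimate right-to-left text in comments or strings would also trip it, so a real tool would want an allow-list or a human in the loop.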

And remember: ШYSINAШYG - What You See Is Not Always What You've Got!

Another fine post from The Computer Curmudgeon (also at computer-curmudgeon.com).

Date: 2021-11-05 05:20 am (UTC)
From: [personal profile] andyheninger
Yeah, there are all sorts of crazy spoofing things you can do with Unicode. Or even with plain ASCII, substituting 0 and O, or 1, l and I. Or rn for m, which in many fonts, especially at small sizes, are really hard to tell apart.

Unicode abuse isn't new; see https://unicode.org/reports/tr36/

Abusing bidi controls probably isn't any worse than the others, and is likely easier for tooling to warn about.

One of the more clever attacks I saw involved malformed UTF-8 data entered into form fields that were then injected into SQL. In UTF-8, a code point can take one to four bytes to represent, with the lead byte showing how many trail bytes follow. The hack exploited flawed error recovery: enter the lead byte for a four-byte sequence, followed by a single trail byte, put that into a form, and watch the UTF-8 decoder's error recovery silently consume two more bytes, including the closing quote put around the user data by the application.

This one was a serious problem at one time; it's been addressed by standardizing how malformed sequences should be interpreted.
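To make the byte arithmetic concrete, here is a small sketch using Python 3's built-in decoder. The form input and the query fragment are invented for illustration, and expected_length and decode_strict are hypothetical helper names; the point is only that a conforming decoder replaces the truncated sequence and leaves the following quote alone.

```python
def expected_length(lead):
    """Return the sequence length announced by a UTF-8 lead byte, or None."""
    if lead <= 0x7F:
        return 1
    if 0xC2 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF4:
        return 4
    return None  # a trail byte (0x80-0xBF) or a byte that never starts a sequence


def decode_strict(data):
    """Decode as UTF-8, returning None instead of guessing on malformed input."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        return None


# Hypothetical form input: a four-byte lead (0xF0) and a single trail byte (0x90).
user_input = b"\xf0\x90"

# The application wraps the input in quotes before building its query.
quoted = b"'" + user_input + b"' AND active = 1"

# Lenient decoding substitutes U+FFFD for the truncated sequence only.
decoded = quoted.decode("utf-8", errors="replace")

print(expected_length(user_input[0]))  # the lead byte promises four bytes
print(decode_strict(quoted))           # strict decoding rejects the input
print(decoded.endswith("' AND active = 1"))  # the closing quote survives
```

Rejecting malformed input outright, as decode_strict does, is the more conservative choice; replacement is only safe because the decoder stops at the first byte that cannot continue the sequence.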

