Unicode Tomfoolery

Wednesday, Aug 11, 2021
#gemini #tool

WARNING: Using this technique for anything other than small bits of ascii art is a terrible idea. See my note at the bottom of the page. Basically, screenreaders read these symbols as if they’re mathematical symbols rather than “styles”. I’m gonna leave this page up since it was some interesting research, but heed this warning.


I woke up this morning with a really dumb idea. Ya know those weird hacky unicode “fonts”? 𝕋𝕙𝕚𝕤 𝕤𝕠𝕣𝕥𝕒 𝕥𝕙𝕚𝕟𝕘. I was wondering if there’s one of those that makes text look like s̶t̶r̶i̶k̶e̶t̶h̶r̶o̶u̶g̶h̶ or u̲n̲d̲e̲r̲l̲i̲n̲e̲. I found this site called yaytext. It’s a jank early-2000s site, but WOW they have a lot of weird unicode “styles”.

After a bit of research, I discovered there are two broad methods for creating these styles: diacritics and character remaps.

Diacritics

Diacritics are marks placed above, below, or inside a character to indicate a certain pronunciation. You probably use them all the time: resumé, naïve, crème, etc. Unicode is a little complicated, but you can think about text at 3 basic levels of abstraction: bytes, code points, and characters.

Bytes

ASCII characters are 1 byte, but most code-points are represented with several bytes in UTF-8:

Hello, 世界

'H' = 72
'e' = 101
'l' = 108
'l' = 108
'o' = 111
',' = 44
' ' = 32
'世' = 19990 = E4, B8, 96
'界' = 30028 = E7, 95, 8C
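Here is a rough Go sketch that reproduces the listing above. The function name `describe` is just something I made up for illustration; the only real machinery is that ranging over a Go string yields code points (runes), and converting a rune back to a string gives its UTF-8 bytes.

```go
package main

import "fmt"

// describe returns one line per code point in s: the character itself,
// its code point value in decimal, and the UTF-8 bytes that encode it.
// (describe is a hypothetical helper name, not part of any library.)
func describe(s string) []string {
	var lines []string
	for _, r := range s { // ranging over a string yields runes, not bytes
		lines = append(lines, fmt.Sprintf("%q = %d = % X", r, r, []byte(string(r))))
	}
	return lines
}

func main() {
	for _, line := range describe("Hello, 世界") {
		fmt.Println(line)
	}
}
```

The ASCII lines show a single byte each, while the last two lines show three bytes each, matching the E4, B8, 96 and E7, 95, 8C above.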

Code Points

A code point is a single Unicode value: a number between 0 and 0x10FFFF, which fits easily in 32 bits. Unicode was created to collect characters, accents, symbols, control codes, emojis, and so on from all the world’s languages and writing systems and assign each one a number. The simplest encoding of Unicode is UTF-32 (also known as UCS-4), which simply uses a full 32-bit number for every code-point. However, most common code-points are smaller than 65,536 (so they fit in 2 bytes), and code-points requiring more than 3 bytes are seldom used. So a variable-length encoding is more desirable than a fixed one like UTF-32.

UTF-8 is one of the many great innovations to come out of Plan 9. It represents every Unicode code-point with 1-4 bytes. The high-order bits of the first byte indicate how many bytes will follow. It’s a pretty clever, but not too clever, system. These days Unicode and UTF-8 have become the de facto standard.
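To make the "high-order bits" part concrete, here is a hand-rolled encoder sketch in Go. The name `encode` is mine, and it deliberately skips validation (surrogates, out-of-range values), so treat it as an illustration of the bit layout rather than a replacement for the standard library's `unicode/utf8` package.

```go
package main

import "fmt"

// encode is an illustrative UTF-8 encoder (hypothetical helper, no validation).
// The first byte's leading bits say how long the sequence is; every
// continuation byte starts with the bits 10 and carries 6 bits of payload.
func encode(r rune) []byte {
	switch {
	case r < 0x80: // 0xxxxxxx
		return []byte{byte(r)}
	case r < 0x800: // 110xxxxx 10xxxxxx
		return []byte{0xC0 | byte(r>>6), 0x80 | byte(r)&0x3F}
	case r < 0x10000: // 1110xxxx 10xxxxxx 10xxxxxx
		return []byte{0xE0 | byte(r>>12), 0x80 | byte(r>>6)&0x3F, 0x80 | byte(r)&0x3F}
	default: // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
		return []byte{0xF0 | byte(r>>18), 0x80 | byte(r>>12)&0x3F, 0x80 | byte(r>>6)&0x3F, 0x80 | byte(r)&0x3F}
	}
}

func main() {
	// One code point from each length class: 1, 2, 3, and 4 bytes.
	for _, r := range "H\u00e0\u4e16\U0001D54B" {
		fmt.Printf("U+%04X -> % X\n", r, encode(r))
	}
}
```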

Characters

Unfortunately, code-points aren’t quite the end of the story. Take the character à for example. It can be represented directly as the code-point U+00E0 or by “combining” the code-point ‘a’ U+0061 followed by a grave accent ‘◌̀’ U+0300. These combining code-points are called diacritics and there’s technically no limit to the number you can use. Ever seen t̰͐́͒e̅͑̎͒x̱̮̭͟t̟͖̠̩ ͓̂ͪ̅l̮̠ͭ̆i̙͉͐ͭk̺͐͂̚e̴ͦ͛͋ ̱̽̀̒t̛̋̓͗hͫͨ̓͠ỉ̢̗̫ş̹͈̓ before? (Depending on your font and font renderer, that may or may not display properly.) Each “starting” code-point in that text was followed by 4 random diacritical code points. This technique is called Zalgo.
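A minimal Go sketch of the Zalgo idea, assuming the input contains only "starting" code-points and that the random marks come from the Combining Diacritical Marks block (U+0300 to U+036F). The `zalgo` name and its `seed` parameter are my own choices for illustration, not how any particular tool does it.

```go
package main

import (
	"fmt"
	"math/rand"
	"strings"
)

// zalgo follows every code point in s with n random combining marks drawn
// from U+0300..U+036F. The seed makes the output repeatable.
// (Hypothetical helper; assumes s has no combining marks of its own.)
func zalgo(s string, n int, seed int64) string {
	rng := rand.New(rand.NewSource(seed))
	var b strings.Builder
	for _, r := range s {
		b.WriteRune(r)
		for i := 0; i < n; i++ {
			b.WriteRune(rune(0x0300 + rng.Intn(0x70))) // 0x70 marks in the block
		}
	}
	return b.String()
}

func main() {
	fmt.Println(zalgo("text like this", 4, 1))
}
```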

A character is a bit hard to define, but usually it goes like this. You can separate code-points into two groups, starting and non-starting, and consider a “character” to be a starting code-point followed by 0 or more non-starting code-points. Different code point sequences can be considered technically equal, like the above example of U+00E0 vs U+0061,U+0300, or the same diacriticals applied in a different order.
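Here is a simplified Go sketch of that grouping. I am assuming "non-starting" means the Unicode categories Mn and Me (nonspacing and enclosing marks); the real grapheme cluster rules are more involved than this, and `characters` is a made-up name.

```go
package main

import (
	"fmt"
	"unicode"
)

// characters groups s into "characters": a starting code point followed by
// zero or more non-starting code points. Simplifying assumption: a code point
// is non-starting if it is in category Mn or Me.
func characters(s string) []string {
	var out []string
	for _, r := range s {
		if len(out) > 0 && unicode.In(r, unicode.Mn, unicode.Me) {
			out[len(out)-1] += string(r) // attach the mark to the previous character
		} else {
			out = append(out, string(r))
		}
	}
	return out
}

func main() {
	precomposed := "\u00e0" // à as one code point
	combined := "a\u0300"   // a followed by a combining grave accent
	fmt.Println(len(characters(precomposed)), len(characters(combined)))
	fmt.Println(precomposed == combined) // plain string comparison is byte-wise
}
```

Both forms count as one character here, but Go still sees two different strings. Treating them as equal needs Unicode normalization, which lives outside the standard library (for example in golang.org/x/text/unicode/norm).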

Underlines and Strikethrough

That yaytext website uses the diacritics U+0336 and U+0332 to create the strikethrough and underline text respectively. Trying to paste the text directly in your terminal will probably show you just that.
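The diacritic approach is simple enough to sketch in a few lines of Go. This is my guess at the general shape, not yaytext's actual code: `decorate` is a hypothetical helper that appends one combining mark after every code point, and skipping newlines is my own choice.

```go
package main

import (
	"fmt"
	"strings"
)

const (
	strikeMark    = '\u0336' // COMBINING LONG STROKE OVERLAY
	underlineMark = '\u0332' // COMBINING LOW LINE
)

// decorate appends mark after every code point in s except newlines.
// (Hypothetical helper for illustration.)
func decorate(s string, mark rune) string {
	var b strings.Builder
	for _, r := range s {
		b.WriteRune(r)
		if r != '\n' {
			b.WriteRune(mark)
		}
	}
	return b.String()
}

func main() {
	fmt.Println(decorate("strikethrough", strikeMark))
	fmt.Println(decorate("underline", underlineMark))
}
```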

[picture of my terminal when pasting strikethrough text]

Remaps

The next technique for creating styled text is to simply remap certain code-points to other, more obscure code-points. Unicode has a massive number of symbols including these 𝑤𝑒𝑖𝑟𝑑 𝑖𝑡𝑎𝑙𝑖𝑐 𝑙𝑜𝑜𝑘𝑖𝑛𝑔 𝑜𝑛𝑒𝑠 (that isn’t markup or anything, try copying and pasting them) which were originally added for math variables. There are actually loads of these character sets.
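A remap boils down to adding an offset to each letter. Below is a Go sketch for the math italic set only, assuming the usual layout of the Mathematical Alphanumeric Symbols block (capital A at U+1D434, small a at U+1D44E). The one wrinkle is italic h, which lives at U+210E instead of its expected slot. The `italic` name is mine.

```go
package main

import (
	"fmt"
	"strings"
)

// italic remaps ASCII letters to MATHEMATICAL ITALIC letters and leaves
// everything else alone. (Hypothetical helper for illustration.)
func italic(s string) string {
	return strings.Map(func(r rune) rune {
		switch {
		case r == 'h':
			return 0x210E // the slot at U+1D455 is reserved; italic h is PLANCK CONSTANT
		case r >= 'A' && r <= 'Z':
			return 0x1D434 + (r - 'A')
		case r >= 'a' && r <= 'z':
			return 0x1D44E + (r - 'a')
		}
		return r
	}, s)
}

func main() {
	fmt.Println(italic("weird italic looking ones"))
}
```

The other styles would presumably work the same way with different base offsets and their own handful of exceptions.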

While messing with this today I wrote a little Go module and CLI tool which I appropriately named fuckery. The CLI tool reads from STDIN and writes to STDOUT with a selected “style”. I added those two diacritical modes, the remaps above, and even a configurable Zalgo style. Naturally, I went into my gemgen project from last week, which generates gemtext from markdown, and hooked up my fuckery library behind the -E option. I mostly did this as a joke. It would be a terrible idea to use this in practice, especially until screenreaders are capable of dealing with this. On my machine, terminal browsers such as amfora or GUI browsers like lagrange seem to render the bold and italic “fonts” perfectly. Strikethrough and underline work in lagrange, but not my terminal. Double, cursive, and fraktur work in my terminal, but not lagrange ¯\_(ツ)_/¯


Update: Using these hacky unicode “fonts” is terrible for accessibility. I have been seeing these fonts and styling on lots of gemini capsules and it would be harmful for them to be used more often. Ultimately it’s a common and old technique and literally built into the unicode standard. Long term, I think the ideal solution is for gemini screenreaders to be aware of this and convert the text back to ascii if it falls outside of a codeblock. In the meantime I encourage other gemini authors to NOT USE THESE techniques in their posts other than in ascii art codeblocks which add visuals (and thus already must be ignored by screenreaders; speaking of which, make sure to tag ascii art as ascii art).
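For what it’s worth, the "convert back to ascii" direction is not hard to sketch. This Go snippet only undoes the underline/strikethrough marks and the math italic remap; it is an assumption about what such a filter could look like, not something any screenreader is known to do, and the `plain` name is made up. A fuller version would need a table per style.

```go
package main

import (
	"fmt"
	"strings"
)

// plain strips the underline and strikethrough combining marks and maps
// MATHEMATICAL ITALIC letters back to ASCII. (Hypothetical sketch; mapping
// U+210E back to 'h' is lossy, since it is also the real Planck constant.)
func plain(s string) string {
	return strings.Map(func(r rune) rune {
		switch {
		case r == 0x0332 || r == 0x0336:
			return -1 // a negative return tells strings.Map to drop the rune
		case r == 0x210E:
			return 'h'
		case r >= 0x1D434 && r <= 0x1D44D:
			return 'A' + (r - 0x1D434)
		case r >= 0x1D44E && r <= 0x1D467:
			return 'a' + (r - 0x1D44E)
		}
		return r
	}, s)
}

func main() {
	fmt.Println(plain("s\u0336t\u0336r\u0336u\u0336c\u0336k\u0336"))
	fmt.Println(plain("\U0001D464\U0001D452\U0001D456\U0001D45F\U0001D451"))
}
```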

rosenzweig’s post
ew0k’s post