WARNING: Using this technique for anything other than small bits of ASCII art is a terrible idea. See my note at the bottom of the page. Basically, screenreaders read these symbols as if they're mathematical symbols rather than "styles". I'm gonna leave this page up since it was some interesting research, but heed this warning.
I woke up this morning with a really dumb idea. Ya know those weird hacky unicode "fonts"? ๐๐๐๐ค ๐ค๐ ๐ฃ๐ฅ๐ ๐ฅ๐๐๐๐. I was wondering if there's one of those that makes it look like s̶t̶r̶i̶k̶e̶t̶h̶r̶o̶u̶g̶h̶ or u̲n̲d̲e̲r̲l̲i̲n̲e̲. I found this site called yaytext, it's a jank early 2000's site, but WOW they have a lot of weird unicode "styles".
After a bit of research, I discovered there are two broad methods for creating these styles: diacritics and character remaps.
Diacritics
Diacritics are marks placed above, below, or inside a character to indicate a certain pronunciation. You probably use them all the time: resumé, naïve, crème, etc. Unicode is a little complicated, but you can think about text at 3 basic levels of abstraction.
Bytes
ASCII characters are 1 byte, but most code-points are represented with several bytes in UTF-8:
Hello, 世界
'H' = 72
'e' = 101
'l' = 108
'l' = 108
'o' = 111
',' = 44
' ' = 32
'世' = 19990 = E4, B8, 96
'界' = 30028 = E7, 95, 8C
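A table like the one above can be reproduced with a few lines of Go. This is only a sketch: the helper name utf8Hex is made up for illustration. Ranging over a string yields code points, while indexing it yields raw bytes.

```go
package main

import "fmt"

// utf8Hex (hypothetical helper) returns the UTF-8 bytes of s as
// upper-case hex, separated by ", ".
func utf8Hex(s string) string {
	out := ""
	for i := 0; i < len(s); i++ { // indexing a string yields raw bytes
		if i > 0 {
			out += ", "
		}
		out += fmt.Sprintf("%02X", s[i])
	}
	return out
}

func main() {
	// Ranging over a string decodes it one code point (rune) at a time.
	for _, r := range "Hello, 世界" {
		fmt.Printf("%q = %d = %s\n", r, r, utf8Hex(string(r)))
	}
}
```

ASCII code points come out as a single hex byte, everything else as two to four.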
Code Points
A code point is a single Unicode value: a number from 0 to 0x10FFFF. Unicode was created to collect characters, accents, symbols, control codes, emojis, and so on from all the world's languages and writing systems and assign each one a number. The simplest encoding of Unicode is UTF-32 (or UCS-4), which simply uses a full 32-bit number for every code-point. However, most common code-points are smaller than 65,536 (so they fit in 2 bytes), and no code-point needs more than 3 bytes. So a variable-length encoding is more desirable than a fixed one like UTF-32.
UTF-8 is one of the many great innovations to come out of Plan 9. It represents every Unicode code-point with 1-4 bytes. The high-order bits of the first byte indicate how many bytes will follow. It's a pretty clever, but not too clever, system. These days Unicode and UTF-8 have become the de facto standard.
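To make the "high-order bits" part concrete, here is a rough sketch of reading the sequence length off the first byte. The function name seqLen is an assumption for illustration, and it does no real validation (Go's unicode/utf8 package does that properly).

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// seqLen (hypothetical helper) reports how many bytes a UTF-8 sequence
// occupies, judging only by the high-order bits of its first byte.
// It returns 0 for a byte that does not match any lead-byte pattern,
// such as a continuation byte (10xxxxxx). It does not reject overlong
// or out-of-range sequences.
func seqLen(first byte) int {
	switch {
	case first&0x80 == 0x00: // 0xxxxxxx: plain ASCII
		return 1
	case first&0xE0 == 0xC0: // 110xxxxx: one byte follows
		return 2
	case first&0xF0 == 0xE0: // 1110xxxx: two bytes follow
		return 3
	case first&0xF8 == 0xF0: // 11110xxx: three bytes follow
		return 4
	}
	return 0
}

func main() {
	// Compare the guess against the standard library for a few samples.
	for _, s := range []string{"H", "\u00e9", "世", "𝕤"} {
		fmt.Println(s, seqLen(s[0]), utf8.RuneLen([]rune(s)[0]))
	}
}
```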
Characters
Unfortunately, code-points aren't quite the end of the story. Take the character à for example. It can be represented directly as the code-point U+00E0 or by "combining" the code-point 'a' U+0061 followed by a grave accent '◌̀' U+0300. These combining code-points are called diacritics and there's technically no limit to the number you can use. Ever seen tอฬอฬฐeฬ อฬอxอฬฑฬฎฬญtฬอฬ ฬฉ ฬออชฬ lอญฬฬฎฬ iออญฬอkฬอฬบอeอฆออฬด ฬฝอฬฑฬtฬฬอฬhอซอจอ ฬiฬขฬฬฬซsฬนอฬงอ before? (Depending on your font and font renderer that may or may not display properly). Each "starting" code-point in that text was followed by 4 random diacritical code points. This technique is called Zalgo.
A character is a bit hard to define, but usually it goes like this. You can separate code-points into two groups, starting and non-starting, and consider a "character" to be a starting code-point followed by 0 or more non-starting code-points. You can have differing code point values that are considered technically equal, like the above example of U+00E0 vs U+0061,U+0300, or by having multiple diacriticals in differing orders.
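That "starting plus non-starting" definition can be sketched with Go's unicode package by treating anything in the Mark category as non-starting. This is a simplification (real text segmentation has more rules than this), and the function name characters is made up for illustration.

```go
package main

import (
	"fmt"
	"unicode"
)

// characters (hypothetical helper) splits s into groups made of one
// starting code point followed by any combining marks (Unicode category M)
// attached to it.
func characters(s string) []string {
	var out []string
	for _, r := range s {
		if unicode.Is(unicode.M, r) && len(out) > 0 {
			out[len(out)-1] += string(r) // non-starting: glue onto the previous group
			continue
		}
		out = append(out, string(r))
	}
	return out
}

func main() {
	precomposed := "\u00e0" // à as one code point
	combined := "a\u0300"   // 'a' followed by a combining grave accent

	// Both spellings count as one "character"...
	fmt.Println(len(characters(precomposed)), len(characters(combined)))
	// ...but a plain comparison still sees different code points.
	fmt.Println(precomposed == combined)
}
```

Treating the two spellings as equal takes Unicode normalization, which Go's standard library doesn't ship.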
Underlines and Strikethrough
That yaytext website uses the diacriticals U+0336 and U+0332 to create the strikethrough and underline text respectively. Trying to paste the text directly in your terminal will probably show you just that.
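Doing the same thing yourself just means writing the combining mark after every code point. A minimal sketch, with made-up names (decorate, strikeMark, underlineMark); I'm not claiming this is how yaytext implements it:

```go
package main

import (
	"fmt"
	"strings"
)

const (
	strikeMark    = '\u0336' // COMBINING LONG STROKE OVERLAY
	underlineMark = '\u0332' // COMBINING LOW LINE
)

// decorate (hypothetical helper) writes mark after every code point in s.
func decorate(s string, mark rune) string {
	var b strings.Builder
	for _, r := range s {
		b.WriteRune(r)
		b.WriteRune(mark)
	}
	return b.String()
}

func main() {
	fmt.Println(decorate("strikethrough", strikeMark))
	fmt.Println(decorate("underline", underlineMark))
}
```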
Remaps
The next technique for creating styled text is to simply remap certain code-points to other, more obscure code-points. Unicode has a massive number of symbols, including these 𝑤𝑒𝑖𝑟𝑑 𝑖𝑡𝑎𝑙𝑖𝑐 𝑙𝑜𝑜𝑘𝑖𝑛𝑔 𝑜𝑛𝑒𝑠 (that isn't markup or anything, try copying and pasting them), which were originally added for math variables. There are actually loads of these character sets:
- 𝗕𝗼𝗹𝗱𝗦𝗮𝗻𝘀
- 𝐁𝐨𝐥𝐝𝐒𝐞𝐫𝐢𝐟
- 𝘐𝘵𝘢𝘭𝘪𝘤𝘚𝘢𝘯𝘴
- 𝐼𝑡𝑎𝑙𝑖𝑐𝑆𝑒𝑟𝑖𝑓
- 𝘽𝙤𝙡𝙙𝙄𝙩𝙖𝙡𝙞𝙘𝙎𝙖𝙣𝙨
- 𝑩𝒐𝒍𝒅𝑰𝒕𝒂𝒍𝒊𝒄𝑺𝒆𝒓𝒊𝒇
- 𝔻𝕠𝕦𝕓𝕝𝕖
- 𝒞𝓊𝓇𝓈𝒾𝓋ℯ
- 𝔉𝔯𝔞𝔨𝔱𝔲𝔯
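For a set like the bold serif one, the remap is plain arithmetic because the letters and digits sit in contiguous runs (capitals from U+1D400, lowercase from U+1D41A, digits from U+1D7CE). Here is a sketch with a made-up function name, bold. Not every set is this tidy: some have holes where the letter already existed elsewhere in Unicode, like the italic h at U+210E, so those need a lookup table instead.

```go
package main

import (
	"fmt"
	"strings"
)

// bold (hypothetical helper) remaps ASCII letters and digits onto the
// Mathematical Bold code points; everything else passes through unchanged.
func bold(s string) string {
	return strings.Map(func(r rune) rune {
		switch {
		case r >= 'A' && r <= 'Z':
			return 0x1D400 + (r - 'A')
		case r >= 'a' && r <= 'z':
			return 0x1D41A + (r - 'a')
		case r >= '0' && r <= '9':
			return 0x1D7CE + (r - '0')
		}
		return r
	}, s)
}

func main() {
	fmt.Println(bold("BoldSerif 123"))
}
```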
While messing with this today I wrote a little Go module and cli tool which I aptly named fuckery. The cli tool reads from STDIN and writes to STDOUT with a selected "style". I added those two diacritical modes, the remaps above, and even a configurable Zalgo style.
Naturally, I went into my gemgen project from last week, which generates gemtext from markdown... and I implemented my fuckery library with the -E option. I mostly did this as a joke. It would be a terrible idea to use this in practice, especially until screenreaders are capable of dealing with this. On my machine, terminal browsers such as amfora or gui browsers like lagrange seem to render the bold and italic "fonts" perfectly. Strikethrough and underline work in lagrange, but not my terminal. Double, cursive, and fraktur work in my terminal, but not lagrange ¯\_(ツ)_/¯
update: Using these hacky unicode "fonts" is terrible for accessibility. I have been seeing these fonts and styling on lots of gemini capsules and it would be harmful for them to be used more often. Ultimately it's a common and old technique and literally built into the unicode standard -- long term I think the ideal solution is for gemini screenreaders to be aware of this and convert the text back to ascii if it falls outside of a codeblock. In the meantime I encourage other gemini authors to NOT USE THESE techniques in their posts other than in ascii art codeblocks which add visuals (and thus already must be ignored by screenreaders -- speaking of which make sure to tag ascii art as ascii art).
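To give a feel for what "convert the text back to ascii" could look like, here is a rough sketch that undoes just two of the tricks from this page: the strikethrough/underline marks and the bold serif remap. The function name plain is made up, and this is not taken from any existing screenreader. A real tool would more likely lean on Unicode compatibility normalization (NFKC), which folds the math alphabets back to ordinary letters.

```go
package main

import (
	"fmt"
	"strings"
)

// plain (hypothetical helper) drops the strikethrough and underline marks
// (U+0336, U+0332) and maps Mathematical Bold letters back to ASCII.
// Everything else passes through unchanged.
func plain(s string) string {
	return strings.Map(func(r rune) rune {
		switch {
		case r == 0x0336 || r == 0x0332:
			return -1 // a negative value tells strings.Map to drop the rune
		case r >= 0x1D400 && r <= 0x1D419: // bold capitals
			return 'A' + (r - 0x1D400)
		case r >= 0x1D41A && r <= 0x1D433: // bold lowercase
			return 'a' + (r - 0x1D41A)
		}
		return r
	}, s)
}

func main() {
	fmt.Println(plain("𝐁𝐨𝐥𝐝 and s\u0336t\u0336r\u0336u\u0336c\u0336k\u0336"))
}
```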