Unicode Identifiers in Python 3

This post was inspired by an off-topic email chain on the python-ideas mailing list involving Steven D'Aprano, Chris Angelico, and me.

Back in the dark ages (up to and including Python 2.7, that is), variable names in Python and many other programming languages could include only ASCII letters, underscores, and, after the first character, ASCII digits. Or, to put it as a regular expression: /^[A-Za-z_][A-Za-z0-9_]*$/.
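
As a small sketch of that old rule, the regular expression can be tried out with the standard re module. The helper name is_ascii_identifier is made up for this illustration, and the check deliberately ignores reserved words such as "for" or "class".

```python
import re

# The ASCII-only rule: a letter or underscore, then letters, digits, underscores.
ASCII_IDENTIFIER = re.compile(r'[A-Za-z_][A-Za-z0-9_]*')

def is_ascii_identifier(name):
    """Hypothetical helper: True if name matches the old ASCII-only rule."""
    # fullmatch anchors at both ends, so no ^ or $ is needed
    # (and a trailing newline is not silently accepted).
    return ASCII_IDENTIFIER.fullmatch(name) is not None

print(is_ascii_identifier('spam_2'))         # True
print(is_ascii_identifier('2spam'))          # False
print(is_ascii_identifier('g\u00e4nsebl\u00fcmchen'))  # False: ä and ü are not ASCII
```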

Now, most humans don't lead very ASCII lives (ASCII is not even really suitable for writing English text). These days, with Unicode almost universally adopted, this old requirement looks a bit daft, and thankfully, Python 3 introduced full Unicode support for source code, including Unicode in identifier names. Nowadays, this is perfectly valid Python:

>>> gänseblümchen = '🌼'
>>> print(gänseblümchen)
🌼

So what exactly is allowed in identifier names? The naïve but widespread assumption would be that you can use any Unicode letter, an underscore, and, after the first character, any Unicode number. This is why Steven D'Aprano was so surprised (as was I!) when he discovered that ‘℘’ is a valid Python identifier even though it is not a letter but a mathematical symbol!

>>> import unicodedata
>>> ℘ = 1
>>> unicodedata.category('℘'), unicodedata.name('℘')
('Sm', 'SCRIPT CAPITAL P')

In actual fact, ‘℘’ is the only mathematical symbol that can be used this way in Python. What's going on here?

A close reading of the specification in PEP 3131 reveals that instead of simply allowing Unicode letters and numbers, Python uses NFKC normalization and refers to the character properties XID_Start and XID_Continue as defined by the Unicode standard.
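
To see the NFKC part in action, here is a minimal sketch that executes a one-line program whose variable name starts with the ligature U+FB01 and then looks at the name that actually got bound. The source string and the namespace dict are only there for the demonstration.

```python
import unicodedata

# U+FB01 is LATIN SMALL LIGATURE FI; NFKC folds it to the two letters "fi".
ligature_name = '\ufb01le'
source = ligature_name + ' = 1'

namespace = {}
exec(source, namespace)

normalized = unicodedata.normalize('NFKC', ligature_name)
print(normalized)                  # file
print('file' in namespace)         # True: the name was normalized on the way in
print(ligature_name in namespace)  # False: the ligature spelling is not stored
```

In other words, two spellings that look different in the source file can end up naming the same variable.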

The standard defines XID_Start to include all letters (Lu, Ll, Lt, Lm, Lo), all letter-numbers (Nl), and everything with a mysterious property called Other_ID_Start, minus some things that are syntax-like or whitespace-like, plus/minus some technicalities. XID_Continue adds decimal digits (Nd), combining marks (Mn, Mc), connector punctuation (Pc), and some more technicalities.
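
The unicodedata module does not expose XID_Start or XID_Continue directly, but str.isidentifier() can be used to probe them one character at a time. The two helper names below are invented for this sketch. It assumes that prefixing an ordinary letter is a fair way to test the "continue" position.

```python
def can_start_identifier(ch):
    """Hypothetical helper: may ch be the first character of an identifier?"""
    return ch.isidentifier()

def can_continue_identifier(ch):
    """Hypothetical helper: may ch appear after the first character?"""
    # Put a known-good start character in front and test the pair.
    return ('a' + ch).isidentifier()

for ch in ['a', '_', '1', '\u0663', '\u2118', '-']:
    print('U+{:04X}  start={}  continue={}'.format(
        ord(ch), can_start_identifier(ch), can_continue_identifier(ch)))
```

U+0663 (ARABIC-INDIC DIGIT THREE) behaves like the ASCII digit: not allowed at the start, fine afterwards.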

So, with all said and done, what are the exceptions to the simplistic ‘identifiers can start with letters’ rule-of-thumb? Let's find out!

Python 3.7.0a0 (heads/master:21c2dd7, Jun  4 2017, 15:18:26) 
[GCC 5.4.0 20160609] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import unicodedata
>>> all_unicode = map(chr, range(0x110000))
>>> for c in all_unicode:
...     category = unicodedata.category(c)
...     if category.startswith('L') or category == 'Nl':
...         # Letters and letter-numbers should be OK
...         if not c.isidentifier():
...             print('NOT OK [{}] {} U+{:04X}  {}'.format(
...                   category, c, ord(c), unicodedata.name(c)))
...     else:
...         if c.isidentifier():
...             print('    OK [{}] {} U+{:04X}  {}'.format(
...                   category, c, ord(c), unicodedata.name(c)))
... 
    OK [Pc] _ U+005F  LOW LINE
NOT OK [Lm] ͺ U+037A  GREEK YPOGEGRAMMENI
NOT OK [Lo] ำ U+0E33  THAI CHARACTER SARA AM
NOT OK [Lo] ຳ U+0EB3  LAO VOWEL SIGN AM
    OK [Mn] ᢅ U+1885  MONGOLIAN LETTER ALI GALI BALUDA
    OK [Mn] ᢆ U+1886  MONGOLIAN LETTER ALI GALI THREE BALUDA
    OK [Sm] ℘ U+2118  SCRIPT CAPITAL P
    OK [So] ℮ U+212E  ESTIMATED SYMBOL
NOT OK [Lm] ⸯ U+2E2F  VERTICAL TILDE
NOT OK [Lo] ﱞ U+FC5E  ARABIC LIGATURE SHADDA WITH DAMMATAN ISOLATED FORM
NOT OK [Lo] ﱟ U+FC5F  ARABIC LIGATURE SHADDA WITH KASRATAN ISOLATED FORM
NOT OK [Lo] ﱠ U+FC60  ARABIC LIGATURE SHADDA WITH FATHA ISOLATED FORM
NOT OK [Lo] ﱡ U+FC61  ARABIC LIGATURE SHADDA WITH DAMMA ISOLATED FORM
NOT OK [Lo] ﱢ U+FC62  ARABIC LIGATURE SHADDA WITH KASRA ISOLATED FORM
NOT OK [Lo] ﱣ U+FC63  ARABIC LIGATURE SHADDA WITH SUPERSCRIPT ALEF ISOLATED FORM
NOT OK [Lo] ﷺ U+FDFA  ARABIC LIGATURE SALLALLAHOU ALAYHE WASALLAM
NOT OK [Lo] ﷻ U+FDFB  ARABIC LIGATURE JALLAJALALOUHOU
NOT OK [Lo] ﹰ U+FE70  ARABIC FATHATAN ISOLATED FORM
NOT OK [Lo] ﹲ U+FE72  ARABIC DAMMATAN ISOLATED FORM
NOT OK [Lo] ﹴ U+FE74  ARABIC KASRATAN ISOLATED FORM
NOT OK [Lo] ﹶ U+FE76  ARABIC FATHA ISOLATED FORM
NOT OK [Lo] ﹸ U+FE78  ARABIC DAMMA ISOLATED FORM
NOT OK [Lo] ﹺ U+FE7A  ARABIC KASRA ISOLATED FORM
NOT OK [Lo] ﹼ U+FE7C  ARABIC SHADDA ISOLATED FORM
NOT OK [Lo] ﹾ U+FE7E  ARABIC SUKUN ISOLATED FORM
NOT OK [Lm] ゙ U+FF9E  HALFWIDTH KATAKANA VOICED SOUND MARK
NOT OK [Lm] ゚ U+FF9F  HALFWIDTH KATAKANA SEMI-VOICED SOUND MARK

That looks like quite a lot of exceptions! Some of these, like ﷺ, are made invalid by NFKC normalization (which I mentioned above). We can take a closer look at how exactly this affects the forbidden identifiers:

>>> all_unicode = map(chr, range(0x110000))
>>> for c in all_unicode:
...     category = unicodedata.category(c)
...     if category.startswith('L') or category == 'Nl':
...         if not c.isidentifier():
...             normform = unicodedata.normalize('NFKC', c)
...             print('NOT OK [{}] {} U+{:04X} → "{}" starts with [{}]'
...                   .format(category, c, ord(c), normform,
...                           unicodedata.category(normform[0])))
... 
NOT OK [Lm] ͺ U+037A → " ͅ" starts with [Zs]
NOT OK [Lo] ำ U+0E33 → "ํา" starts with [Mn]
NOT OK [Lo] ຳ U+0EB3 → "ໍາ" starts with [Mn]
NOT OK [Lm] ⸯ U+2E2F → "ⸯ" starts with [Lm]
NOT OK [Lo] ﱞ U+FC5E → " ٌّ" starts with [Zs]
NOT OK [Lo] ﱟ U+FC5F → " ٍّ" starts with [Zs]
NOT OK [Lo] ﱠ U+FC60 → " َّ" starts with [Zs]
NOT OK [Lo] ﱡ U+FC61 → " ُّ" starts with [Zs]
NOT OK [Lo] ﱢ U+FC62 → " ِّ" starts with [Zs]
NOT OK [Lo] ﱣ U+FC63 → " ّٰ" starts with [Zs]
NOT OK [Lo] ﷺ U+FDFA → "صلى الله عليه وسلم" starts with [Lo]
NOT OK [Lo] ﷻ U+FDFB → "جل جلاله" starts with [Lo]
NOT OK [Lo] ﹰ U+FE70 → " ً" starts with [Zs]
NOT OK [Lo] ﹲ U+FE72 → " ٌ" starts with [Zs]
NOT OK [Lo] ﹴ U+FE74 → " ٍ" starts with [Zs]
NOT OK [Lo] ﹶ U+FE76 → " َ" starts with [Zs]
NOT OK [Lo] ﹸ U+FE78 → " ُ" starts with [Zs]
NOT OK [Lo] ﹺ U+FE7A → " ِ" starts with [Zs]
NOT OK [Lo] ﹼ U+FE7C → " ّ" starts with [Zs]
NOT OK [Lo] ﹾ U+FE7E → " ْ" starts with [Zs]
NOT OK [Lm] ゙ U+FF9E → "゙" starts with [Mn]
NOT OK [Lm] ゚ U+FF9F → "゚" starts with [Mn]

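The Arabic presentation forms (and the Greek ypogegrammeni) normalize to a space followed by combining marks, and the two long ligatures expand into whole phrases with spaces between the words. The Thai and Lao vowels and the halfwidth katakana sound marks are decomposed (due to the quirks of the ‘compatibility’ normalization NFKC, I assume) into sequences that start with a combining mark. Neither a space nor a combining mark can begin an identifier.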

The odd one out, here, is U+2E2F, the ‘VERTICAL TILDE’. This character, presumably by virtue of being a tilde, is actually excluded from XID_Start due to it having the property Pattern_Syntax.

Now let's take a closer look at our ‘false negatives’, the non-letter characters that are allowed at the start of an identifier:

>>> all_unicode = map(chr, range(0x110000))
>>> for c in all_unicode:
...     category = unicodedata.category(c)
...     if not category.startswith('L') and category != 'Nl':
...         if c.isidentifier():
...             print('[{}] {} U+{:04X} {}'.format(
...                   category, c, ord(c), unicodedata.name(c)))
... 
[Pc] _ U+005F LOW LINE
[Mn] ᢅ U+1885 MONGOLIAN LETTER ALI GALI BALUDA
[Mn] ᢆ U+1886 MONGOLIAN LETTER ALI GALI THREE BALUDA
[Sm] ℘ U+2118 SCRIPT CAPITAL P
[So] ℮ U+212E ESTIMATED SYMBOL

The low line, or underscore, has always been used in Python identifiers and is explicitly permitted, but not required, by the Unicode identifier standard. The others all have the Unicode property Other_ID_Start because they used to be considered letters, but no longer are as of Unicode 9.0.0. ᢅ U+1885 and ᢆ U+1886 were only changed to category Mn in Unicode 9.0.0; if you run the code above on a Python build that ships older Unicode data (see unicodedata.unidata_version), they won't appear! ℘ U+2118 and ℮ U+212E were considered alphabetic until Unicode 2.0.14, but have been symbols since Unicode 3.0.0. Curiously, similar mathematical script symbols like ℒ U+2112 and ℬ U+212C have always been considered letters, and under NFKC normalization they compatibility-transform to L and B respectively. I really don't know what to make of this.
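
The difference between the script letters and the two symbols is easy to check with a short sketch. The dictionary name nfkc is just a convenience for this example.

```python
import unicodedata

# SCRIPT CAPITAL L, SCRIPT CAPITAL B, SCRIPT CAPITAL P, ESTIMATED SYMBOL
chars = '\u2112\u212c\u2118\u212e'

# Map each character to its NFKC-normalized form.
nfkc = {ch: unicodedata.normalize('NFKC', ch) for ch in chars}

for ch, folded in nfkc.items():
    print('U+{:04X} [{}] {} -> {}'.format(
        ord(ch), unicodedata.category(ch), ch, folded))
```

The two script letters fold to plain L and B, while ℘ and ℮ come out of the normalization unchanged.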

This article is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.