This post was inspired by an off-topic email chain on the python-ideas mailing list involving Steven D’Aprano, Chris Angelico, and me.
Back in the dark ages, names for variables in Python (up to and including version 2.7, that is) and many other programming languages could include ASCII letters, underscores, and, after the first character, ASCII digits. Or, to put it as a regular expression: /^[A-Za-z_][A-Za-z0-9_]*$/.
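For illustration, here is a minimal sketch of that old rule using Python's re module. The helper name is_ascii_identifier is made up for this example, and keywords such as for are not excluded, so it only approximates what a real Python 2 tokenizer accepts.

```python
import re

# The classic ASCII-only rule: a letter or underscore, followed by any
# number of letters, digits, or underscores. Keywords are not excluded.
ASCII_IDENTIFIER = re.compile(r'[A-Za-z_][A-Za-z0-9_]*')

def is_ascii_identifier(name):
    """Return True if name matches the old ASCII-only identifier pattern."""
    # fullmatch() anchors at both ends, so no explicit ^ and $ are needed.
    return ASCII_IDENTIFIER.fullmatch(name) is not None

for name in ['spam_1', '_private', '1spam', 'café']:
    print(name, is_ascii_identifier(name))
```

The last name contains a non-ASCII letter, so the pattern rejects it even though Python 3 would happily accept it.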
Now, most humans don’t lead very ASCII lives (ASCII is not even really suitable for writing English text). These days, with Unicode being practically universally adopted, this old requirement looks a bit daft, and thankfully, Python 3 introduced full Unicode support for source code, including Unicode in identifier names. Nowadays, this is perfectly valid Python:
>>> gänseblümchen = '🌼'
>>> print(gänseblümchen)
🌼
So what exactly is allowed in identifier names? The naïve but widespread assumption would be that you can use any Unicode letter, an underscore, and, after the first character, any Unicode number. This is why Steven D’Aprano was so surprised (as was I!) when he discovered that ‘℘’ is a valid Python identifier even though it is not a letter, but a mathematical symbol!
>>> import unicodedata
>>> ℘ = 1
>>> unicodedata.category('℘'), unicodedata.name('℘')
('Sm', 'SCRIPT CAPITAL P')
In actual fact, ‘℘’ is the only mathematical symbol that can be used this way in Python. What’s going on here?
A close reading of the specification in PEP 3131 reveals that instead of simply allowing Unicode letters and numbers, Python uses NFKC normalization and refers to the character properties XID_Start and XID_Continue as defined by the Unicode standard.
The standard defines XID_Start to include all letters (Lu, Ll, Lt, Lm, Lo), all letter-numbers (Nl), and everything with a mysterious property called Other_ID_Start, minus some things that are syntax-like or whitespace-like, plus/minus some technicalities. XID_Continue adds numbers and some more technicalities.
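As far as I know, the unicodedata module does not expose XID_Start or XID_Continue directly, but str.isidentifier() can be used to probe them one character at a time. The sketch below assumes that isidentifier() follows the rule from PEP 3131 (first character XID_Start or an underscore, every later character XID_Continue); the helper names can_start and can_continue are made up for this illustration.

```python
def can_start(c):
    """True if the single character c may begin an identifier."""
    return c.isidentifier()

def can_continue(c):
    """True if c may appear after the first character of an identifier."""
    # Prefix a known-good start character, so that only the
    # 'continue' status of c decides the result.
    return ('a' + c).isidentifier()

for c in ['x', '_', '7', 'ä', '℘', '-']:
    print('U+{:04X} start={} continue={}'.format(
        ord(c), can_start(c), can_continue(c)))
```

A digit such as 7 is the textbook case of a character that may continue an identifier but not start one.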
So, with all said and done, what are the exceptions to the simplistic ‘identifiers can start with letters’ rule-of-thumb? Let’s find out!
Python 3.7.0a0 (heads/master:21c2dd7, Jun 4 2017, 15:18:26)
[GCC 5.4.0 20160609] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import unicodedata
>>> all_unicode = map(chr, range(0x110000))
>>> for c in all_unicode:
...     category = unicodedata.category(c)
...     if category.startswith('L') or category == 'Nl':
...         # Letters and letter-numbers should be OK
...         if not c.isidentifier():
...             print('NOT OK [{}] {} U+{:04X} {}'.format(
...                 category, c, ord(c), unicodedata.name(c)))
...     else:
...         if c.isidentifier():
...             print(' OK [{}] {} U+{:04X} {}'.format(
...                 category, c, ord(c), unicodedata.name(c)))
...
OK [Pc] _ U+005F LOW LINE
NOT OK [Lm] ͺ U+037A GREEK YPOGEGRAMMENI
NOT OK [Lo] ำ U+0E33 THAI CHARACTER SARA AM
NOT OK [Lo] ຳ U+0EB3 LAO VOWEL SIGN AM
OK [Mn] ᢅ U+1885 MONGOLIAN LETTER ALI GALI BALUDA
OK [Mn] ᢆ U+1886 MONGOLIAN LETTER ALI GALI THREE BALUDA
OK [Sm] ℘ U+2118 SCRIPT CAPITAL P
OK [So] ℮ U+212E ESTIMATED SYMBOL
NOT OK [Lm] ⸯ U+2E2F VERTICAL TILDE
NOT OK [Lo] ﱞ U+FC5E ARABIC LIGATURE SHADDA WITH DAMMATAN ISOLATED FORM
NOT OK [Lo] ﱟ U+FC5F ARABIC LIGATURE SHADDA WITH KASRATAN ISOLATED FORM
NOT OK [Lo] ﱠ U+FC60 ARABIC LIGATURE SHADDA WITH FATHA ISOLATED FORM
NOT OK [Lo] ﱡ U+FC61 ARABIC LIGATURE SHADDA WITH DAMMA ISOLATED FORM
NOT OK [Lo] ﱢ U+FC62 ARABIC LIGATURE SHADDA WITH KASRA ISOLATED FORM
NOT OK [Lo] ﱣ U+FC63 ARABIC LIGATURE SHADDA WITH SUPERSCRIPT ALEF ISOLATED FORM
NOT OK [Lo] ﷺ U+FDFA ARABIC LIGATURE SALLALLAHOU ALAYHE WASALLAM
NOT OK [Lo] ﷻ U+FDFB ARABIC LIGATURE JALLAJALALOUHOU
NOT OK [Lo] ﹰ U+FE70 ARABIC FATHATAN ISOLATED FORM
NOT OK [Lo] ﹲ U+FE72 ARABIC DAMMATAN ISOLATED FORM
NOT OK [Lo] ﹴ U+FE74 ARABIC KASRATAN ISOLATED FORM
NOT OK [Lo] ﹶ U+FE76 ARABIC FATHA ISOLATED FORM
NOT OK [Lo] ﹸ U+FE78 ARABIC DAMMA ISOLATED FORM
NOT OK [Lo] ﹺ U+FE7A ARABIC KASRA ISOLATED FORM
NOT OK [Lo] ﹼ U+FE7C ARABIC SHADDA ISOLATED FORM
NOT OK [Lo] ﹾ U+FE7E ARABIC SUKUN ISOLATED FORM
NOT OK [Lm] ゙ U+FF9E HALFWIDTH KATAKANA VOICED SOUND MARK
NOT OK [Lm] ゚ U+FF9F HALFWIDTH KATAKANA SEMI-VOICED SOUND MARK
That looks like quite a lot of exceptions! Some of these, like ﷺ, are made invalid by NFKC normalization (which I mentioned above). We can take a closer look at how exactly this affects the forbidden identifiers:
>>> all_unicode = map(chr, range(0x110000))
>>> for c in all_unicode:
...     category = unicodedata.category(c)
...     if category.startswith('L') or category == 'Nl':
...         if not c.isidentifier():
...             normform = unicodedata.normalize('NFKC', c)
...             print('NOT OK [{}] {} U+{:04X} → "{}" starts with [{}]'
...                   .format(category, c, ord(c), normform,
...                           unicodedata.category(normform[0])))
...
NOT OK [Lm] ͺ U+037A → " ͅ" starts with [Zs]
NOT OK [Lo] ำ U+0E33 → "ํา" starts with [Mn]
NOT OK [Lo] ຳ U+0EB3 → "ໍາ" starts with [Mn]
NOT OK [Lm] ⸯ U+2E2F → "ⸯ" starts with [Lm]
NOT OK [Lo] ﱞ U+FC5E → " ٌّ" starts with [Zs]
NOT OK [Lo] ﱟ U+FC5F → " ٍّ" starts with [Zs]
NOT OK [Lo] ﱠ U+FC60 → " َّ" starts with [Zs]
NOT OK [Lo] ﱡ U+FC61 → " ُّ" starts with [Zs]
NOT OK [Lo] ﱢ U+FC62 → " ِّ" starts with [Zs]
NOT OK [Lo] ﱣ U+FC63 → " ّٰ" starts with [Zs]
NOT OK [Lo] ﷺ U+FDFA → "صلى الله عليه وسلم" starts with [Lo]
NOT OK [Lo] ﷻ U+FDFB → "جل جلاله" starts with [Lo]
NOT OK [Lo] ﹰ U+FE70 → " ً" starts with [Zs]
NOT OK [Lo] ﹲ U+FE72 → " ٌ" starts with [Zs]
NOT OK [Lo] ﹴ U+FE74 → " ٍ" starts with [Zs]
NOT OK [Lo] ﹶ U+FE76 → " َ" starts with [Zs]
NOT OK [Lo] ﹸ U+FE78 → " ُ" starts with [Zs]
NOT OK [Lo] ﹺ U+FE7A → " ِ" starts with [Zs]
NOT OK [Lo] ﹼ U+FE7C → " ّ" starts with [Zs]
NOT OK [Lo] ﹾ U+FE7E → " ْ" starts with [Zs]
NOT OK [Lm] ゙ U+FF9E → "゙" starts with [Mn]
NOT OK [Lm] ゚ U+FF9F → "゚" starts with [Mn]
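Most of these characters end up including spaces after normalization: the Greek ypogegrammeni and most of the Arabic presentation forms normalize to strings that start with a space, while ﷺ and ﷻ expand into whole phrases with spaces between the words. The others are decomposed (due to the quirks of the ‘compatibility’ normalization NFKC, I assume) into sequences that start with a combining mark, which may continue an identifier but not start one.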
The odd one out, here, is ⸯ U+2E2F, the ‘VERTICAL TILDE’. This character, presumably by virtue of being a tilde, is actually excluded from XID_Start due to it having the property Pattern_Syntax.
Now let’s take a closer look at our ‘false negatives’, the non-letter characters that are allowed at the start of an identifier:
>>> all_unicode = map(chr, range(0x110000))
>>> for c in all_unicode:
...     category = unicodedata.category(c)
...     if not category.startswith('L') and category != 'Nl':
...         if c.isidentifier():
...             print('[{}] {} U+{:04X} {}'.format(
...                 category, c, ord(c), unicodedata.name(c)))
...
[Pc] _ U+005F LOW LINE
[Mn] ᢅ U+1885 MONGOLIAN LETTER ALI GALI BALUDA
[Mn] ᢆ U+1886 MONGOLIAN LETTER ALI GALI THREE BALUDA
[Sm] ℘ U+2118 SCRIPT CAPITAL P
[So] ℮ U+212E ESTIMATED SYMBOL
The low line, or underscore, has always been used in Python identifiers and is explicitly permitted, but not required, by the Unicode identifier standard. The others all have the Unicode property Other_ID_Start because they used to be considered letters, but no longer are in Unicode 9.0.0. ᢅ U+1885 and ᢆ U+1886 were only changed to category Mn in Unicode 9.0.0; if you run the code above on a Python version built with Unicode data older than 9.0.0, they won’t appear! ℘ U+2118 and ℮ U+212E were considered alphabetic until Unicode 2.0.14, but have been symbols since Unicode 3.0.0. Curiously, similar mathematical script symbols like ℒ U+2112 and ℬ U+212C have always been considered letters and compatibility-transform to L and B respectively under NFKC normalization. I really don’t know what to make of this.
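To make the difference concrete, here is a small sketch comparing the four symbols. It relies on the parser converting identifiers to NFKC, as PEP 3131 specifies; the dictionary name namespace is just a placeholder for this example.

```python
import unicodedata

# ℬ and ℒ carry compatibility decompositions to plain ASCII letters;
# ℘ and ℮ have no decomposition and normalize to themselves.
for c in 'ℬℒ℘℮':
    normform = unicodedata.normalize('NFKC', c)
    print('{} U+{:04X} [{}] -> {} U+{:04X}'.format(
        c, ord(c), unicodedata.category(c), normform, ord(normform)))

# Identifiers are normalized when source code is parsed, so assigning
# to ℬ actually binds the plain ASCII name B.
namespace = {}
exec('ℬ = 1', namespace)
print('B' in namespace)   # True
print('ℬ' in namespace)   # False
```

In other words, ℬ never survives as a name of its own, whereas ℘ does.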