tjol.eu

  • Unicode Identifiers in Python 3

    This post was inspired by an off-topic email chain on the python-ideas mailing list involving Steven D’Aprano, Chris Angelico, and me.

    Here is a copy of the email that started it.

    Back in the dark ages, names for variables in Python (up to and including version 2.7, that is) and many other programming languages could include ASCII letters, underscores, and, after the first character, ASCII digits. Or, to put it as a regular expression: /^[A-Za-z_][A-Za-z0-9_]*$/.
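    As a rough illustration, the old rule can be checked with the standard re module. The helper name is_ascii_identifier is made up for this sketch, and it deliberately ignores reserved words such as `class`, which match the pattern but are not usable as names.

```python
import re

# The ASCII-only rule quoted above. re.fullmatch makes the explicit
# ^ and $ anchors unnecessary.
ASCII_IDENTIFIER = re.compile(r'[A-Za-z_][A-Za-z0-9_]*')


def is_ascii_identifier(name):
    """Return True if name matches the old ASCII-only identifier pattern.

    Hypothetical helper for illustration; keywords are not excluded.
    """
    return ASCII_IDENTIFIER.fullmatch(name) is not None


print(is_ascii_identifier('spam_2'))         # True
print(is_ascii_identifier('2spam'))          # False
print(is_ascii_identifier('gänseblümchen'))  # False
```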

    Now, most humans don’t lead very ASCII lives (ASCII is not even really suitable for writing English text). These days, with Unicode being practically universally adopted, this old requirement looks a bit daft, and thankfully, Python 3 introduced full Unicode support for source code, including Unicode in identifier names. Nowadays, this is perfectly valid Python:

    >>> gänseblümchen = '🌼'
    >>> print(gänseblümchen)
    🌼

    So what exactly is allowed in identifier names? The naïve but widespread assumption would be that you can use any Unicode letter, an underscore, and, after the first character, any Unicode number. This is why Steven D’Aprano was so surprised (as was I!) when he discovered that ‘℘’ is a valid Python identifier even though it is not a letter but a mathematical symbol!

    Unicode assigns every character a category. The categories are:

    • Letters (Lu, Ll, Lt, Lm, Lo), e.g. Ω (Lu)
    • Marks (Mn, Mc, Me), i.e. diacritics &c
    • Numbers (Nd, Nl, No), e.g. ٣ (Nd)
    • Punctuation (P…), e.g. ‽ (Po)
    • Symbols (S…), e.g. 𝄞 (So)
    • Separators (Z…), e.g. the space (Zs)
    • Other (C…), e.g. control characters
    >>> import unicodedata
    >>> ℘ = 1
    >>> unicodedata.category('℘'), unicodedata.name('℘')
    ('Sm', 'SCRIPT CAPITAL P')
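    The categories listed above can be looked up in the same way. The following sketch picks one sample character per class; the choice of samples is mine, and they are written as escapes so that the code points are unambiguous.

```python
import unicodedata

# One sample character per major class.
samples = [
    '\u03a9',      # GREEK CAPITAL LETTER OMEGA
    '\u0301',      # COMBINING ACUTE ACCENT
    '\u0663',      # ARABIC-INDIC DIGIT THREE
    '\u203d',      # INTERROBANG
    '\U0001d11e',  # MUSICAL SYMBOL G CLEF
    ' ',           # SPACE
    '\n',          # a control character (line feed)
]

# unicodedata.category returns the two-letter general category.
categories = [unicodedata.category(c) for c in samples]
print(categories)  # ['Lu', 'Mn', 'Nd', 'Po', 'So', 'Zs', 'Cc']
```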

    In actual fact, ‘℘’ is the only mathematical symbol that can be used this way in Python. What’s going on here?

    A close reading of the specification in PEP 3131 reveals that instead of simply allowing Unicode letters and numbers, Python uses NFKC normalization and refers to the character properties XID_Start and XID_Continue as defined by the Unicode standard.

    (Significant changes to Python have been defined and documented in Python Enhancement Proposals, or PEPs, since Python 2.0.)

    Among other things, NFKC normalization ensures that á is one letter rather than a letter (a) and a mark (◌́).
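    A small sketch of what this normalization does in practice. The first part composes a letter and a combining mark into one code point; the second part uses exec with a throwaway dict (the name namespace is just for this example) to show that a name typed with the ligature ﬁ (U+FB01) ends up stored under its NFKC form.

```python
import unicodedata

# 'a' followed by COMBINING ACUTE ACCENT composes to a single code point.
decomposed = 'a\u0301'
composed = unicodedata.normalize('NFKC', decomposed)
print(len(decomposed), len(composed))  # 2 1

# Identifiers are normalized when source code is parsed: a name written
# with LATIN SMALL LIGATURE FI (U+FB01) is stored as plain 'file'.
namespace = {}
exec('\ufb01le = 1', namespace)
print('file' in namespace)       # True
print('\ufb01le' in namespace)   # False
```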

    The standard defines XID_Start to include all letters (Lu, Ll, Lt, Lm, Lo), all letter-numbers (Nl), and everything with a mysterious property called Other_ID_Start, minus some things that are syntax-like or whitespace-like, plus/minus some technicalities. XID_Continue adds decimal digits (Nd), combining marks (Mn, Mc), connector punctuation (Pc), and some more technicalities.
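    The unicodedata module does not expose XID_Start or XID_Continue directly, but str.isidentifier() can serve as a rough probe for both. The two helper functions below are hypothetical names for this sketch, not part of any library; note that can_start also accepts the underscore, which Python allows in addition to XID_Start.

```python
def can_start(c):
    """Rough stand-in for XID_Start (plus the underscore)."""
    return c.isidentifier()


def can_continue(c):
    """Rough stand-in for XID_Continue: test c after a known-good start."""
    return ('a' + c).isidentifier()


digit = '\u0663'   # ARABIC-INDIC DIGIT THREE, category Nd
accent = '\u0301'  # COMBINING ACUTE ACCENT, category Mn
omega = '\u03a9'   # GREEK CAPITAL LETTER OMEGA, category Lu

print(can_start(digit), can_continue(digit))    # False True
print(can_start(accent), can_continue(accent))  # False True
print(can_start(omega), can_continue(omega))    # True True
```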

    So, with all said and done, what are the exceptions to the simplistic ‘identifiers can start with letters’ rule-of-thumb? Let’s find out!

    Note that I’m deliberately demonstrating this using unreleased Python 3.7 rather than a final version. It will work in older versions of Python 3, but the output will, interestingly enough, be different.

    Python 3.7.0a0 (heads/master:21c2dd7, Jun  4 2017, 15:18:26) 
    [GCC 5.4.0 20160609] on linux
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import unicodedata
    >>> all_unicode = map(chr, range(0x110000))
    >>> for c in all_unicode:
    ...     category = unicodedata.category(c)
    ...     if category.startswith('L') or category == 'Nl':
    ...         # Letters and letter-numbers should be OK
    ...         if not c.isidentifier():
    ...             print('NOT OK [{}] {} U+{:04X}  {}'.format(
    ...                   category, c, ord(c), unicodedata.name(c)))
    ...     else:
    ...         if c.isidentifier():
    ...             print('    OK [{}] {} U+{:04X}  {}'.format(
    ...                   category, c, ord(c), unicodedata.name(c)))
    ... 
        OK [Pc] _ U+005F  LOW LINE
    NOT OK [Lm] ͺ U+037A  GREEK YPOGEGRAMMENI
    NOT OK [Lo] ำ U+0E33  THAI CHARACTER SARA AM
    NOT OK [Lo] ຳ U+0EB3  LAO VOWEL SIGN AM
        OK [Mn] ᢅ U+1885  MONGOLIAN LETTER ALI GALI BALUDA
        OK [Mn] ᢆ U+1886  MONGOLIAN LETTER ALI GALI THREE BALUDA
        OK [Sm] ℘ U+2118  SCRIPT CAPITAL P
        OK [So] ℮ U+212E  ESTIMATED SYMBOL
    NOT OK [Lm] ⸯ U+2E2F  VERTICAL TILDE
    NOT OK [Lo] ﱞ U+FC5E  ARABIC LIGATURE SHADDA WITH DAMMATAN ISOLATED FORM
    NOT OK [Lo] ﱟ U+FC5F  ARABIC LIGATURE SHADDA WITH KASRATAN ISOLATED FORM
    NOT OK [Lo] ﱠ U+FC60  ARABIC LIGATURE SHADDA WITH FATHA ISOLATED FORM
    NOT OK [Lo] ﱡ U+FC61  ARABIC LIGATURE SHADDA WITH DAMMA ISOLATED FORM
    NOT OK [Lo] ﱢ U+FC62  ARABIC LIGATURE SHADDA WITH KASRA ISOLATED FORM
    NOT OK [Lo] ﱣ U+FC63  ARABIC LIGATURE SHADDA WITH SUPERSCRIPT ALEF ISOLATED FORM
    NOT OK [Lo] ﷺ U+FDFA  ARABIC LIGATURE SALLALLAHOU ALAYHE WASALLAM
    NOT OK [Lo] ﷻ U+FDFB  ARABIC LIGATURE JALLAJALALOUHOU
    NOT OK [Lo] ﹰ U+FE70  ARABIC FATHATAN ISOLATED FORM
    NOT OK [Lo] ﹲ U+FE72  ARABIC DAMMATAN ISOLATED FORM
    NOT OK [Lo] ﹴ U+FE74  ARABIC KASRATAN ISOLATED FORM
    NOT OK [Lo] ﹶ U+FE76  ARABIC FATHA ISOLATED FORM
    NOT OK [Lo] ﹸ U+FE78  ARABIC DAMMA ISOLATED FORM
    NOT OK [Lo] ﹺ U+FE7A  ARABIC KASRA ISOLATED FORM
    NOT OK [Lo] ﹼ U+FE7C  ARABIC SHADDA ISOLATED FORM
    NOT OK [Lo] ﹾ U+FE7E  ARABIC SUKUN ISOLATED FORM
    NOT OK [Lm] ゙ U+FF9E  HALFWIDTH KATAKANA VOICED SOUND MARK
    NOT OK [Lm] ゚ U+FF9F  HALFWIDTH KATAKANA SEMI-VOICED SOUND MARK

    That looks like quite a lot of exceptions! Some of these, like ﷺ, are made invalid by NFKC normalization (which I mentioned above). We can take a closer look at how exactly this affects the forbidden identifiers:

    (ﷺ is a presentation form of the phrase ‘صلى اللّٰه عليه وسلم’, peace be upon him, which is commonly used when referring to the prophet Muhammad. It turns out Arabic writing is a bit more colourful and fun than Latin.)
    >>> all_unicode = map(chr, range(0x110000))
    >>> for c in all_unicode:
    ...     category = unicodedata.category(c)
    ...     if category.startswith('L') or category == 'Nl':
    ...         if not c.isidentifier():
    ...             normform = unicodedata.normalize('NFKC', c)
    ...             print('NOT OK [{}] {} U+{:04X} → "{}" starts with [{}]'
    ...                   .format(category, c, ord(c), normform,
    ...                           unicodedata.category(normform[0])))
    ... 
    NOT OK [Lm] ͺ U+037A → " ͅ" starts with [Zs]
    NOT OK [Lo] ำ U+0E33 → "ํา" starts with [Mn]
    NOT OK [Lo] ຳ U+0EB3 → "ໍາ" starts with [Mn]
    NOT OK [Lm] ⸯ U+2E2F → "ⸯ" starts with [Lm]
    NOT OK [Lo] ﱞ U+FC5E → " ٌّ" starts with [Zs]
    NOT OK [Lo] ﱟ U+FC5F → " ٍّ" starts with [Zs]
    NOT OK [Lo] ﱠ U+FC60 → " َّ" starts with [Zs]
    NOT OK [Lo] ﱡ U+FC61 → " ُّ" starts with [Zs]
    NOT OK [Lo] ﱢ U+FC62 → " ِّ" starts with [Zs]
    NOT OK [Lo] ﱣ U+FC63 → " ّٰ" starts with [Zs]
    NOT OK [Lo] ﷺ U+FDFA → "صلى الله عليه وسلم" starts with [Lo]
    NOT OK [Lo] ﷻ U+FDFB → "جل جلاله" starts with [Lo]
    NOT OK [Lo] ﹰ U+FE70 → " ً" starts with [Zs]
    NOT OK [Lo] ﹲ U+FE72 → " ٌ" starts with [Zs]
    NOT OK [Lo] ﹴ U+FE74 → " ٍ" starts with [Zs]
    NOT OK [Lo] ﹶ U+FE76 → " َ" starts with [Zs]
    NOT OK [Lo] ﹸ U+FE78 → " ُ" starts with [Zs]
    NOT OK [Lo] ﹺ U+FE7A → " ِ" starts with [Zs]
    NOT OK [Lo] ﹼ U+FE7C → " ّ" starts with [Zs]
    NOT OK [Lo] ﹾ U+FE7E → " ْ" starts with [Zs]
    NOT OK [Lm] ゙ U+FF9E → "゙" starts with [Mn]
    NOT OK [Lm] ゚ U+FF9F → "゚" starts with [Mn]

    The Arabic presentation forms end up including spaces. Many of the others are decomposed (due to the quirks of the ‘compatibility’ normalization NFKC, I assume) and turn into a combination of marks and spaces.

    The odd one out, here, is ⸯ U+2E2F, the ‘VERTICAL TILDE’. This character, presumably by virtue of being a tilde, is actually excluded from XID_Start due to it having the property Pattern_Syntax.

    Now let’s take a closer look at our ‘false negatives’, the non-letter characters that are allowed at the start of an identifier:

    >>> all_unicode = map(chr, range(0x110000))
    >>> for c in all_unicode:
    ...     category = unicodedata.category(c)
    ...     if not category.startswith('L') and category != 'Nl':
    ...         if c.isidentifier():
    ...             print('[{}] {} U+{:04X} {}'.format(
    ...                   category, c, ord(c), unicodedata.name(c)))
    ... 
    [Pc] _ U+005F LOW LINE
    [Mn] ᢅ U+1885 MONGOLIAN LETTER ALI GALI BALUDA
    [Mn] ᢆ U+1886 MONGOLIAN LETTER ALI GALI THREE BALUDA
    [Sm] ℘ U+2118 SCRIPT CAPITAL P
    [So] ℮ U+212E ESTIMATED SYMBOL

    The low line, or underscore, has always been used in Python identifiers and is explicitly permitted, but not required, by the Unicode identifier standard. The others all have the Unicode property Other_ID_Start because they used to be considered letters, but no longer are in Unicode 9.0.0. ᢅ U+1885 and ᢆ U+1886 were only changed to category Mn in Unicode 9.0.0; if you run the code above in Python versions before 3.7, they won’t appear! ℘ U+2118 and ℮ U+212E were considered alphabetic until Unicode 2.0.14, but have been symbols since Unicode 3.0.0. Curiously, similar mathematical script symbols like ℒ U+2112 and ℬ U+212C have always been considered letters and compatibility-transform to L and B respectively under NFKC normalization. I really don’t know what to make of this.
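    The contrast between the two pairs of script symbols can be checked directly. This is only a sketch: the categories printed depend on the Unicode version bundled with the interpreter, so no output is shown for the loop, and the dict called namespace exists purely for the demonstration.

```python
import unicodedata

# ℒ U+2112, ℬ U+212C, ℘ U+2118, ℮ U+212E
symbols = '\u2112\u212c\u2118\u212e'

# Map each symbol to its NFKC form.
nfkc = {c: unicodedata.normalize('NFKC', c) for c in symbols}

for c in symbols:
    print('U+{:04X}'.format(ord(c)), unicodedata.category(c),
          repr(nfkc[c]), c.isidentifier())

# Because identifiers are NFKC-normalized, ℒ and L name the same variable,
# whereas ℘ has no compatibility mapping and stays a name of its own.
namespace = {}
exec('\u2112 = 1\n\u2118 = 2', namespace)
print('L' in namespace, '\u2118' in namespace)  # True True
```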

    Creative Commons License This article is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

    Thomas Jollans

    2017-06-04