Cleanup lexical structure of numbers and identifiers

This proposal cleans up and clarifies the lexical structure of numbers and identifiers.

Motivation

The Haskell 2010 report specifies the lexical structure of numbers and identifiers as follows.

First, character classes:

small          ascSmall | uniSmall | _
ascSmall       a | b | … | z
uniSmall       any Unicode lowercase letter

large          ascLarge | uniLarge
ascLarge       A | B | … | Z
uniLarge       any uppercase or titlecase Unicode letter

digit          ascDigit | uniDigit
ascDigit       0 | 1 | … | 9
uniDigit       any Unicode decimal digit
octit          0 | 1 | … | 7
hexit          digit | A | … | F | a | … | f

then numeric literals as

decimal         digit{digit}
octal           octit{octit}
hexadecimal     hexit{hexit}

and identifiers as

varid      (small {small | large | digit | ' })⟨reservedid⟩
conid      large {small | large | digit | ' }

There are two problems:

These issues are already partially fixed in GHC, but the fixes are neither complete nor documented.

Some people read the report, try things out in GHC, and find that they don’t work, as this StackOverflow question shows. The answer should be in the manual, not only in the source code.

Proposed Change Specification

A short summary: the collection of alphanumerical characters is divided into four groups:

  1. Large (upper) letters

  2. Small letters

  3. 0-9 digits

  4. The rest

The first three start conid, varid and decimal tokens, respectively. The rest cannot appear as the first character of any token. All four can be used as trailing characters in identifiers.

More precisely:

Extend the small character class to allow scripts without small/large character distinction (see Other Letter)

uniSmall       any Unicode lowercase letter or Other Letter

Introduce two new character groups, uniIdchar and idchar

uniIdchar    any Unicode Modifier Letter or NonSpacingMark
idchar       small | large | digit | uniIdchar | '

Change identifiers to

varid      (small {idchar})⟨reservedid⟩
conid      large {idchar}
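
The identifier rules above can be sketched in plain Haskell using Data.Char.generalCategory. This is only an illustration of the proposed classes, not GHC's lexer: the names CharGroup, classify, isIdChar, isVarid and isConid are hypothetical, the digit group follows this proposal's widened uniDigit (Decimal Number, Letter Number and Other Number), and the reserved-word list is the reservedid list from the Haskell 2010 report.

```haskell
import Data.Char (GeneralCategory (..), generalCategory)

-- Hypothetical names for the four groups of the summary, plus a
-- fifth constructor for characters that cannot occur in identifiers.
data CharGroup = Large | Small | Digit | TrailingOnly | Excluded
  deriving (Eq, Show)

-- Assign a character to a group, following the proposed classes.
classify :: Char -> CharGroup
classify c
  | c == '_'  = Small          -- the report puts '_' into small
  | c == '\'' = TrailingOnly   -- the prime is an idchar, never a leader
  | otherwise = case generalCategory c of
      UppercaseLetter -> Large
      TitlecaseLetter -> Large
      LowercaseLetter -> Small
      OtherLetter     -> Small          -- caseless scripts
      ModifierLetter  -> TrailingOnly   -- uniIdchar
      NonSpacingMark  -> TrailingOnly   -- uniIdchar
      DecimalNumber   -> Digit
      LetterNumber    -> Digit          -- new in this proposal
      OtherNumber     -> Digit
      _               -> Excluded

-- idchar: small | large | digit | uniIdchar | '
isIdChar :: Char -> Bool
isIdChar c = classify c /= Excluded

-- reservedid as listed in the Haskell 2010 report.
reservedIds :: [String]
reservedIds =
  [ "case", "class", "data", "default", "deriving", "do", "else"
  , "foreign", "if", "import", "in", "infix", "infixl", "infixr"
  , "instance", "let", "module", "newtype", "of", "then", "type"
  , "where", "_" ]

-- varid: a small character followed by idchars, excluding reserved words.
isVarid :: String -> Bool
isVarid s@(c : cs) =
  classify c == Small && all isIdChar cs && s `notElem` reservedIds
isVarid [] = False

-- conid: a large character followed by idchars.
isConid :: String -> Bool
isConid (c : cs) = classify c == Large && all isIdChar cs
isConid []       = False
```

Under this sketch, isVarid "year\8559\8559" (the name ending in two Roman numeral characters) is accepted, while isVarid "\8559\8559" is rejected because a Letter Number cannot lead an identifier.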

and numbers

digit          ascDigit | uniDigit
ascDigit       0 | 1 | … | 9
uniDigit       any Unicode Decimal Number, Letter Number or Other Number -- change
octit          0 | 1 | … | 7
hexit          ascDigit | A | … | F | a | … | f  -- digit to ascDigit

decimal        ascDigit{ascDigit}  -- digit to ascDigit
octal          octit{octit}
hexadecimal    hexit{hexit}
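
The split between the wide digit class and the ASCII-only numeric literals can be illustrated as follows. This is a minimal sketch with hypothetical function names, relying on the fact that Data.Char.isDigit selects only the ASCII digits '0' to '9'; it mirrors the grammar above and is not taken from GHC's sources.

```haskell
import Data.Char (GeneralCategory (..), generalCategory, isDigit)

-- ascDigit: Data.Char.isDigit accepts exactly '0'..'9'.
isAscDigit :: Char -> Bool
isAscDigit = isDigit

-- digit as proposed: any Decimal Number, Letter Number or Other Number.
-- ASCII digits are Decimal Numbers, so they are covered as well.
isProposedDigit :: Char -> Bool
isProposedDigit c =
  generalCategory c `elem` [DecimalNumber, LetterNumber, OtherNumber]

-- octit: 0 | 1 | ... | 7
isOctit :: Char -> Bool
isOctit c = c >= '0' && c <= '7'

-- hexit: ascDigit | A..F | a..f
isHexit :: Char -> Bool
isHexit c = isAscDigit c || (c >= 'A' && c <= 'F') || (c >= 'a' && c <= 'f')

-- decimal: ascDigit{ascDigit}, i.e. one or more ASCII digits.
isDecimalLiteral :: String -> Bool
isDecimalLiteral s = not (null s) && all isAscDigit s
```

For instance, '\1637' (ARABIC-INDIC DIGIT FIVE) and '\8559' (ROMAN NUMERAL ONE THOUSAND) both satisfy isProposedDigit, so they may trail in identifiers, but neither is accepted by isDecimalLiteral or isHexit.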

Additionally, the graphic token (which is used in rules for character and string literals) is extended with the new uniIdchar:

graphic    small | large | symbol | digit | uniIdchar | special | " | '

GHC’s $pragmachar, which doesn’t appear in the report, is extended in the same way:

$pragmachar = [$small $large $digit $uniidchar ]

There are two truly new changes. The first is abandoning the idea of a “decimal digit” class, which is marked with a ToDo in GHC’s Lexer.x: there would be just ASCII digits on one side and all other number characters on the other. The second is adding the Letter Number category to the uniDigit class (Other Number is already there). In the graphic token GHC already allows Letter Numbers, as that token is parsed manually and not by its Alex rule (this is a performance optimization).

With these changes, all Unicode general categories are assigned a place in GHC Haskell’s lexical structure; from the (edited) Lexer.x:

case generalCategory c of
  UppercaseLetter       -> upper
  LowercaseLetter       -> lower
  TitlecaseLetter       -> upper
  ModifierLetter        -> uniidchar -- see #10196
  OtherLetter           -> lower -- see #1103
  NonSpacingMark        -> uniidchar -- see #7650
  SpacingCombiningMark  -> other_graphic
  EnclosingMark         -> other_graphic
  DecimalNumber         -> digit
  LetterNumber          -> digit -- this proposal, previously other_graphic
  OtherNumber           -> digit -- see #4373
  ConnectorPunctuation  -> symbol
  DashPunctuation       -> symbol
  OpenPunctuation       -> other_graphic
  ClosePunctuation      -> other_graphic
  InitialQuote          -> other_graphic
  FinalQuote            -> other_graphic
  OtherPunctuation      -> symbol
  MathSymbol            -> symbol
  CurrencySymbol        -> symbol
  ModifierSymbol        -> symbol
  OtherSymbol           -> symbol
  Space                 -> space
  _other                -> non_graphic

Examples

The following definition

Prelude> yearⅯⅯ= 2000

<interactive>:3:5: error: lexical error at character '\8559'

doesn’t work in current GHC. With the proposed change it will:

ghci> yearⅯⅯ= 2000
ghci> yearⅯⅯ
2000

Using a Letter Number as the leading character of an identifier will continue to be disallowed:

ghci> ⅯⅯ = 2000

<interactive>:6:1: error: lexical error at character '\8559'

Also, Decimal Numbers other than 0-9 cannot be used in numeric literals:

ghci> ٥

<interactive>:10:1: error: lexical error at character '\1637'

This is the current, undocumented GHC behaviour, which deviates from the language report. There, any Unicode decimal digit is a valid character in, for example, an integer token.

Effect and Interactions

This proposal documents changes from

and fixes

Numeric Underscores

This proposal doesn’t interfere with numeric underscores. While the corresponding proposal specifies the change as

-decimal     →  digit{digit}
+decimal     →  digit{numSpacer digit}

it is in practice:

-decimal     →  ascDigit{ascDigit}
+decimal     →  ascDigit{numSpacer ascDigit}

so there is no conflict.
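
The ASCII-only reading of the numeric underscores rule can be sketched as a small recogniser. It assumes that numSpacer stands for zero or more underscores, which is how the rule is read here (otherwise a plain literal such as 12 would not match); the function name is hypothetical and the code is not GHC's implementation.

```haskell
import Data.Char (isDigit)

-- decimal -> ascDigit{numSpacer ascDigit}
-- Assumption: numSpacer is a possibly empty run of underscores.
-- Data.Char.isDigit accepts only the ASCII digits '0'..'9'.
isDecimalWithSpacers :: String -> Bool
isDecimalWithSpacers (c : cs)
  | isDigit c = go cs
  where
    -- After a digit: either the literal ends, or an optional run of
    -- underscores is followed by another ASCII digit.
    go []   = True
    go rest =
      case dropWhile (== '_') rest of
        (d : ds) | isDigit d -> go ds
        _                    -> False
isDecimalWithSpacers _ = False
```

With this reading, "1_000" and "1__0" are recognised, while "_1", "1_" and a literal written with non-ASCII Decimal Numbers are not.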

Costs and Drawbacks

The development costs are minimal; the code patch is included inline below. We also need to add tests and update the documentation. The $decdigit token could be removed completely in favour of $ascdigit, but that results in a slightly bigger diff.

--- a/compiler/GHC/Parser/Lexer.x
+++ b/compiler/GHC/Parser/Lexer.x
@@ -128,7 +128,7 @@ $tab         = \t

 $ascdigit  = 0-9
 $unidigit  = \x03 -- Trick Alex into handling Unicode. See [Unicode in Alex].
-$decdigit  = $ascdigit -- for now, should really be $digit (ToDo)
+$decdigit  = $ascdigit -- exactly $ascdigit, no more no less.
 $digit     = [$ascdigit $unidigit]

 $special   = [\(\)\,\;\[\]\`\{\}]
@@ -144,17 +144,17 @@ $unismall  = \x02 -- Trick Alex into handling Unicode. See [Unicode in Alex].
 $ascsmall  = [a-z]
 $small     = [$ascsmall $unismall \_]

+$uniidchar = \x07 -- Trick Alex into handling Unicode. See [Unicode in Alex].
+$idchar    = [$small $large $digit $uniidchar \']
+
 $unigraphic = \x06 -- Trick Alex into handling Unicode. See [Unicode in Alex].
-$graphic   = [$small $large $symbol $digit $special $unigraphic \"\']
+$graphic   = [$small $large $symbol $digit $idchar $special $unigraphic \"\']

 $binit     = 0-1
 $octit     = 0-7
 $hexit     = [$decdigit A-F a-f]

-$uniidchar = \x07 -- Trick Alex into handling Unicode. See [Unicode in Alex].
-$idchar    = [$small $large $digit $uniidchar \']
-
-$pragmachar = [$small $large $digit]
+$pragmachar = [$small $large $digit $uniidchar ]

 $docsym    = [\| \^ \* \$]

@@ -2434,7 +2434,7 @@ adjustChar c = fromIntegral $ ord adj_c
                   SpacingCombiningMark  -> other_graphic
                   EnclosingMark         -> other_graphic
                   DecimalNumber         -> digit
-                  LetterNumber          -> other_graphic
+                  LetterNumber          -> digit
                   OtherNumber           -> digit -- see #4373
                   ConnectorPunctuation  -> symbol
                   DashPunctuation       -> symbol

None of GHC’s own tests failed with this change.

Alternatives

Should LetterNumber be small? Then it could start a varid, for example

  :: Int
  = 12

Letter Numbers are letter-like. By the same reasoning, we could argue that Other Numbers should also be able to appear as a leading varid character.

Having Decimal Numbers sans 0-9 parsed as small is yet another option. Agda goes that far, but it is a very lexically liberal language.

Alternatively, Decimal Numbers could be allowed in numeric literals, as the report specifies, perhaps only with the UnicodeSyntax extension enabled. As long as Decimal Numbers cannot lead identifier tokens, this won’t cause a language fork.

Relatedly, we may ask why Other Letters are considered small, and not just idchar (i.e. caseless characters). This was an arbitrary choice made 14 years ago, see GHC issue #1103.

Again, this proposal makes the conservative choice and doesn’t propose any change there.

There are also ideas for a more comprehensive lexical overhaul of the language (e.g. https://github.com/blamario/rfcs/blob/unicode-identifers/0000-unicode-identifers.rst), but they are a lot more controversial.