PIP 0112 - Unicode support for Prolog
[Note: If you have comments please post them at the Prolog Community Discourse for this PIP]
1. Motivation
ISO Prolog (ISO/IEC 13211-1) specifies the Prolog source syntax in ASCII. Modern programs routinely use Unicode in atoms, strings, identifiers, and even operators, but the way each implementation extends the source syntax to Unicode is ad-hoc, undocumented, or implementation-specific. This proposal sketches a coherent, common direction for Unicode handling in standard Prolog, drawing on implementation experience from SWI-Prolog and Ciao Prolog and on the wider ecosystem of languages that have addressed the same questions (Python, Rust, Julia, C#, Swift, Ada).
The proposal covers ten areas:

- Source-text identifier syntax (UAX #31).
- Tokenisation of symbol and punctuation characters (solo vs. glueing).
- Paired delimiters: Unicode bracket pairs (Ps/Pe) and quote pairs (Pi/Pf) as standalone term and string syntaxes.
- Whitespace.
- Escape sequences for code points (`\u`, `\U`).
- Code-point semantics of `atom_codes/2`.
- Numbers: ASCII at the source level, any Unicode `Nd` block at runtime conversion, with same-block-per-number and ASCII syntax characters.
- Grapheme clusters and string-level Unicode operations.
- Column-based stream positioning and column-aware `format/2`.
- Pluggable Unicode normalisation for the term reader and writer (opt-in, off by default).
The proposal does not require unconditional NFC normalisation of every atom — Prolog atoms double as byte-faithful identifiers and as containers for arbitrary text, so silently rewriting them is unsafe (see §4.1). Confusable detection (UTS #39) is out of scope.
2. Background: Unicode terminology
UAX #31 ("Unicode Identifier and Pattern Syntax") defines the properties most relevant here:
- `ID_Start`, `ID_Continue` — base identifier sets, derived from General Category.
- `XID_Start`, `XID_Continue` — the same sets, modified to be closed under NFKC normalisation. UAX #31 recommends `XID_*` for new language designs.
- `Pattern_White_Space` — an immutable, deliberately small set of whitespace code points: `U+0009..U+000D`, `U+0020`, `U+0085`, `U+200E`, `U+200F`, `U+2028`, `U+2029`. Note that `U+00A0` (NBSP) is not in this set.
Unicode general categories used below: `Lu` (uppercase letter), `Ll` (lowercase letter), `Lt` (titlecase letter), `Lm` (modifier letter), `Lo` (other letter), `Nd` (decimal digit), `Nl` (letter number), `No` (other number), `Sm`/`Sc`/`Sk`/`So` (symbols), `Pc`/`Pd`/`Ps`/`Pe`/`Pi`/`Pf`/`Po` (punctuation).
3. Comparison: Ciao Prolog vs. SWI-Prolog
Both implementations classify each Unicode code point into a small syntax class. The data structures differ:
- Ciao Prolog (`core/engine/unicode_gen/unicode_gen.c`) uses an exclusive enum: every code point belongs to exactly one of `LAYOUT`, `LOWERCASE`, `UPPERCASE`, `DIGIT`, `SYMBOL`, `PUNCT`, `IDCONT`, `INVALID`.
- SWI-Prolog (`src/Unicode/prolog_syntax_map.pl`) uses a bitmask: each code point may carry any combination of `id_start`, `id_continue`, `uppercase`, `layout`, `symbol`, `solo`, `other`, `decimal`.
The category-by-category mapping is summarised below; cells marked diff are deliberate divergences worth discussing in the working group.
| Unicode | Ciao Prolog | SWI-Prolog (current Unicode branch) |
|---|---|---|
| Zs, Zl, Zp | LAYOUT | (not specifically classified — see Pattern_White_Space) |
| Cc bidi WS/S/B (TAB, LF, ...) | LAYOUT | layout (via Pattern_White_Space) |
| Other Cc, Cf, Cs, Co, Cn | INVALID | other |
| Sm, Sc, Sk, So | SYMBOL | solo (diff: Ciao glues, SWI does not) |
| Pc, Pd, Po | SYMBOL | solo (diff) |
| Ps, Pe | SYMBOL | bracket paired delimiter, see §4.3 (diff) |
| Pi, Pf | SYMBOL | quote paired delimiter, see §4.3 (diff) |
| Lu (with XID_Start) | UPPERCASE | id_start + uppercase |
| Lt (with XID_Start) | UPPERCASE | id_start (diff: Ciao starts variable, SWI starts atom) |
| Ll, Lo, Lm, Nl (with XID_Start) | LOWERCASE | id_start |
| XID_Continue \ XID_Start | IDCONT | id_continue |
| No | IDCONT (diff) | other (not in identifier set) |
| Me | INVALID | other |
| Nd | IDCONT | id_continue + decimal |
The points where the two implementations actually disagree are:
- `No` (superscript and subscript digits, fractions, Roman numerals, circled digits, ...): Ciao puts the whole category in `IDCONT`; SWI-Prolog explicitly does not, but extends `XID_Continue` with the digit-shaped subset (super- and subscript digits) only.
- `Lt` (titlecase letters): Ciao starts a variable; SWI-Prolog (in this proposal) starts an atom because the Lu-only uppercase rule is simpler and more predictable than the broader derived `Uppercase` property.
- Sm/Sc/Sk/So + Pc/Pd/Po: Ciao glues runs of these into compound atoms (like ASCII `==`); SWI-Prolog treats each as a solo atom (see §4.2).
- Ps/Pe and Pi/Pf: Ciao glues these together with other symbols. SWI-Prolog treats each pair as a paired delimiter that produces a `'<open><close>'/1` compound: brackets wrap a term, quotes wrap literal text per the `double_quotes` flag (see §4.3).
4. Recommendations
4.1 Identifier handling
- Use `XID_Start` and `XID_Continue` as the base identifier set. These are the UAX #31-recommended sets; they exclude a small number of code points that are not closed under NFKC and are preferable for new language designs. (Most language standards written or revised after ~2010 use `XID_*`: Rust, Julia, C#, Swift, Ada.)
- Allow super- and subscript digits as identifier continuation, by an explicit profile addition to `XID_Continue`: `U+00B2`, `U+00B3`, `U+00B9`, `U+2070`, `U+2074..U+2079` (superscripts) and `U+2080..U+2089` (subscripts). This permits variables of the form `X²`, `Y₁`, which are common in mathematical and physical code. Julia is the precedent.
- Determine variable vs. atom from the start character: a variable starts with `_` or with a code point in general category `Lu`. Everything else identifier-like starts an atom. This is simpler than reading the derived `Uppercase` property and means titlecase letters (`Lt`) start atoms.
- Do not normalise atoms automatically on every creation path. Prolog atoms double as byte-faithful identifiers and as containers for arbitrary text (filenames, network input, JSON keys). Silently rewriting them on every `atom_codes/2`, `atom_concat/3`, or stream read would be a leaky abstraction. Make normalisation an explicit, opt-in operation: standard library predicates of the form `unicode_nfc/2`, `unicode_nfd/2`, `unicode_nfkc/2`, `unicode_nfkd/2`, `unicode_nfkc_casefold/2`, plus a per-call option on the term reader and force-quoting of denormalised text by the writer (see §4.11). Implementations may use `utf8proc` (a small, MIT-licensed C library originating in Julia) for the underlying transformation.
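The start-character rule and the super-/subscript profile translate directly into a classifier. The following is an illustrative Python sketch, not part of the proposal: Python's `str.isidentifier()` tests the same `XID_Start`/`XID_Continue` properties (PEP 3131), and the function names `token_kind`/`id_continue` are ours.

```python
import unicodedata

# Profile addition from this section: super-/subscript digits as id-continue.
SUPER_SUB_DIGITS = {0x00B2, 0x00B3, 0x00B9, 0x2070,
                    *range(0x2074, 0x207A),   # U+2074..U+2079
                    *range(0x2080, 0x208A)}   # U+2080..U+2089

def token_kind(ch):
    """Classify the first character of an identifier-like token."""
    if ch == '_' or unicodedata.category(ch) == 'Lu':
        return 'variable'          # _ or uppercase letter (Lu only)
    if ch.isidentifier():          # Python's XID_Start test (PEP 3131)
        return 'atom'              # Ll, Lo, Lm, Nl -- and Lt
    return 'other'

def id_continue(ch):
    # XID_Continue (via Python's identifier rules) plus the explicit profile.
    return ('a' + ch).isidentifier() or ord(ch) in SUPER_SUB_DIGITS

assert token_kind('X') == 'variable'
assert token_kind('x') == 'atom'
assert token_kind('ǅ') == 'atom'   # Lt titlecase starts an atom
assert id_continue('²')            # profile addition, not in XID_Continue
```

Note how the `²` case only passes because of the explicit profile set — it is *not* in `XID_Continue`, which is exactly why the proposal spells the addition out.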
4.2 Symbols
- Unicode symbols and non-bracket / non-quote punctuation form solo atoms. General categories `Sm`, `Sc`, `Sk`, `So`, `Pc`, `Pd`, `Po` each form a single-code-point atom; they do not glue with adjacent symbols. ASCII keeps its existing behaviour so that established operator tokens (`==`, `=..`, `:-`, `\+`, ...) continue to parse.
- Operators built from Unicode symbols are declared with `op/3`. Math-heavy code can declare `≤`, `≥`, `∈`, `∪`, `∩`, `→`, etc. as ordinary operators. Implementations that wish to ship a default operator table for common mathematical symbols may do so as a library, not as part of the core grammar.
The opening / closing punctuation
(Ps/Pe) and initial / final
quotation classes (Pi/Pf) are
not solo; they are paired delimiters with their
own syntax described in §4.3.
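The solo-vs-glue distinction can be sketched as a toy tokeniser. This is illustrative Python, not the implementation; the `ASCII_SYMBOL` set below only approximates the ISO symbol-char set.

```python
import unicodedata

# Illustrative: ASCII symbol chars glue into maximal runs (ISO behaviour);
# the listed Unicode general categories are solo per this section.
ASCII_SYMBOL = set("+-*/\\^<>=~:.?@#&$")
SOLO_CATS = {'Sm', 'Sc', 'Sk', 'So', 'Pc', 'Pd', 'Po'}

def symbol_tokens(s):
    """Split a run of symbol characters: ASCII glues, Unicode is solo."""
    out, i = [], 0
    while i < len(s):
        c = s[i]
        if c in ASCII_SYMBOL:            # glue a maximal ASCII symbol run
            j = i
            while j < len(s) and s[j] in ASCII_SYMBOL:
                j += 1
            out.append(s[i:j])
            i = j
        elif unicodedata.category(c) in SOLO_CATS:
            out.append(c)                # one code point, one atom
            i += 1
        else:
            raise ValueError(f"not a symbol character: {c!r}")
    return out

assert symbol_tokens('=..') == ['=..']     # ASCII glueing preserved
assert symbol_tokens('≤≥') == ['≤', '≥']   # two solo atoms, no glueing
```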
4.2.1 Round-trip safety: pattern_syntax_solo
The classification of
Sm/Sc/Sk/So/Pc/Pd/Po
is derived from live general categories, so it grows with
each Unicode release (see §5, open question 3). A
canonical-form writer that relies on these categories may
emit text that becomes ambiguous or unreadable if a future
Unicode version reclassifies one of the code points.
To keep written terms stable across Unicode upgrades,
the writer recognises a pattern_syntax_solo
option that quotes any single-character atom whose code
point is not in the immutable UAX #31
Pattern_Syntax set. Multi-character atoms,
atoms built from identifier or symbol characters, and
atoms whose only character is in
Pattern_Syntax are unaffected.
?- write_canonical(+). % U+002B, Pattern_Syntax
+
?- write_canonical(€). % U+20AC, not Pattern_Syntax
'€'
?- write_canonical('🎉'). % U+1F389, not Pattern_Syntax
'🎉'
?- write_term('€', [quoted(true)]).
€
?- write_term('€', [quoted(true), pattern_syntax_solo(true)]).
'€'
write_canonical/1 enables the option by
default; writeq/1 and the unquoted-friendly
default of write_term/2 do not, so existing
programs that round-trip through non-canonical output keep
their current behaviour.
code_type(C, pattern_syntax) /
char_type(C, pattern_syntax) expose
membership for inspection.
This is the writer-side counterpart to open question 3 in §5: the reader retains the broad-categories surface (post-4.1 currency, emoji, dingbats all work as solo input) while the canonical writer restricts to the immutable subset.
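The writer-side decision reduces to a membership test. Here is an illustrative Python sketch in which a small ASCII sample stands in for the full UAX #31 `Pattern_Syntax` data (the real set has 2 760 members); `write_solo_atom` is a hypothetical name, not a proposed predicate.

```python
# Tiny stand-in for Pattern_Syntax; real data comes from UAX #31 / PropList.
PATTERN_SYNTAX_SAMPLE = set('!#$%&*+-./:<=>?@^`|~\\')

def write_solo_atom(atom, pattern_syntax_solo=False):
    """Quote a single-character solo atom unless its code point is in the
    immutable Pattern_Syntax set (stable across Unicode upgrades)."""
    if (pattern_syntax_solo and len(atom) == 1
            and atom not in PATTERN_SYNTAX_SAMPLE):
        return f"'{atom}'"       # quote for Unicode-version stability
    return atom

assert write_solo_atom('+', True) == '+'      # U+002B is Pattern_Syntax
assert write_solo_atom('€', True) == "'€'"    # U+20AC is not
assert write_solo_atom('€', False) == '€'     # default writer: unquoted
```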
4.3 Paired delimiters: brackets and quotes
The opening / closing punctuation classes
Ps/Pe and the initial / final
quotation classes Pi/Pf are
paired delimiters. An opening character followed
by content and the matching closing character reads as a
unary compound whose functor is the two delimiter code
points joined, generalising {Term} ⇒
'{}'(Term) to the full Unicode pair set.
Brackets wrap a Prolog term; quotes wrap
literal text.
4.3.1 Brackets (Ps/Pe)
?- read_term_from_atom('〈foo, bar〉', T, []).
T = '〈〉'((foo, bar)).
?- read_term_from_atom('⟦x+y⟧', T, []).
T = '⟦⟧'(x+y).
The content is parsed as a Prolog term — operators are
honoured, nesting works, and the ASCII bracket behaviour
({T} ⇒ '{}'(T)) falls out as a
special case. The pair table is sourced from Unicode
BidiMirroring.txt filtered to
Ps/Pe (about 60 pairs in Unicode
17, including angle, corner, ceiling, floor, mathematical,
ornamental, fullwidth and CJK brackets).
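The functor-forming rule can be illustrated with a toy pair table. This is a hand-picked subset for the sketch; the proposal's real table is generated from BidiMirroring.txt filtered to Ps/Pe.

```python
# Illustrative subset; the full table has about 60 Ps/Pe pairs.
BRACKET_PAIRS = {'(': ')', '[': ']', '{': '}',
                 '〈': '〉', '⟦': '⟧', '「': '」'}

def paired_functor(open_ch):
    """'{' -> '{}', '〈' -> '〈〉': the two delimiter code points joined
    form the functor of the unary compound wrapping the content."""
    return open_ch + BRACKET_PAIRS[open_ch]

assert paired_functor('{') == '{}'     # ASCII falls out as a special case
assert paired_functor('〈') == '〈〉'
```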
4.3.2 Quotes (Pi/Pf)
?- read_term_from_atom('«hello, world»', T, []).
T = '«»'("hello, world").
?- read_term_from_atom('“abc”', T, []).
T = '“”'("abc").
The content is treated as literal
text, with the same escape-sequence support as
ASCII quoted strings (\n, \t,
\uXXXX, \UXXXXXXXX, ...). The
contained value is converted to the type selected by the
double_quotes flag — string by default, also
atom, codes, or chars — and wrapped in the
'<open><close>'/1 compound. The
pair table comes from the Pi/Pf
entries of BidiMirroring.txt (eight pairs
covering «», ‹›, and a handful
of Supplemental Punctuation marks) plus the standard
left/right curly pairs U+2018/U+2019 (‘’) and
U+201C/U+201D (“”), which have
Bidi_Mirrored=No and are absent from
BidiMirroring.txt.
4.3.3 code_type/2 accessors
The classification predicates expose paren and quote relationships:
?- char_type('⟨', paren(C)).
C = '⟩'.
?- char_type(O, paren('」')).
O = '「'.
?- char_type('«', quote(C)).
C = '»'.
?- char_type('"', quote(C)).
C = '"'. % ASCII quotes have Close = Char.
paren(Close) covers ASCII (),
[], {} plus every Unicode
Ps/Pe pair.
quote(Close) covers the ASCII quotes
', ", ` (where
Close equals Char) plus every
Unicode Pi/Pf pair. Both
directions of the mapping are reversible.
The classification predicates also expose
pattern_syntax for direct membership tests
against the immutable UAX #31 set (see §4.2.1):
?- code_type(0x002B, pattern_syntax). % '+'
true.
?- code_type(0x20AC, pattern_syntax). % '€'
false.
4.3.4 Error semantics
Mismatched closes (e.g. «hello]) and
unmatched opens (e.g. «hello without a
closing ») raise syntax_error/1.
There is no fallback to a standalone solo atom; an opening
delimiter never appears in a bare term position.
4.4 White space
- Layout characters are exactly `Pattern_White_Space` (UAX #31 R3a). The set is small and immutable, so a tokenizer written against it today behaves identically under any future Unicode release.
- `U+00A0` (NBSP) is not whitespace. It is deliberately excluded from `Pattern_White_Space`. Programs that paste from word processors will occasionally encounter NBSP in the wrong place; reporting it as a stray character is the right behaviour.
4.4.1 Line termination
Of the eleven Pattern_White_Space code
points, seven are line-terminator-like and end a line of
source text:
U+000A LF U+000B VT U+000C FF U+000D CR
U+0085 NEL U+2028 LINE SEPARATOR U+2029 PARAGRAPH SEPARATOR
Conformant implementations should recognise this same set wherever a line-ending matters in source text:
- `%` line comments terminate on any of the seven (so a comment that contains `U+0085` ends at the NEL, not at the next ASCII LF).
- The source-position line counter increments on each.
- Backslash-newline continuation in quoted strings — the form `\<EOL><blank>*` — accepts any of the seven as the `<EOL>`. The `\<U+0085>` and `\<U+2028>` forms behave the same as `\<LF>`.
The remaining four `Pattern_White_Space` members — `U+0009` TAB, `U+0020` SPACE, and the bidi marks `U+200E` LRM and `U+200F` RLM — are layout but not line-enders.
User code can ask for either set via
char_type/2 / code_type/2:
- `prolog_layout` — the eleven `Pattern_White_Space` code points (the parser's notion of layout).
- `prolog_end_of_line` — the seven line-terminator-like members of that set.
- `end_of_line` — kept at the original ISO/POSIX definition, the four ASCII control codes LF, VT, FF, CR. Code that needs the wider set should use `prolog_end_of_line`.
The prolog_* prefix follows the existing
convention for parser-specific predicates
(prolog_var_start,
prolog_atom_start,
prolog_identifier_continue,
prolog_symbol).
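Because both sets are small and frozen, they can simply be spelled out. The following illustrative Python sketch mirrors the `prolog_layout` / `prolog_end_of_line` classes (the function names are ours):

```python
# The eleven Pattern_White_Space code points (UAX #31, immutable).
PATTERN_WHITE_SPACE = {0x0009, 0x000A, 0x000B, 0x000C, 0x000D,
                       0x0020, 0x0085, 0x200E, 0x200F, 0x2028, 0x2029}

# The seven line-terminator-like members of that set.
PROLOG_END_OF_LINE = {0x000A, 0x000B, 0x000C, 0x000D,
                      0x0085, 0x2028, 0x2029}

def is_layout(ch):        # models char_type(C, prolog_layout)
    return ord(ch) in PATTERN_WHITE_SPACE

def ends_line(ch):        # models char_type(C, prolog_end_of_line)
    return ord(ch) in PROLOG_END_OF_LINE

assert is_layout('\u0085') and ends_line('\u0085')   # NEL ends a line
assert is_layout(' ') and not ends_line(' ')         # SPACE is layout only
assert not is_layout('\u00a0')                       # NBSP: stray, not layout
```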
4.4.2 Stray characters in source text
In token-start position — i.e. wherever layout is
allowed — any code point that is not in
one of the recognised syntax classes (layout, decimal,
identifier-start, identifier-continue, solo, bracket open,
quote open) raises
syntax_error(illegal_character). Concretely,
the following are stray:
- C0 / C1 controls, surrogates, and unassigned code points (general category `Cc`/`Cs`/`Cn`).
- Noncharacter code points (`U+FDD0..U+FDEF`, `U+FFFE`, `U+FFFF`, and the analogous endpoints in higher planes).
- The `Zs` separator characters that are not in `Pattern_White_Space`: NBSP `U+00A0`, OGHAM SPACE MARK `U+1680`, the various typographic spaces `U+2000..U+200A`, NARROW NO-BREAK SPACE `U+202F`, MEDIUM MATHEMATICAL SPACE `U+205F`, IDEOGRAPHIC SPACE `U+3000`, etc. (The `Zl`/`Zp` separators `U+2028`/`U+2029` are already in `Pattern_White_Space`.)
- `Cf` format characters not in `Pattern_White_Space` and not in `Other_ID_Continue`: SOFT HYPHEN `U+00AD`, ZERO WIDTH SPACE `U+200B`, the variation selectors, ...
- Enclosing combining marks (`Me`).
- `No` "other-number" code points outside the explicit super- / subscript-digit profile (vulgar fractions, Roman numerals, circled / parenthesised digits, ...).
Combining marks (`Mn` non-spacing, `Mc` spacing) are likewise stray at token-start position — they cannot start a token — but are in `XID_Continue`, so they absorb into a preceding identifier.
4.4.3 Inside quoted material
Inside single-quoted atoms, double-quoted strings,
back-quoted text, the new Unicode quote pairs (§4.3.2),
% line comments, and /* ... */
block comments, any Unicode scalar value
is accepted verbatim. The surrogate range
U+D800..U+DFFF is not part of the scalar set
and is never reachable: well-formed UTF-8 cannot encode a
surrogate (RFC 3629 §3) and implementations
must treat any byte sequence that decodes
to one as malformed input — replaced by
U+FFFD REPLACEMENT CHARACTER — rather than
silently propagating an invalid code point. This matches
every comparable language — Python, Rust, Swift, Go,
JavaScript, Java, C++23, Haskell, OCaml, ... — and keeps
atom_codes/2 symmetric with the source-text
reader: any list of scalars that
atom_codes(A, Codes) will accept must
round-trip through
writeq(A) ⇒ read_term/2.
The escape sequences \uXXXX and
\UXXXXXXXX (§4.5) exist for portability and
explicit clarity, never as the only way to embed a code
point. The single exception to byte-faithful acceptance is
the bidirectional override / isolate range
(U+202A..U+202E and
U+2066..U+2069), which is rejected as a
Trojan-source defense (CVE-2021-42574). The writer is
responsible for force-quoting atoms whose content includes
zero-width or otherwise visually unstable code points
(see §4.11) so that writeq round-trips
faithfully.
4.5 Escape sequences
- `\uXXXX` — exactly four hexadecimal digits, denoting a Unicode scalar value in the BMP.
- `\UXXXXXXXX` — exactly eight hexadecimal digits, denoting a Unicode scalar value up to `U+10FFFF`.
These are the de facto convention across Unicode-aware languages (C, C++, Rust, Python, JSON, Java, ...). Most Prolog implementations already accept them; this proposal merely standardises the form.
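A sketch of the two forms (illustrative Python; the regex and helper name are ours), including the §4.6 requirement that an escape may never denote a surrogate:

```python
import re

ESCAPE = re.compile(r'\\u([0-9A-Fa-f]{4})|\\U([0-9A-Fa-f]{8})')

def expand_escapes(s):
    """Expand \\uXXXX / \\UXXXXXXXX escapes; reject non-scalar values."""
    def repl(m):
        cp = int(m.group(1) or m.group(2), 16)
        if 0xD800 <= cp <= 0xDFFF or cp > 0x10FFFF:
            raise ValueError(f'not a Unicode scalar value: {cp:#x}')
        return chr(cp)
    return ESCAPE.sub(repl, s)

assert expand_escapes(r'\u0041\U0001F600') == 'A\U0001F600'
```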
4.6 atom_codes/2
The list argument of atom_codes/2 (and
analogously string_codes/2,
char_code/2, atom_chars/2) is a
list of Unicode scalar values, that is,
integers in 0..0x10FFFF excluding
0xD800..0xDFFF. UTF-16 surrogate halves never
appear; a single Unicode scalar is one list element
regardless of UTF-8/UTF-16 encoding details. This rules
out implementations that expose UTF-8 byte sequences via
this predicate, which leaks an encoding choice into the
language.
The same scalar-value rule applies symmetrically to every API that moves character data into or out of Prolog. Implementations must:
- Reject surrogate inputs. Predicates that take a character code (`char_code/2`, `atom_codes/2`, `atom_chars/2`, `string_codes/2`, `string_chars/2`, `put_code/[1,2]`, `put_char/[1,2]`, the `~c` format directive, ...) must raise `type_error(character_code, Code)` for any integer in `0xD800..0xDFFF` (and likewise for integers outside `0..0x10FFFF`). Symmetric: if `atom_codes(A, Cs)` would reject `Cs`, then `writeq(A) ⇒ read_term` must not be able to produce it either.
- Reject surrogate outputs. Stream output of a surrogate scalar (e.g. `put_code(0xD800)` on a UTF-8 stream) must raise a representation error rather than emit malformed bytes (the UTF-8 three-byte sequence `ED A0 80` is forbidden by RFC 3629).
- Sanitise surrogate-encoding inputs. UTF-8 and UTF-16 byte-level decoders (file streams, `string_bytes/3`, command-line argv, filenames, ...) must treat surrogate-encoding sequences as malformed and substitute `U+FFFD`. A surrogate scalar must never reach atom storage even if the underlying bytes encode one.
Together these rules give a single invariant: no Prolog-visible character ever has a surrogate scalar value, independent of how the underlying text reached the runtime.
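The invariant is easy to state as a validation predicate. An illustrative Python sketch (the helper name is ours):

```python
def check_codes(codes):
    """Enforce the invariant: every element is a Unicode scalar value,
    i.e. an integer in 0..0x10FFFF excluding 0xD800..0xDFFF."""
    for c in codes:
        if not (0 <= c <= 0x10FFFF) or 0xD800 <= c <= 0xDFFF:
            raise TypeError(f'type_error(character_code, {c:#x})')
    return codes

assert check_codes([0x41, 0x1F600]) == [0x41, 0x1F600]
ok = True
try:
    check_codes([0xD800])      # surrogate half: never Prolog-visible
    ok = False
except TypeError:
    pass
assert ok
```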
4.7 Numbers
Two layers, with different rules:
- Source text (`read_term/[2,3]`). Numeric literals use ASCII digits `0..9` only. Non-ASCII `Nd` code points may appear inside identifiers (where they are `id_continue`) but cannot start a number or extend an ASCII number. This keeps the source-level grammar single-script and matches the UAX #31 R6 recommendation for programming-language number syntax.
- Runtime conversion (`atom_number/2`, `number_codes/2`, `number_chars/2`, `number_string/2`). These accept decimal digits from any Unicode `Nd` block (Devanagari, Eastern Arabic, Fullwidth, ...). Two constraints apply:
  - All digits in a single number must come from the same block. That includes the integer part and the fractional part of a float, the mantissa and exponent of a float, and the numerator and denominator of a rational. `१२३.४५` parses; `१२३.45` does not.
  - The non-digit syntax characters are always ASCII. The sign (`+`, `-`), the rational separator (`r` or `/`), the floating-point decimal point (`.`), and the exponent letter (`e` or `E`) must use their ASCII forms; the look-alikes `−` (U+2212), `．` (U+FF0E), `Ｅ` (U+FF25) are rejected.
Examples:
?- number_string(N, "+१२३"). N = 123.
?- number_string(N, "−१२३"). false. % U+2212 minus
?- number_string(F, "१२३.४५"). F = 123.45.
?- number_string(F, "१२३e५"). F = 1.23e7.
?- number_string(R, "१२३r४५"). R = 41 rdiv 15.
?- number_string(N, "1२"). false. % mixed blocks
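The same-block and ASCII-syntax rules behind these examples can be sketched for the integer case. This is illustrative Python, not the implementation; `unicodedata.decimal` returns a code point's `Nd` digit value, and subtracting it recovers the block's zero.

```python
import unicodedata

def parse_decimal(s):
    """Simplified integer-only model of the runtime conversion rules:
    digits from any one Nd block, sign strictly ASCII."""
    sign, body = 1, s
    if body and body[0] in '+-':           # ASCII sign only; U+2212 rejected
        sign = -1 if body[0] == '-' else 1
        body = body[1:]
    if not body:
        raise ValueError('no digits')
    zero = None                            # block fixed by the first digit
    value = 0
    for ch in body:
        d = unicodedata.decimal(ch, None)  # None for non-Nd code points
        if d is None:
            raise ValueError(f'not a decimal digit: {ch!r}')
        block_zero = ord(ch) - d           # code point of that block's zero
        if zero is None:
            zero = block_zero
        elif block_zero != zero:
            raise ValueError('mixed digit blocks')
        value = value * 10 + d
    return sign * value

assert parse_decimal('+१२३') == 123        # Devanagari digits
assert parse_decimal('۴۲') == 42           # Eastern Arabic digits
mixed = False
try:
    parse_decimal('1२')                    # mixed ASCII / Devanagari
except ValueError:
    mixed = True
assert mixed
```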
The character-code form 0'<C> is
unaffected by the same-block rule: it produces the integer
code point of any single Unicode scalar
<C>. In source text the term reader
interprets the usual escape sequences (0'\n,
0'·, 0'\U0001F600); the runtime
conversion family treats 0'<C> as the
literal next code point without escape interpretation.
Combining marks and other multi-code-point sequences are
not representable as a single integer.
0x, 0o and 0b
radix literals stay ASCII-only; the radix sigil is ASCII
and the digits inside follow.
4.8 Graphemes
A grapheme cluster (UAX #29) is a
user-perceived character — a base letter plus its
combining marks, an emoji ZWJ sequence, a
regional-indicator pair, and so on. It can span multiple
code points and need not have a single-code-point
precomposed form. Prolog should provide a standard library
predicate atom_graphemes/2 (and its
string_graphemes/2 analogue) that relates an
atom to a list of single-grapheme atoms. This is what user
code wants when iterating over "characters" for display or
editing purposes; iterating over code points is rarely
what is intended.
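Python's standard library has no UAX #29 segmenter, but a deliberately simplified sketch conveys the idea. This toy version handles only combining marks and ZWJ joins, not the full grapheme-cluster-break rules; a real implementation would use the UAX #29 property data (e.g. via utf8proc).

```python
import unicodedata

def graphemes(s):
    """Very simplified UAX #29 sketch: attach Mn/Mc/Me marks and ZWJ joins
    to the preceding base character."""
    out = []
    for ch in s:
        joins = out and (unicodedata.category(ch) in ('Mn', 'Mc', 'Me')
                         or ch == '\u200d'             # ZWJ itself joins
                         or out[-1].endswith('\u200d'))  # ...and so does what follows it
        if joins:
            out[-1] += ch
        else:
            out.append(ch)
    return out

assert graphemes('e\u0301a') == ['e\u0301', 'a']  # e + combining acute = 1 cluster
assert len(graphemes('👩\u200d💻')) == 1           # emoji ZWJ sequence
```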
4.9 Stream positions and format/2 columns
For text streams, line_position/2 and the
position field of
stream_property/2 should be measured in
display columns, not bytes or code
points. Combining marks contribute zero columns; CJK and
emoji "wide" characters contribute two. The same
definition governs alignment in format/2
directives ~t, ~|, and
~+, and is exposed to the C API as
PL_wcwidth(int code).
The width of each code point is derived from
Unicode property data —
EastAsianWidth.txt (UAX #11) plus the general
category — at table-build time and stored alongside the
syntax classifier in the same per-code-point byte. This is
locale-independent and identical on every supported
platform; the Unicode version that drove the table is
reported by the read-only flag
unicode_syntax_version (§4.10).
4.9.1 Relation to POSIX wcwidth(3)
The return-value contract is the same as
POSIX
wcwidth(3)/<wchar.h>: −1
for non-printable, 0 for combining / zero-width, 1 for
normal, 2 for wide. Code that already uses
wcwidth against the POSIX convention drops in
unchanged.
The differences are deliberate:
- Locale-independent. POSIX `wcwidth` is permitted to depend on `LC_CTYPE`; in the `C`/`POSIX` locale on glibc, for instance, `wcwidth` returns −1 for every non-ASCII code point. `PL_wcwidth(c)` is a pure function of `c`.
- Cross-platform identical. Every supported platform — Linux, macOS, Windows, WASM — gives the same answer. Notably, `wcwidth` does not exist at all on standard Windows; SWI-Prolog used to fall back to Markus Kuhn's `mk_wcwidth.c`, and that reference table is now superseded by the per-code-point table emitted from current Unicode data.
- Versioned. The width data is pinned to the Unicode release reported by `unicode_syntax_version`; querying the flag answers "which Unicode are these widths from?". POSIX `wcwidth` exposes no equivalent.
- Full 32-bit code points. The argument is `int`, so non-BMP characters (emoji, supplementary planes) work on Windows, where `wchar_t` is 16-bit and a naive cast silently truncates.
- East Asian Ambiguous → 1. UAX #11 `A` (Ambiguous) is rendered as one column. POSIX leaves this implementation-defined; some glibc CJK locales return 2. SWI-Prolog follows the Western convention to keep alignment portable across locales — Kuhn's reference does the same.
- Cc/Cn/Cs. Control characters (`Cc`, including DEL and the C1 range) → −1. Unassigned (`Cn`) and surrogate (`Cs`) code points get the Unicode default per UAX #11 (1, or 2 in the CJK Ideograph default-W blocks). POSIX `wcwidth` is implementation-defined for these.
- No `mk_wcwidth_cjk` variant. Kuhn's `_cjk` variant bumps Ambiguous to 2; the SWI tree no longer carries it. Code that needs CJK-style ambiguous-as-wide rendering should layer that on top, e.g. by post-processing a `unicode_property/2` query for `east_asian_width(a)`.
In short, PL_wcwidth(c) is approximately
"POSIX wcwidth with a locale that always
tracks current Unicode and resolves Ambiguous as narrow".
The POSIX-compatible return-value convention preserves
backward compatibility for foreign code that previously
linked against mk_wcwidth.c.
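The contract can be approximated from Python's `unicodedata` (an illustrative model, not the generated table: it resolves Ambiguous as 1 per the list above, but ignores the `Cn` default-W blocks and surrogate handling):

```python
import unicodedata

def char_width(ch):
    """Approximate the PL_wcwidth contract: -1 non-printable, 0 combining
    or zero-width, 2 East Asian Wide/Fullwidth, 1 otherwise."""
    cat = unicodedata.category(ch)
    if cat in ('Cc', 'Cs'):
        return -1                              # controls, surrogates
    if cat in ('Mn', 'Me') or ch == '\u200d':
        return 0                               # combining marks, ZWJ
    if unicodedata.east_asian_width(ch) in ('W', 'F'):
        return 2                               # CJK / fullwidth
    return 1                                   # 'A' (Ambiguous) -> 1

def columns(s):
    """Display-column count of a string, as used by ~t / ~| alignment."""
    return sum(max(char_width(c), 0) for c in s)

assert char_width('A') == 1
assert char_width('漢') == 2          # CJK wide
assert char_width('\u0301') == 0      # combining acute accent
assert columns('a漢') == 3
```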
4.10 Reporting the Unicode version
Implementations should expose the Unicode version that
drives the source-syntax classifier as a read-only Prolog
flag — proposed name unicode_syntax_version,
an atom such as '17.0.0'. This lets portable
code check at runtime which character set the parser was
built against. Implementations that bundle a separate
Unicode data set for normalisation, grapheme segmentation,
and property queries (e.g. via utf8proc)
should document that version separately through their
library API; cross-reference between the two.
4.11 Pluggable Unicode normalisation in reader and writer
NFC normalisation is genuinely useful for the term reader and writer, so the proposal recommends two opt-in mechanisms backed by a kernel callback that an external normalisation library plugs in to. The kernel itself ships no normalisation logic; it just provides the registration point and the policy.
Kernel hook. The kernel exposes a single function-pointer callback through the public C API:
typedef int (*PL_atom_normalize_t)(unsigned char *in, size_t *len);
PL_atom_normalize_t PL_atom_normalize_hook(PL_atom_normalize_t new);

The hook normalises a UTF-8 byte sequence in place to the canonical form chosen by the implementation (NFC in our reference implementation). The result for NFC is always shorter than or equal to the input, so the caller's buffer is sufficient. The hook returns the new length through `*len`; an application that needs NFD/NFKC/NFKD continues to use the explicit library predicates.
The reference implementation registers
utf8proc_map(... | UTF8PROC_STABLE | UTF8PROC_COMPOSE)
from a wrapper around utf8proc loaded as
library(unicode). When that library is not
loaded the kernel hook is NULL.
Four-mode policy. Reader-time
atom-content handling is one of four modes:
accept, nfc, error,
or reject. The mode applies to
unquoted atoms only; quoted atoms and string
literals are byte-faithful regardless of mode.
- `accept` (default) — pass unquoted-atom bytes through verbatim.
- `nfc` — normalise to Unicode NFC before interning, so two source forms that are canonically equivalent yield the same atom. Requires the kernel normalisation hook; the implementation auto-loads `library(unicode)` if no hook is yet registered. An unavailable library propagates the underlying `existence_error(source_sink, library(unicode))`.
- `error` — raise `syntax_error(non_nfc_atom)` when an unquoted atom is not in NFC. Uses the kernel hook for an exact NFC test when `library(unicode)` is loaded; otherwise falls back to rejecting any code point with `wcwidth(c) < 1` (combining marks, zero-width and non-printable characters), which is conservative but accurate enough for source-code review and works without any Unicode-data dependency. This is the deliberate "ship a classifier, not a normaliser" path.
- `reject` — raise `syntax_error(non_ascii_atom)` when an unquoted atom contains any non-ASCII code point. Independent of `library(unicode)` entirely.
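The four modes can be modelled compactly, with Python's `unicodedata` standing in for the kernel hook. Illustrative only: `read_atom` is not a proposed predicate, and the exact-NFC test here replaces the `wcwidth`-based fallback.

```python
import unicodedata

def read_atom(text, mode='accept'):
    """Model of the four-mode unquoted-atom policy; quoted atoms and
    string literals bypass this entirely."""
    if mode == 'accept':
        return text                                    # byte-faithful
    if mode == 'nfc':
        return unicodedata.normalize('NFC', text)      # normalise, then intern
    if mode == 'error':
        if unicodedata.is_normalized('NFC', text):
            return text
        raise SyntaxError('non_nfc_atom')
    if mode == 'reject':
        if text.isascii():
            return text
        raise SyntaxError('non_ascii_atom')
    raise ValueError(f'unknown mode: {mode}')

decomposed = 'e\u0301'                 # 'é' as base + combining acute
assert read_atom(decomposed) == decomposed            # accept: verbatim
assert read_atom(decomposed, 'nfc') == '\u00e9'       # nfc: one atom for both forms
```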
Configuration surface. The mode is
settable per-call via a unicode_atoms(Mode)
option to read_term/[2,3],
read_clause/[2,3], and open/4;
per-stream via the unicode_atoms property of
set_stream/2 and
stream_property/2; and globally via the
read/write Prolog flag unicode_atoms.
Per-call wins over per-stream wins over flag default. The
flag has an active setter: setting it to
nfc while no hook is registered triggers a
kernel-side load of the normalisation library,
propagating any
existence_error(source_sink, ...) raised by
use_module/1 if the library is unavailable.
The four values, their semantics, and the auto-load /
fallback behaviour are described in one place (the
unicode_atoms option of
read_term/[2,3]); the other sites
cross-reference it.
Writer NFC quoting. Under
quoted(true) (i.e. writeq/1,
portray_clause/1, etc.), the writer
additionally force-quotes atoms that contain at least one
Unicode code point with wcwidth(c) == 0
(combining marks, ZWJ, variation selectors). Such atoms
are visually surprising as bare identifiers, since the
combining mark normally attaches to the preceding base
character; quoting them makes the denormalisation visible
and ensures that writeq/read
round-trips produce comparable atoms on the receiving
side. This rule is independent of the kernel normalisation
hook — it relies only on wcwidth — so it
works in any implementation, normalisation library or
not.
The split between read-time normalisation and write-time NFC quoting is deliberate. Read-time normalisation requires a Unicode data table; quoted writing requires only a small wcwidth table. An implementation can ship the latter without committing to the former.
5. Open questions
The working group will need to decide:
- Titlecase letters (`Lt`). Variable-start (Ciao) or atom-start (this proposal)? The simplicity of "uppercase = `Lu`" weighs against the broader, but more surprising, derived `Uppercase` property.
- Math symbols (`Sm`). Solo (this proposal) or glueable? The strongest argument for glueing is mathematical Prolog (constraint languages, theorem provers) where compact composite operators are useful. The strongest argument for solo is consistency with the rest of the symbol/punctuation classes and the fact that `op/3` already provides the alternative.
- `Pattern_Syntax` boundary for solo. UAX #31 R3a recommends that syntax characters come from the immutable `Pattern_Syntax` property. `Pattern_Syntax` was frozen at Unicode 4.1 (2005) and carries a stability guarantee: members never leave, non-members never join. This proposal's solo classes (`bracket`, `quote`, and `solo` proper) are derived from live general categories (Ps/Pe, Pi/Pf, Sm/Sc/Sk/So/Pc/Pd/Po) and so grow with each Unicode release. Concretely, against Unicode 17.0:
| | Code points |
|---|---|
| `Pattern_Syntax` | 2 760 |
| This proposal: solo ∪ bracket ∪ quote | 9 473 |
| `Pattern_Syntax` ⊂ proposal? | (almost) yes, except U+2E2F |
| proposal \ `Pattern_Syntax` (would become stray) | 6 747 |
Pattern_Syntaxis essentially a subset of the proposal, so tightening would only remove code points from the syntactic surface — not reclassify anything. The 6 747 code points that would move are:- 5 973
So— most non-letter symbols, including all emoji (U+1F600 grinning face, U+1F4A9 pile of poo, U+1F4C8 chart with upwards trend, …) and post-2005 dingbats / arrows. - 513
Po— script-specific punctuation added or extended after 2005 (Armenian apostrophe, Mongolian soft hyphen, Devanagari signs, …). - 123
Sk— modifier symbols. - 59
Sc— currency. Notably every currency added since 2005: ₹ (Indian Rupee, U+20B9), ₺ (Turkish Lira, U+20BA), ₽ (Ruble, U+20BD), ₿ (Bitcoin, U+20BF), ⃀ (Som, U+20C0). - 57
Sm— math symbols added post-4.1. - 22 other — mostly
Pd/Pc(U+005F_isPcbut kept as identifier-start regardless, see §4.1).
Pros of tightening to `Pattern_Syntax`:

- Parser semantics frozen forever; future Unicode revisions never change which characters are syntactic.
- Direct compliance with UAX #31 R3a.
- Smaller, more auditable surface.
Cons of tightening:

- New currency, math, and emoji symbols added by future Unicode versions are not automatically usable as operator characters. The currency block alone shows the cost: every currency sign outside the Latin-1 range — including the Euro, which predates the freeze but, per §4.2.1, is not in `Pattern_Syntax` — would be a stray character.
- Visible printable symbols rejected as "stray" surprise users more than unknown control / format characters do.
- Breaks the rule "anything Unicode classifies as symbol or punctuation is solo," which is the rule users will guess from the general-category names.
Hybrid options worth considering:

- `Pattern_Syntax` ∪ post-4.1 `Sc`: keep currency forward-compatible (the most likely class to appear in real source), freeze the rest.
- General categories for solo, but `Pattern_Syntax` for the stray-character diagnostic: accept any `S*`/`P*` as solo, but only flag a code point as stray if it is neither in `Pattern_Syntax` nor in an identifier / number / layout class. This sidesteps the trade-off — the parser keeps the broad "everything looks like a symbol" rule, while the diagnostic gates on the immutable property.
- Unicode-version flag: pin the live general-category snapshot to a Prolog flag (`unicode_syntax_version`, already proposed in §4.10) so an implementation can promise stability per release without permanently freezing the surface.

The current proposal is the broad-categories option, on the grounds that SWI-Prolog already pins a Unicode version (§4.10) and rejects strays explicitly (§4.4.2), so the value `Pattern_Syntax` adds — version-independent immutability — is already partly delivered through other mechanisms. The write-side concern (a canonical form whose interpretation might shift between Unicode versions) is addressed separately by the `pattern_syntax_solo` option of `write_term/2`; see §4.2.1.
- Curated Pi/Pf pairs. The standard left/right curly quotes U+2018/U+2019 (‘’) and U+201C/U+201D (“”) have `Bidi_Mirrored=No` and so are absent from `BidiMirroring.txt`; §4.3.2 adds them by hand. Other Pi/Pf pairs that share the same property — e.g. CJK-shaped quotation marks — are not currently curated. Whether and how to extend the curated list is a working-group call.
- Confusables (UTS #39). Out of scope for this proposal. Implementations may surface confusable detection through a linter or library; building it into the grammar is more aggressive than is warranted.
6. Reference implementations
- utf8proc (https://github.com/JuliaStrings/utf8proc, MIT licence) — small (~200 KB compiled, single C source plus generated tables), provides NFC/NFD/NFKC/NFKD, case folding, `XID_*` membership, general category, grapheme break, display width. Used by Julia, PostgreSQL, and SWI-Prolog (`packages/utf8proc/`).
- SWI-Prolog `Unicode` branch — this proposal's design is implemented end-to-end on the `master` branch of https://github.com/SWI-Prolog/swipl-devel.
- Ciao Prolog — `core/engine/unicode_gen/` shows the alternative, exclusive-enum approach.
7. Acknowledgements
Daniel Lundin and the utf8proc maintainers provided the C library underlying the SWI-Prolog implementation. The wider language-design literature (Python's PEP 3131, Rust RFC 2457, Julia's identifier rules, UAX #31 itself) informs the recommended defaults.
Copyright
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

