Gauche Devlog


2011/07/18

Unicode Case-mapping Support and Character Set Independence

In my local branch, I've been working on enhancing case-mapping procedures such as char-upcase and string-downcase to cover the entire character range. I expect I can merge it into the master branch soon.

As of 0.9.1, Gauche's case-mapping procedures handle only the ASCII range and pass through any characters beyond it. But R6RS and R7RS (draft) both define those case-mapping procedures in terms of the Unicode definitions, so it's about time to modernize Gauche's character handling as well.
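
For instance, with the enhanced procedures, non-ASCII characters should be mapped according to the Unicode definitions. Here's a hypothetical session illustrating the intended behavior (the results follow the Unicode simple case mappings):

gosh> (char-upcase #\λ)
#\Λ
gosh> (char-downcase #\Ж)
#\ж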

It isn't much trouble to generate tables from the Unicode character database, although there's room to optimize them for speed and/or space (I'll post another entry to explain the strategy I took).
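
Just to give an idea of what the generation involves, here's a simplified sketch, not the actual generator (the real one emits C tables in a much more compact layout): it pulls the simple uppercase mappings out of UnicodeData.txt, whose field 12 holds the simple uppercase mapping of each record.

;; Simplified sketch: dump simple uppercase mappings from UnicodeData.txt
;; as (codepoint . upcased-codepoint) pairs.
(define (dump-simple-upcase-map path)
  (call-with-input-file path
    (lambda (in)
      (let loop ((line (read-line in)))
        (unless (eof-object? line)
          (let* ((fields (string-split line #\;))
                 (code   (string->number (car fields) 16))
                 (up     (and (> (length fields) 12)
                              (string->number (list-ref fields 12) 16))))
            (when up (write (cons code up)) (newline)))
          (loop (read-line in)))))))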

The issues

The trouble is that Gauche is designed to be as independent of the internal character encoding scheme as possible. While the internal encoding needs to be chosen at compile time, the code that depends on a specific encoding is small and mostly confined to the gauche/char_*.h files.

Case-mapping adds a nontrivial amount of code; it's not just a simple table of character-to-character mappings. In string case-mapping (e.g. string-downcase) each character's mapping is context-dependent, and to map the cases properly we need to find word boundaries. That means we have to implement text segmentation, which is defined in UAX #29. That, in turn, depends on the character property database.

(The complication of this issue may be illustrated by this:

gosh> (string-downcase "ΧΑΟΣΧΑΟΣ.ΧΑΟΣ. Σ.")
"χαοσχαοσ.χαος. σ."

According to UAX #29 and R6RS, only the third sigma should take the final form.)

If I want to keep the encoding-dependent code separate, I need to prepare those tables for each supported encoding and choose one at compile time according to the configuration option for the internal encoding. The tables can be generated mechanically, but having more code that is only selectively used depending on compiler options means more testing and maintenance headaches. I don't like it.

Furthermore, full support of Unicode-defined algorithms such as text segmentation and normalization is useful even if Gauche's internal encoding is not utf-8. For example, if Gauche's internal encoding is euc-jp, only a subset of Unicode characters is available for Scheme characters and strings, but such a program can still apply the Unicode algorithms to a sequence of Unicode codepoints (e.g. a vector of integers). If I implement the algorithms anyway, there's not much point in limiting access to them just because the characters aren't encoded in Unicode.
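
A minimal sketch of that idea, with made-up procedure names for illustration (char->ucs and ucs->char are the existing conversions between the internal encoding and Unicode):

;; Algorithms can run on plain Unicode codepoints, independent of the
;; internal encoding; unicode-algorithm stands for any Unicode-defined
;; operation (case mapping, segmentation, ...).
(define (string->codepoints s)
  (map char->ucs (string->list s)))
(define (codepoints->string cps)
  (list->string (map ucs->char cps)))
(define (apply-on-codepoints unicode-algorithm s)
  (codepoints->string (unicode-algorithm (string->codepoints s))))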

So, there are two issues regarding Gauche's character set independence and Unicode algorithms:

  • How much of the code should be separated out as the 'encoding-dependent' part? If I make it completely separate, it's architecturally clean, but it may become a maintenance burden.
  • Should I make the algorithms themselves available regardless of Gauche's internal encoding? There's no reason not to, but doing so means I need to implement the algorithms independently of Gauche's internal character representation.

The compromise

I haven't found a clear-cut solution. So far my implementation has become a hybrid; I lost the clean separation of the encoding-dependent part. I had to add #ifdefs here and there to handle special cases for specific encodings, mostly because of optimization: the efficient table configuration differs depending on the encoding, so the lookup code needs to differ as well.

I also chose to generate some tables only for Unicode codepoints. If the internal encoding differs from Unicode, I convert the internal character code to a Unicode codepoint to look up the tables. It penalizes the EUC-JP and Shift_JIS encodings, but it has the merit of a single table and (mostly the same) lookup code. I can also use the tables for procedures that work on Unicode codepoints rather than on Scheme characters, so that the algorithms can be used regardless of Gauche's internal encoding.

So far, the character general category table is the only table I generate for each internal encoding. The parser (read) needs to look it up, since symbol constituent characters are defined in terms of character general categories (at least in R6RS), and I don't want the symbol reader to convert every character just to determine a token boundary.
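
To see why the reader wants a table keyed directly by the internal code, here's a rough sketch of the per-character check it has to make (general-category-of is a hypothetical lookup into the per-encoding table, not the actual reader internals; the category list is the one R6RS uses for identifier constituents):

;; Rough sketch; general-category-of is hypothetical, keyed by the
;; internal character code so no per-character conversion is needed.
(define (symbol-constituent? ch)
  (memq (general-category-of (char->integer ch))
        '(Lu Ll Lt Lm Lo Mn Nl No Pd Pc Po Sc Sm Sk So Co)))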

For the rest of the lookup tables, including case mapping, I only generate tables keyed by Unicode codepoints. If the internal encoding is EUC-JP or Shift_JIS, the case-mapping functions do the conversion. The case conversion table has a structure that depends heavily on how characters are distributed in Unicode, so regenerating it for a different encoding would take some work. If the performance becomes an issue I'll consider using separate tables.

(NB: The ASCII range is the same in all the supported encodings, so I take a shortcut if the given character is in that range; the encoding conversion only occurs when the given character is outside of it. I might further improve performance by excluding ranges of characters known to have no case from the table lookup, since case conversion is the identity for such characters.)
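
Putting it together, the lookup path looks roughly like this (expressed in Scheme for illustration; the actual code is in C, and *simple-downcase-table* stands in for the generated codepoint-keyed table):

;; ASCII fast path, otherwise convert the internal code to a Unicode
;; codepoint and consult the codepoint-keyed table.
(define (simple-downcase ch)
  (let ((code (char->integer ch)))
    (cond ((< code #x80)                     ; ASCII is identical in all
           (if (<= #x41 code #x5a)           ; supported encodings
               (integer->char (+ code #x20))
               ch))
          (else
           (let* ((ucs (char->ucs ch))       ; identity when the internal
                  (lo  (and ucs              ; encoding is already Unicode
                            (hash-table-get *simple-downcase-table* ucs #f))))
             (if lo (ucs->char lo) ch))))))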

CSI---is it still worth it?

It would have been a lot simpler if I had fixed the internal encoding to utf-8 (or another Unicode encoding): a single set of tables, and no separate APIs for characters and codepoints.

Unicode is not just a coded character set (CCS). It comes with a set of definitions of higher-level operations such as case conversion, text segmentation, normalization, etc., defined on top of the CCS. More and more applications assume these higher-level operations are available.

Originally I decided to support a choice of internal encodings because I wanted to keep options open; that is, I didn't want to be bound to one specific internal encoding. It was just a hunch that there might be areas where alternative encodings would be beneficial. In fact it was useful back in 2001, when almost all the files I had, and my Emacs shell buffer, were in EUC-JP, and I got headaches whenever one of the tools in the toolchain didn't handle Unicode well and wreaked havoc.

These days most tools just work with Unicode. The imagined benefit margin of alternative encodings gets thinner, and if I penalize them with slower higher-level character and string operations, whatever margin remains would go away---it'd be simpler and faster just to convert everything to Unicode internally.

Well, many pro-Unicode people would say "I told you so." Yeah, I admit it. If I keep supporting the alternative encodings, at some point I'll need to reorganize the code so that it lets me implement the various Unicode algorithms in a codeset-independent way, probably by translating the Unicode character database fully into the target encoding, or something like that. For the moment I'll go with the current half-baked solution, since it satisfies the immediate needs. Let's see how it works out in actual applications.

So far almost all language implementations seem to have adopted Unicode. A notable exception is CRuby, which allows string objects with different encodings to coexist at runtime. I wonder how the CRuby developers will support, say, Unicode text segmentation, normalization, bidi, etc. on their multi-character-set strings.

Tags: text.unicode, Unicode, r6rs, r7rs, encoding
