Gauche Devlog

2011/07/18

Unicode Case-mapping Support and Character Set Independence

In my local branch, I've been working on enhancing case-mapping procedures such as char-upcase and string-downcase to cover the entire character range. I expect I can merge it into the master branch soon.

As of 0.9.1, Gauche's case-mapping procedures only handle the ASCII range, and pass through any characters beyond it. But R6RS and R7RS (draft) both define those case-mapping procedures in terms of Unicode, so it's about time to modernize Gauche's character handling as well.
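For example, once the new support is in, case mapping beyond ASCII follows Unicode's simple case mapping (a hypothetical session; #\α is U+03B1):

gosh> (char-upcase #\α)
#\Α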

It isn't much trouble to generate tables from the Unicode character database, although there's room to optimize them for speed and/or space (I'll post another entry to explain the strategy I took).

The issues

The trouble is that Gauche is designed to be as independent as possible of the internal character encoding scheme. While the internal encoding must be chosen at compile time, the code that depends on a specific encoding is small and mostly confined to the gauche/char_*.h files.

Case mapping adds a nontrivial amount of code; it's not just a simple table of character-to-character mappings. In string case mapping (e.g. string-downcase), each character's mapping is context-dependent, and to map the cases properly we need to find word boundaries. That means we have to implement text segmentation, which is defined in UAX #29. That, in turn, depends on the character property database.

(The complication of this issue may be illustrated by this:

gosh> (string-downcase "ΧΑΟΣΧΑΟΣ.ΧΑΟΣ. Σ.")
"χαοσχαοσ.χαος. σ."

According to UAX#29 and R6RS, only the third sigma should take the final form.)

If I want to keep encoding-dependent code separate, I need to prepare those tables for each supported encoding and choose one at compile time according to the configured internal encoding. The tables can be generated mechanically, but having more code that is used selectively depending on compile-time options means more testing and maintenance headaches. I don't like it.

Furthermore, full support of Unicode-defined algorithms such as text segmentation and normalization is useful even if Gauche's internal encoding is not utf-8. For example, if Gauche's internal encoding is euc-jp, only a subset of Unicode characters is available for Scheme characters and strings, but such a program can still apply Unicode algorithms to a sequence of Unicode codepoints (e.g. a vector of integers). If I implement the algorithms anyway, there's not much point in limiting access to them just because the characters aren't encoded in Unicode.
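As a toy illustration (not Gauche's actual API), here is an "algorithm" written against a list of integer codepoints; it never touches native Scheme characters, so it would work even when some of the codepoints have no counterpart in the internal encoding:

  ;; Downcase ASCII letters in a list of Unicode codepoints.
  ;; Real Unicode case mapping is far more involved; the point is only
  ;; that the code operates on integers, not on Scheme characters.
  (define (downcase-ascii-codepoints cps)
    (map (^[cp] (if (<= #x41 cp #x5A) (+ cp #x20) cp)) cps))

  (downcase-ascii-codepoints '(#x48 #x49 #x3A9)) ; => (104 105 937)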

So, there are two issues regarding Gauche's character set independence and Unicode algorithms:

  • How much of the code should be separated out as the 'encoding-dependent' part? Complete separation is architecturally clean, but may become a maintenance burden.
  • Should I make the algorithms themselves available regardless of Gauche's internal encoding? There's no reason not to, but doing so means implementing the algorithms independently of Gauche's internal character representation.

The compromise

I haven't found a clear-cut solution. So far my implementation has become a hybrid; I lost the clean separation from the internal character set. I had to add #ifdefs here and there to handle special cases for specific encodings, mostly for the sake of optimization: the efficient table configuration differs depending on the encoding, so the lookup code needs to change as well.

I also chose to generate some tables only for Unicode codepoints. If the internal encoding differs from Unicode, I convert the internal character code to a Unicode codepoint to look up the tables. It penalizes the EUC-JP and Shift_JIS encodings, but it has the merit of having one table and (mostly the same) lookup code. I can also use the table for procedures that work on Unicode codepoints rather than on Scheme characters, so the algorithms can be used regardless of Gauche's internal encoding.

So far, the character general category table is the only table I generate for each internal encoding. The parser (read) needs to look it up, since symbol-constituent characters are defined in terms of character general categories (at least in R6RS), and I don't want the symbol reader to convert each character just to determine a token boundary.

For the rest of the lookup tables, including case mapping, I only generate tables indexed by Unicode codepoints. If the internal encoding is EUC-JP or Shift_JIS, the case-mapping functions do the conversion. The case conversion table has a structure that depends heavily on the Unicode character distribution, so it would take some work to regenerate it for different encodings. If performance becomes an issue I'll consider using separate tables.

(NB: The ASCII range is the same in all the supported encodings, so I shortcut when the given character is in that range; the encoding conversion only occurs for characters outside of it. I might further improve performance by also skipping the table lookup for ranges of characters known to have no case, since case conversion is the identity for them.)
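In code, the strategy is roughly as follows (a minimal sketch in Scheme; the actual implementation is in C, and upcase-table-ref is a hypothetical accessor for the generated Unicode-indexed table):

  (define (generic-upcase c)
    (let1 i (char->integer c)
      (cond [(< i 128)                    ; ASCII: identical in all encodings,
             (if (<= 97 i 122)            ; so shortcut without any conversion
               (integer->char (- i 32))
               c)]
            [else
             (let* ([cp (char->ucs c)]    ; internal code -> Unicode codepoint
                    [up (and cp (upcase-table-ref cp))])
               (or (and up (ucs->char up)) ; Unicode codepoint -> internal code
                   c))])))                 ; no mapping: pass through unchanged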

CSI---is it still worth it?

It would have been a lot simpler if I had fixed the internal encoding to utf-8 (or another Unicode encoding): a single set of tables, and no separate APIs for characters and codepoints.

Unicode is not just a coded character set (CCS). It comes with definitions of higher-level operations---case conversion, text segmentation, normalization, etc.---built on top of the CCS. More and more applications assume these higher-level operations are available.

Originally I decided to support a choice of internal encodings because I wanted to keep my options open; that is, I didn't want to be bound to one specific internal encoding. It was just a hunch that there might be areas where alternative encodings would be beneficial. And in fact it was useful, back in 2001, when almost all my files, and my Emacs shell buffer, were in EUC-JP, and I got headaches whenever one tool in the toolchain mishandled Unicode and wreaked havoc.

These days most tools just work with Unicode. That imagined benefit margin of alternative encodings is getting thinner, and if I also penalize them with slower higher-level character and string operations, whatever margin remains will go away---it'd be simpler and faster to convert everything to Unicode internally.

Well, many pro-Unicode people would say "I told you so." Yeah, I admit it. If I keep supporting the alternative encodings, at some point I'll need to reorganize the code so that I can implement the various Unicode algorithms in a codeset-independent way---probably by translating the Unicode character database fully into the target encoding, or something like that. For the moment I'll go with the current half-baked solution, for it satisfies the immediate needs. Let's see how it works out in actual applications.

So far almost all language implementations seem to have adopted Unicode. A notable exception is CRuby, which allows string objects with different encodings to coexist at runtime. I wonder how the CRuby developers will support, say, Unicode text segmentation, normalization, bidi, etc. on their multi-character-set strings.

Tags: text.unicode, Unicode, r6rs, r7rs, encoding

2011/07/11

Generic file locking---is it really about files?

Gauche provides POSIX file locking through gauche.fcntl module, but the feature isn't available on all platforms. Providing a locking API that's always available has been on my TODO list for years.

And it's finally in. The file.util module now provides with-lock-file. It can be used like this:

  (with-lock-file lock-file-name
    (lambda () ... do something exclusively ...))

It uses the classic inter-process locking that relies on the atomicity of exclusive file creation. Optionally you can rely on the atomicity of mkdir instead.
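Stripped of the retry and lock-stealing logic, the core idea is just this (a bare-bones sketch, not file.util's actual implementation):

  ;; Exclusive creation either succeeds atomically or fails,
  ;; so it doubles as a test-and-set.
  (define (try-acquire-lock path)
    (guard (e [else #f])                           ; creation failed: lock held
      (close-output-port
       (open-output-file path :if-exists :error))  ; fails if path exists
      #t))

  (define (release-lock path) (sys-unlink path))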

The function makes sure the file is unlocked if you throw an exception in the thunk; but if the application dies unexpectedly, the lock file (or directory) is left behind. If with-lock-file finds a lock file that's very old, you may allow it to steal the lock, on the assumption that the lock file's creator died without cleaning up. It takes quite a few parameters to configure how such situations are handled.

★ ★ ★

Originally I planned to provide a lock API that abstracts the underlying lock mechanism---either POSIX fcntl file locking, or lock files (or directories/symlinks/...). In fact, I attempted to write one several times. However, the more I thought about it, the less appealing the idea became. POSIX fcntl locking and lockfile-based locking differ so much that attempts to hide the difference always led nowhere.

The biggest obstacle was that hiding POSIX file locking under the hood turned out to be extremely difficult, if not impossible. That is, if a locking library uses POSIX file locks down below, the user of the library must be very aware of that fact.

That's because, to use POSIX locks robustly, you have to meet at least two conditions: (1) make sure the fcntl lock is actually supported on the particular file, and (2) make sure the file being locked won't be accessed by other libraries/threads in the same process. (The latter condition comes from the fact that a POSIX fcntl lock is per process, not per file descriptor. See, for example, A tale of two standards and On the Brokenness of File Locking for detailed discussions.)

These conditions aren't necessarily a show-stopper for applications---at the application level, you may be able to choose a lock file that's likely to satisfy them. At the library level, however, it is not trivial to check whether a given file really resides on a filesystem that supports POSIX locks, and it is probably impossible to guarantee that the file won't be accessed by other parts of the program.

There is also a conceptual difference between POSIX fcntl locks and lock-file based locking---the former locks a file, but the latter is not really about a file at all; it's a way to mutex multiple processes. Of course you can use a shared file lock as an inter-process mutex and vice versa, but conflating the two makes me uneasy.

So, with-lock-file ended up being based on lock files/directories. It's not very fast (it involves multiple syscalls and file accesses), but I expect it to cover most of the trivial cases of avoiding races between processes.

Tags: file.util, gauche.fcntl, FileSystem, InterProcessCommunication

2011/07/10

Quasiquoting

As of 0.9.1, Gauche doesn't handle quasiquote patterns like this:

gosh> ``(,,@(list 1 2))
`(,,@(list 1 2))

It's arguably a bug, or perhaps unspecified behavior in R5RS. The outer-level quasiquote should expand the unquote-splicing, and that should produce (unquote 1 2); however, the quasiquote syntax in R5RS does not allow such a pattern, so the resulting form wouldn't be a valid quasiquotation. That doesn't explain the above result, but if it's an error in R5RS terms, anything can happen. (In fact, the result wasn't intended at all; it was just the outcome of failing to recognize the pattern.)

Still, since (unquote 1 2) is allowed as a pure datum, it may be more reasonable for ``(,,@(list 1 2)) to expand into `((unquote 1 2)).

R6RS resolved this ambiguity by allowing unquote to take multiple arguments. Since that is upward compatible with R5RS, I decided to implement the R6RS behavior in Gauche. Now you get this:

gosh> ``(,,@(list 1 2))
`((unquote 1 2))

And this:

gosh> `((unquote 1 2))
(1 2)

★ ★ ★

While thinking about this, I found a difference between Scheme and CL regarding quasiquoting that I was not aware of before.

In Scheme, `a, ,a and ,@a are merely reader-level abbreviations of (quasiquote a), (unquote a) and (unquote-splicing a), respectively. The real semantics of quasiquoting are defined in terms of the latter S-expressions.

In CL, quasiquoting (backquotes) is defined at the reader level. That is, ,datum doesn't need to be an abbreviation of something like (unquote datum) or (comma datum); it doesn't need to have a list equivalent at all. The comma and comma-atmark can be parsed directly by the reader and transformed into whatever piece of code produces the desired result.

It is not obvious in simple cases, but see the following:

In Allegro:

cl-user> `'`(,,(cons 1 (cons 2 nil)))
'(excl::bq-list (1 2))

In CLisp:

[6]> `'`(,,(cons 1 (cons 2 nil)))
'(LIST (1 2))

This means the inner backquote form is interpreted rather liberally.

On the other hand, I believe that in Scheme this is the only possible result (modulo whether the printer shows `a or (quasiquote a), etc.):

gosh> `'`(,,(cons 1 (cons 2 '())))
'`(,(1 2))

A practical outcome of this difference is that, in CL, you cannot write an S-expression containing a comma without a corresponding backquote.

In Scheme, you can write a macro that implicitly inserts quasiquote:

gosh> (define-syntax implicit-qq
        (syntax-rules ()
          [(_ x) (quasiquote x)]))
#<undef>
gosh> (implicit-qq ,(+ 1 2))
3

IIUC, you can't do this in standard CL, since the expression (implicit-qq ,(+ 1 2)) raises an error in the reader, before the form ever reaches the macro expander.

Tags: quasiquote, r5rs, r6rs, CommonLisp

2011/07/09

Not like the old-school Lisp atoms

Ah, it's been a long time since the last entry. It's not that Gauche development has stalled; the thing is, I run several development threads concurrently, using local topic branches, and it's hard to pick a topic that makes a complete entry with a conclusion.

Nah, too many excuses. From now on, I'll dump here whatever thoughts and experiments I have, slowly but steadily, even without a nice conclusion.

OK, so let me pick up where I left off... the last entry was March 16. The next thing I did was commit support for atoms.

No, not the traditional Lisp atom, which means anything that's not a cons. Having atomp was once important to explain how Lisp worked, but these days, with so many different datatypes, I don't think a term for non-cons objects is important. That usage should become history.

These days, the more important atomic property is being atomic with regard to concurrent execution. To confess, I stole the term from Clojure. Though Gauche doesn't abstract concurrent threads or isolate mutable operations as much as Clojure does, I do find recurring patterns I use while writing concurrent code in Gauche, and this is one of them.

An atom here is a wrapper that bundles one or more objects that need to be updated atomically. It is available in the gauche.threads module. The following creates an atom with two vectors.

  (use gauche.threads)
  (define a (atom (vector 1) (vector 2)))
  a => #<<atom> 0x17467e0>

To access those vectors safely, you pass a procedure to atomic.

  (atomic a (lambda (v0 v1) ...do whatever with v0 and v1...))

No other thread can access/modify the atom a until atomic returns.

Gauche's builtin hashtable is not thread-safe. I was always wondering whether I should provide a separate thread-safe hashtable type, or generic support for exclusive access to arbitrary objects, like Java's synchronized mechanism. The former would explode the number of primitive datatypes (how about a thread-safe treemap? a thread-safe sparse hash? &c.). The latter is cleaner, but hard to implement efficiently (especially if you think about allowing synchronized on any pair!).

Now atoms can be used to provide exclusive access to mutable objects. Just say (atom (make-hash-table 'eq?)), and then wrap any access to the table in atomic. I expect most trivial uses of <mutex> can be replaced by atoms.
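For instance (a sketch using the API described above; the key and values are made up):

  (define table (atom (make-hash-table 'eq?)))
  (atomic table (^[ht] (hash-table-put! ht 'counter 1)))  ; exclusive write
  (atomic table (^[ht] (hash-table-get ht 'counter 0)))   ; exclusive read => 1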

Atoms don't just wrap mutable objects; an atom itself can serve as a mutable object with synchronization. You can replace its wrapped values with atomic-update!.

  (define b (atom 0))
  (atomic-update! b (cut + 1 <>))

This makes the atom b work as a counter with a mutex. If the atom wraps N values, the update procedure should be an N-in, N-out procedure (see the sketch below). To extract the values, the convenience procedure atom-ref can be used.

  (atom-ref b)
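For an atom wrapping two values, the update procedure is two-in, two-out (a sketch; it assumes, as the N-in N-out rule implies, that the values are passed and returned via values):

  (define acc (atom 0 '()))   ; a sum and a log of past sums
  (atomic-update! acc (^(sum lst) (values (+ sum 1) (cons sum lst))))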

You may notice that if you retrieve a mutable object wrapped in an atom with atom-ref, you can modify it outside of the protection. Atoms are not intended to prohibit you from doing dangerous things; they just make it easier to do things safely rather than in error-prone ways.

Finally but importantly, these atom primitives consistently take optional timeout and timeout-val arguments.

  (atomic a (^(x y) ... #t) 10.0 'timeout)

This returns #t if we access the atom successfully, but if we cannot obtain the lock within 10.0 seconds, it gives up and returns timeout. I find this indispensable for writing robust code. The semantics of timeout and timeout-val are the same as in other thread primitives.

Tags: atom, gauche.threads, Concurrency

2011/03/16

Benchmarking utilities

I just expanded the features of the gauche.time module (ref:gauche.time). Now it has convenience procedures to run benchmarks and compare the results.

gosh> (time-these/report
       '(cpu 5.0)
       `((native-map! . ,(cut map  (cut + 1 <>) input))
         (scheme-map! . ,(cut maps! (cut + 1 <>) input))
         (native-map  . ,(cut mapa (cut + 1 <>) input))
         (scheme-map  . ,(cut maps (cut + 1 <>) input))))
Benchmark: ran native-map!, scheme-map!, native-map, scheme-map, each for at least 5.0 cpu seconds.
  native-map!: 5.267 real, 5.250 cpu (5.230 user + 0.020 sys)@86.86/s n=456
  scheme-map!: 5.268 real, 5.260 cpu (5.260 user + 0.000 sys)@85.55/s n=450
   native-map: 5.348 real, 5.330 cpu (5.330 user + 0.000 sys)@63.41/s n=338
   scheme-map: 5.338 real, 5.340 cpu (5.330 user + 0.010 sys)@62.92/s n=336

              Rate native-map! scheme-map! native-map scheme-map
  native-map! 87/s          --       1.015      1.370      1.380
  scheme-map! 86/s       0.985          --      1.349      1.360
   native-map 63/s       0.730       0.741         --      1.008
   scheme-map 63/s       0.724       0.735      0.992         --
#<undef>

If you know Perl, the output will look familiar. Yes, I blatantly stole the idea from Perl's Benchmark.pm. It still lacks features such as cache control and the flexible output formatting seen in Perl, but I'm happy to have this in the standard Gauche toolbox.

The rest is a little story of how I ended up with this.

When I add libraries to Gauche, I tend to make them a set of relatively simple, solid, and mutually orthogonal components, rather than specialized, complex chunks of code each designed for a particular application.

I guess that's rooted in the Unix culture and/or functional programming, and it's not a very uncommon tendency. It is natural to expect that assembling a new tool from those simple, orthogonal components will be easy when I face a new problem. Probably as a side effect of this tendency, I tend to be overly skeptical of tools specialized for a single specific problem.

When I began thinking of enhancing gauche.time (I had been running benchmarks for optimization for some time then, and was getting bored with the tedious process), I started off with functional components that could be combined to achieve common benchmarking tasks.

It didn't take long to realize that I couldn't settle on a good interface for the intermediate layer, though. If I kept it simple, I feared there would be cases it couldn't handle and the whole suite would have to be scrapped. If I made it general enough, the interface looked too complicated for typical usage.

I stepped back and thought about what I wanted to do in the first place. Almost always, the task consists of running several pieces of code, measuring their speed, and comparing the results. Perl has a module specifically designed for that, and it does just what I want. And it seemed straightforward to copy its main features.

It could have been built from orthogonal components; e.g. creating the layout of the comparison matrix could be a separate component. But I don't understand the problem well enough to design a clean interface that would make it independent. So I hard-coded it.

I may rewrite them in the future, once I have a solid idea of how to design a generic text-table formatting API. In the meantime, it is more useful to take the shortcut and have working code than to wait for clean, neat, and composable components. Specialized tools aren't necessarily bad, if the problem is well-defined and common enough.

Here are several procedures added for benchmarking. See the git repo for the details.

  • time-this : benchmarks a single thunk; returns a <time-result> object (see the example after this list).
  • time-result+, time-result- : operations on <time-result>s.
  • time-these : benchmarks several thunks for comparison; returns a list of results.
  • report-time-result : formats the result of time-these.
  • time-these/report : combination of the above two.
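For instance, time-this by itself can be used like this (a sketch; it assumes the first argument takes the same form as in time-these/report above):

  (time-this 100000 (^[] (expt 2 100)))      ; run the thunk 100000 times
  (time-this '(cpu 5.0) (^[] (expt 2 100)))  ; or for at least 5.0 CPU seconds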

Tags: gauche.time, Perl, benchmarking

More entries ...