This is a little excercise in writing finite-state transducers. The goal is to convert spelled-out German cardinal numbers from 1 to 9999 to numerals and vice versa. I will construct the transducer by going from written-out numbers to numerals, e.g., from zweiunddreißig to 32. The implementation is done with the finite-state toolkit foma.
In a first step, we convert each written-out number into an intermediate representation by transducing compound parts to numerals in linear order, for example, neuntausend-zweihundert-acht-und-vierzig (hyphens added for better readability) should become 9000+200+8&40
.
Let's start with two-digit numbers.
TwoToNine = zwei:2 | drei:3 | vier:4 | fünf:5 |
sechs:6 | sieben:7 | acht:8 | neun:9 ;
Teens = zehn: [1 %0] | elf: [1 1] |
zwölf: [1 2] | dreizehn: [1 3] |
vierzehn: [1 4] | fünfzehn: [1 5] |
sechzehn: [1 6] | siebzehn: [1 7] |
achtzehn: [1 8] | neunzehn: [1 9] ;
TwentyToNinety = zwanzig:[2 %0] | dreißig:[3 %0] |
vierzig:[4 %0] | fünfzig:[5 %0] |
sechzig:[6 %0] | siebzig:[7 %0] |
achtzig:[8 %0] | neunzig:[9 %0] ;
TwoDigitNumber = eins:1 | TwoToNine | Teens | TwentyToNinety |
[ein:1 | TwoToNine] und:%& TwentyToNinety;
The digit 1 is special, since it sometimes appears as ein and sometimes eins, that's why it's not part of the first transducer. Some Teens
could be decomposed further, but this would only complicate things.
Because 0
has special semantics in foma regex, we need to make sure to escape 0
literals with the percentage symbol.
The ampersand that replaces und indicates that the numbers around it need to be swapped later.
For larger numbers, we need hundreds and thousands, so we define two more transducers.
Hundred = [ein:1 | TwoToNine] hundert:[%0 %0 %+];
Thousand = [ein:1 | TwoToNine] tausend:[%0 %0 %0 %+];
Now we can put everything together.
For the sake of simplicity, we ignore some variants: we recognize einhundert and eintausend, but not hundert and tausend; we recognize einhundertfünf, but not einhundertundfünf.
These variants can be added easily, but the expressions become a bit messier.
Intermediate = (Thousand) (Hundred) (TwoDigitNumber);
If we check the lower-words
1 of this transducer, we see the following:
foma[1]: lower-words
1000+
1000+100+
1000+100+1&20
...
Now we need to transform these intermediate forms into the desired output.
We'll move from right to left.
In the end there may be a trailing plus (introduced by Hundred
or Thousand
).
We don't need it, so we add a cleanup rule.
CleanPlus = %+ -> 0 || _ .#.;
Then we remove any trailing zero after an ampersand and swap the digits around the ampersand.
Trunc = %0 -> 0 || %& ? _ .#.;
def Ins(X) [..] -> X || X %& ? _ .#.;
Copy = Ins(1) .o. Ins(2) .o. Ins(3) .o.
Ins(4) .o. Ins(5) .o. Ins(6) .o.
Ins(7) .o. Ins(8) .o. Ins(9) ;
Delete = ? %& -> 0 ;
Swap = Trunc .o. Copy .o. Delete;
Maybe there is some more elegant syntax to do this, but I define a helper template that inserts a digit in the context of itself, the ampersand and some other digit. By applying this template for each non-zero digit, we copy the digit before the ampersand to the end of the string. Finally we delete the original digit and the ampersand, completing the swap.
Now we can move further to the left and add our two digit number to the hundreds.
N = [%0|1|2|3|4|5|6|7|8|9];
FixHundred = %0 %+ -> 0 || %0 _ N .#. ,,
%0 %0 %+ -> 0 || _ N N .#. ;
We simply delete as many zeros as there are digits after the plus symbol.
The same is repeated with the thousands to the left.
FixThousand = %0 %+ -> 0 || %0 %0 _ N .#. ,,
%0 %0 %+ -> 0 || %0 _ N N .#. ,,
%0 %0 %0 %+ -> 0 || _ N N N .#. ;
We compose the pieces to obtain our transducer for cardinal numbers.
Card = Intermediate .o. CleanPlus .o. Swap .o.
FixHundred .o. FixThousand;
The final transducer has 10,000 paths: 9,999 numbers and the empty string. FSTs are closed under inversion (input and output labels are simply switched), so we can apply our cardinal number FST in both directions, as generator and parser.
foma[0]: regex Card;
4.8 kB. 37 states, 212 arcs, 10000 paths.
foma[1]: up
apply up> 2345
zweitausenddreihundertfünfundvierzig
foma[1]: down
apply down> fünftausenddreiundzwanzig
5023
1 Words of the lower or second projection of Intermediate
, that is, words o for which an input i exists such that Intermediate
maps i to o.
This page was moved from my old website.