# -*- mode:markdown; -*-

== Compilation ==

This lexicon trimmer needs a little compilation first:

    $ ./autogen.sh
    $ make

(Don't "make install". I'm not sure what that will do.) As long as
lttoolbox is installed correctly according to
http://wiki.apertium.org/wiki/Minimal_installation_from_SVN, this
should work just fine. You may have to do

    $ PKG_CONFIG_PATH=$PREFIX/lib/pkgconfig
    $ export PKG_CONFIG_PATH
    $ ./autogen.sh
    $ make

if you installed lttoolbox in some strange $PREFIX, or are using a
Mac.

Check that compilation succeeded by doing `ls .libs`, it should show
some files named `libltpy.*`.


== Usage ==

To update the Sámi morphology, first make your private langs.cfg file
if you haven't already:

    $ cp langs.cfg.in langs.cfg
    $ edit langs.cfg

and make sure to change "SRC" to be the full path to where you checked
out the sme src directory in giellatekno, probably something like
"/home/username/giellatekno-svn/trunk/gt/sme/src/"

The "OUTPUT_DIR" should point to your apertium-sme-nob directory,
either with an absolute path, or relative to the path of the
`langs.cfg` file. Typically you'll have an INCLUDE line which
specifies a path to a file with the configurations which are common to
all users of the project.

Then, make sure ../sme-nob.autobil.bin is compiled and then you run:

    $ /usr/bin/python2.6 ./update-lexc.py --config=langs.cfg

(perhaps exchanging /path/to/python2 for /usr/bin/python2.6 or
whatever python2 version you have)


If you don't want to trim the lexicon down to the bidix, you can use
the sme-nob-notrim.langs.cfg file in your INCLUDE line (it will still
strip a lot of stuff that causes errors, but keep the lexicon large).


== Configuration format (langs.cfg) ==

The configuration files are in the
[JSON format](http://en.wikipedia.org/wiki/JSON), which consists of
ordered lists of strings like `["item1","item2"]` and unordered tables
of string-pairs like `{"name":"Joe","nose":"big"}`. Whitespace is
ignored. The following is an example configuration file:

    [
      {
        "LANG1": "sme",
        "LANG2": "nob",
        "PRODUCE_LEXC_FOR": "sme",
        "BIDIX_SIDE": "L",
        "OUTPUT_DIR": "../",
        "SRC": "/path/to/giellatekno/trunk/gt/sme/src/",
        "LEX_EXCLUDES": [
          " FIRSTCOMPOUND +;",
          " NUMERALCOMPOUNDS +;"
        ],
        "files": [
          ["sme-lex.txt",            {"no_trim": true}],
          ["propernoun-sme-lex.txt", {"pos_filter": "<N><Prop>",      "split": "LEXICON ProperNoun\n"}],
	  ["pp-sme-lex.txt",         {"pos_filter": ["<Po>", "<Pr>"], "split": "LEXICON Adposition\n"}],
          ["punct-sme-lex.txt",      {"no_trim": true, "split2": "\t! Rare punctiation:\n", "no_footer": true}]
        ]
      }
    ]

We have an outer list `[]` consisting of one main table `{}`, this
defines how to create one lexc file (one config could be used to hold
definitions for several lexc files, then we'd have several main tables
in that outer list). The main table says that the languages involved
are `LANG1` "sme" and `LANG2` "nob" (ie. the language pair is named
`apertium-sme-nob`), and we want to `PRODUCE_LEXC_FOR` "sme", which is
the left (`L`) side of our bidix. The script assumes standard Apertium
naming schemes, so given this information, we assume the compiled
bidix named `sme-nob.autobil.bin`, and the lexc is written to
`apertium-sme-nob.sme.lexc`.

The `OUTPUT_DIR` is the path to where the output lexc file should end
up (and where we find the compiled bidix), it can be relative to this
configuration file, or an absolute path. The `SRC` is the path to the
directory where the below list of files below reside. 

`LEX_EXCLUDES` is a list of Python regular expressions. Any line of
any source file which matches such an expression, will not be in the
output lexc.

`files` is the bit that requires most tweaking. This is a list of
files, where you define how to include each file. The first element is
the file name (relative to `SRC`), the second is a table of options.
The first line in the example has `"no_trim":true` (the default is
false), and says that the whole file `sme-lex.txt` should be included
first, with no modifications (except what `LEX_EXCLUDES` defines).

The second entry says that proper nouns should be trimmed to bidix.
The `pos_filter` in the options table is `<N><Prop>`, which means that
we require any entry in the proper noun file to have an entry in bidix
which has the tags `<N><Prop>` right after the lemma. As the following
line shows, `pos_filter` can also be a list of tags – the file
`pp-sme-lex.txt` defines both postpositions with the bidix tag `<Po>`
and prepositions with the bidix tag `<Pr>`.

Both the second and third files have a `split` point, this must match
a line in the file from which we start trimming. Anything above this
line is included verbatim (unless there is a `"no_header":true` entry
in the table, in which we include nothing above the `split` line).

Similarly, we can have a `split2` point, anything below this line is
included verbatim, or not at all if we have a `"no_footer:true` entry
in the table. The final file entry in the example defines both a
`split2` point `"no_footer:true`, so it only includes anything above
the line which says `\t! Rare punctiation:` (it also says
`"no_trim":true`, so it won't do any bidix trimming).

There is of course nothing stopping you from adding the same file name
twice to the list; if you want, you can define lots of different
"slices" of a file by using `no_header` with `split` and `no_footer`
with `split2`.

**Note**: use `\t`, not literal tabs in JSON strings. And
unfortunately only ASCII symbols are allowed for now.


== Troubleshooting ==

=== Error: mach-o, but wrong architecture ===

If you get an error like:

    Traceback (most recent call last):
      File "update-lexc.py", line 606, in <module>
        sys.exit(main())
      File "update-lexc.py", line 594, in main
        run_with_conf(XC)
      File "update-lexc.py", line 555, in run_with_conf
        lang
      File "update-lexc.py", line 453, in make_lexc
        fst = liblt.FST(".libs/libltpy.dylib", BIDIX_BIN)
      File
    		 "/apertium/staging/apertium-sme-nob/update-morph/liblt.py",
    		 line 6, in __init__
        self.__lib = CDLL(libpath)
      File
    		 "/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/ctypes/__init__.py",
    		 line 353, in __init__
        self._handle = _dlopen(self._name, mode)
    OSError: dlopen(.libs/libltpy.dylib, 6): no suitable image found.  Did find:
     .libs/libltpy.dylib: mach-o, but wrong architecture
    Exception AttributeError: "'FST' object has no attribute '_FST__handle'" in <bound method FST.__del__ of <liblt.FST object at 0x412fb0>> ignored
    
then it's likely that the python version you are running is not the
same architecture (32-bit vs 64-bit) as the compiled file. To find out
the architectures of the files, do:

    $ file .libs/ltlibpy.dylib
    $ file `which python2.6`   # or python2.7 or whatever

and if you get something like "symbolic link to X", do `file X` until
you see something that says either 32-bit or 64-bit. If your python
turns out to be 32-bit and you're on a Mac, you should be able to use
`/usr/bin/python2.6` instead (which is Universal, ie. both 32-bit and
64-bit).

== TODO ==

* maybe move this into trunk/apertium-tools/apertium-hfst-tools if it
  becomes generally useful, as it's a lot less painful than using the
  lt-expand method
