Thursday, May 5, 2011

Field Linguistics with GATE: highlighting Quechua words in context

I uploaded some screencasts made with QuickTime on my Mac mini. The screencasts are from our Tools for FieldLinguistics workshop on April 30.

This is an example of what to do if there isn't a pipeline for the language you're working on. It shows grabbing a Quechua magazine (http://www.cenda.org/periodico/134/sup_134.pdf), figuring out which words are function words and which are content words, and looking for suffixes. It highlights some things automatically using JAPE so you can search for them in context. Depending on the language, the division between function and content words is enough to get started. The content words (especially in agglutinative languages like Quechua) are pretty interesting. You can use the output of this GATE example to hypothesize suffixes and then have some examples at your fingertips when you sit down with your informant.

The script for this example is here: https://github.com/cesine/ToolsForFieldLinguistics/blob/master/src/com/fieldlinguist/groovyInGATE/ExtractWordsOrderBySuffix.groovy

If you want to run it on a real corpus you can use this one: https://github.com/cesine/CorporaForFieldLinguistics/blob/master/Quechua/quechua.utf8.txt

It also works on any language; we just chose Quechua as an example.
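For readers who want to try the idea outside GATE, here is a rough sketch in Python. It is not the Groovy script linked above and does not use GATE or JAPE; the function names, the "most frequent words are probably function words" heuristic, and the cutoffs (`top_n`, `max_len`, `min_count`) are all assumptions made for illustration. It splits a text into candidate function and content words, orders the content words by their endings, and counts endings that recur.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return word tokens (letters, with internal apostrophes)."""
    return re.findall(r"[^\W\d_]+(?:'[^\W\d_]+)*", text.lower())


def split_function_content(tokens, top_n=5):
    """Assumption: the top_n most frequent word types are candidate function words.

    Everything else is treated as a content word. This is only a starting
    heuristic; check the result with your informant.
    """
    counts = Counter(tokens)
    function_words = {word for word, _ in counts.most_common(top_n)}
    content_words = sorted(set(tokens) - function_words)
    return function_words, content_words


def order_by_suffix(words):
    """Sort words by their reversed spelling so shared endings sit next to each other."""
    return sorted(words, key=lambda word: word[::-1])


def candidate_suffixes(words, max_len=4, min_count=2):
    """Count word endings of 1..max_len letters that occur on at least min_count words."""
    counts = Counter()
    for word in words:
        for n in range(1, max_len + 1):
            if len(word) > n + 1:  # keep at least a two-letter stem
                counts[word[-n:]] += 1
    return [(suffix, c) for suffix, c in counts.most_common() if c >= min_count]


if __name__ == "__main__":
    # Toy sample, not taken from the magazine or the corpus linked above.
    sample = "chay wasikuna chay runakuna chay wasipi chay llaqtapi"
    function_words, content_words = split_function_content(tokenize(sample), top_n=1)
    print("function words:", function_words)
    print("content words by ending:", order_by_suffix(content_words))
    print("candidate suffixes:", candidate_suffixes(content_words))
```

To try it on a larger text, read a UTF-8 file (for example a local copy of the corpus linked above) into a string and pass it through `tokenize` in place of `sample`. The recurring endings are only hypotheses; many will be accidental letter sequences rather than real suffixes.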



33 organic views so far from posting it on our Facebook page.

The source code is here: Tools for FieldLinguistics. It's not that helpful for non-programmers yet, but it will become more user-friendly over time as I make more "Watchme" screencasts.
