Sunday, December 25, 2011

Redmine on Ubuntu 10.10

Most tutorials for setting up Redmine on Ubuntu end with using WEBrick to test, but I wanted a production setup. This is a reconstruction of what finally worked (assuming no prior installs of Ruby or Rails, but a fully set-up LAMP stack):

$ sudo apt-get install ruby-dev redmine
$ sudo gem install passenger
$ sudo apt-get install apache2-dev libapr1-dev libaprutil1-dev 
$ echo 'export PATH=/var/lib/gems/1.8/bin:$PATH' >> ~/.bashrc
$ sudo /var/lib/gems/1.8/bin/passenger-install-apache2-module 
$ sudo a2enmod rewrite
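The PATH line matters because Ubuntu's packaged RubyGems installs gem executables outside the default PATH. A minimal sketch (the directory is the one used above) to confirm the export actually took effect in the current shell:

```shell
# Ubuntu's rubygems puts gem executables in /var/lib/gems/1.8/bin, which is
# not on the default PATH; prepend it and confirm it is now reachable.
GEM_BIN=/var/lib/gems/1.8/bin
export PATH="$GEM_BIN:$PATH"
case ":$PATH:" in
  *":$GEM_BIN:"*) echo "gem bin dir is on PATH" ;;
esac
```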

$ sudo vim /etc/apache2/sites-enabled/000-default
LoadModule passenger_module /var/lib/gems/1.8/gems/passenger-3.0.11/ext/apache2/mod_passenger.so
PassengerRoot /var/lib/gems/1.8/gems/passenger-3.0.11
PassengerRuby /usr/bin/ruby1.8
<VirtualHost *:80>
  DocumentRoot /usr/share/redmine/public
  <Directory /usr/share/redmine/public>
    AllowOverride all
    Options -MultiViews
  </Directory>
</VirtualHost>

And some BIND, if you are running your own DNS...

$ sudo vim /etc/bind/named.conf.local

zone "" {
      type master;
      file "/var/lib/bind/";
};
$ sudo vim /var/lib/bind/

$TTL 38400
@    IN    SOA    tower110103. (
                  38400 )
     IN    NS
     IN    A
     IN    A
     IN    A

Restart apache and bind

$ sudo /etc/init.d/apache2 restart
$ sudo /etc/init.d/bind9 restart



Monday, November 7, 2011

Bare Singulars and Bare Plurals

Ever since my final paper for our Advanced Semantics Seminar on Plurality, generics, bare singulars (including incorporated nouns), and bare plurals have been near and dear to my heart.

Naturally, a post on generic comparisons on Language Log quickly got my attention. Liberman argues that using generic plurals toys with the gap between statistically significant generalizations and grammatical genericity, so "that the results are presented in a way that misleads the public — and in some cases, the use of generic plurals seems to mislead the scientists themselves."
He cites a number of examples from work by Sarah-Jane Leslie on "Generics and Generalization":
  • "Ticks carry Lyme Disease", although only a minority of ticks do so (14% in one study).
  • "Mosquitoes carry West Nile Virus", though the highest infection rate found in the epicenter of a recent epidemic was estimated at 3.55 per thousand (and the rate was essentially zero outside of the epicenter).
  • "Ducks lay eggs" and "Lions have manes", though in each case the prevalence is at most 50% (only female ducks lay eggs, and only male lions have manes).

Sunday, November 6, 2011

Fighting the Unicode Fight

Almost any time I have to build a new corpus, the Unicode Fight returns. I lived many Unicode-Fight-free years once Linux became 100% Unicode, but now I'm using Mac OS X.

The default file.encoding on the Mac is MacRoman. I've tried a whole variety of Google searches to find the proper way (using System Preferences) to set the default to UTF-8, to no avail. I really hate Google's new (~6mos ago) search algorithm that tries to guess what we mean to ask and doesn't include all the keywords we query; it makes it near impossible to find anything long-tail-ish.

This is when it started working in Java/Groovy:

  • created a file /etc/launchd.conf and put this into it:

 setenv JAVA_TOOL_OPTIONS -Dfile.encoding=UTF-8
For general purposes:
  • added this to my ~/.vimrc
set encoding=utf-8
set fileencoding=utf-8
  • added this to my /etc/bashrc
export LC_CTYPE=en_CA.UTF-8
export JAVA_TOOL_OPTIONS=-Dfile.encoding=UTF-8
  • Changed my Terminal > Preferences > Encodings to only UTF-8, and Terminal > Preferences > Settings > Advanced > International to UTF-8 (I also put the Font to Menlo)
  • For good measure I changed all my text input to Inuktitut (that ought to force Unicode for good :)

I'm pretty sure this will have dastardly side-effects for any of my Java programs (I'm most curious about Eclipse and GATE)... we'll see.

Now groovy picks up the flag and sets the encoding to UTF-8:
$ groovy -e "println System.getProperty('file.encoding')"
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
UTF-8
And Inuktitut now prints out in the GroovyConsole instead of ?????. My Groovy code containing Inuktitut isn't getting saved as MacRoman (with all the Inuktitut replaced by ?) by GroovyConsole anymore. Ah... finally.
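A quick way to confirm that a whole shell pipeline really passes UTF-8 through unmangled is a round trip through iconv (the sample text here is arbitrary Inuktitut syllabics):

```shell
# Round-trip some Inuktitut syllabics through iconv: if any byte is not
# valid UTF-8, iconv errors out instead of echoing the text back.
sample='ᐃᓄᒃᑎᑐᑦ'
roundtrip=$(printf '%s' "$sample" | iconv -f UTF-8 -t UTF-8)
[ "$roundtrip" = "$sample" ] && echo "UTF-8 survived the pipeline"
```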

Tuesday, November 1, 2011

Bootstrapping Android Best Practices

Like human language, programming languages are 

  • 1 part syntax, 
  • 1 part vocabulary, and 
  • 1 part culture/socio-linguistics. 
Too often when learning a new language we focus on syntax and vocabulary, but not enough on culture/best practices. Sure, in our courses we might learn that the French like wine and baguettes and wear berets, but on the ground it's not really that simple (n'est-ce pas?). In this tutorial we "immerse" ourselves in the culture of two projects to simultaneously learn syntax, vocab and best practices for getting things done in Android development.

We have selected a few repositories: two which show best practices, and two pairs of "pidgins" vs. best practices which show not-yet-fully-formed Android development.

Beginner-Friendly Best Practice Learning Grounds

  • Advanced: good project management (instructions end to end: how others can set up the code and contribute to the project)
Replica Island
  • Advanced: Game engine

Pidgins vs. Best Practices Pairs

Two page-curl repos
Three Blogger clients

Google IO Sched 2011

  • Advanced: using fragments for phones and tablets (warning: this creates many layers of abstraction, so it's hard to navigate as a beginner)
If you want to use MyTracks you will need to follow their steps to set up your Eclipse environment and update your AVD manager.

Monday, October 31, 2011

One busy month

I have been really busy with my other app and conferences so I didn't push out any new releases of AuBlog, and it shows in my active installs. The active installs are staying pretty constant.

I have a couple of projects to work on before I can release version 2 of AuBlog which I would like to make GUI-free and really focus down on a couple of core features. I'm targeting January...

Total active installs 

Recording voice, eye-gaze and touch on Android tablets

We presented our codebase which records voice, eye gaze and touch on Android tablets at the Academy of Aphasia annual meeting a few weeks ago. Our poster is here

Our Architecture and Results

2011      (with A. Marquis and A. Achim) "Aphasia Assessment on Android: recording voice, eye-gaze and touch for the BAT," Academy of Aphasia 49th Annual Meeting, Montréal.

I wanted to make it as easy as possible to reuse our code so I made a couple of videos to walk through the project and explain it in non-technical terms.

The first video talks about the Android side, which simply collects the video, audio and touch data.

The second video talks about the "server side" where a lot of the open source repositories are used and the really exciting data extraction and analysis takes place.

The third video gives an overview of how to get the code.

The fourth video is a lot longer than the others because it shows how you can adapt the project to your own experiment and also how you can use GitHub to manage your own projects (good for long distance collaboration and delegating among team members).

The last video is a quick demo of the touch data we got for the stimulus "shin" for our subjects, with a lively rendition of "Parole, Parole" :)

"Bébés Bescherelle" aka recent proof that morphosyntax is acquired as young as 11 months

Bébés Bescherelle !

By Pierre-Etienne Caza
Alexandra Marquis
Photo: Nathalie St-Pierre
For some forty years, language-development specialists have held that verbs are complex to learn. That would be why they appear so late (around the age of 18 months) in children's speech. But that does not mean children have not started decoding the subtleties of conjugation well before. "Children are able to recognize verbal endings from the age of 11 months," says Alexandra Marquis, who this fall is publishing the surprising results of her doctoral research in the journal Cognition, in collaboration with her research supervisor, Professor Rushen Shi of the Département de psychologie.
This research, the first in the world to demonstrate the ability to analyze conjugated words in babies so young, grew out of a questioning of American studies that placed the recognition of verb forms between 14 and 17 months of age. "Babies recognize very familiar words like maman from four months, and less familiar nouns from about six months," notes Alexandra Marquis, who recently completed her doctorate in psychology. "Why wouldn't they recognize verb forms?"

Several experiments

To answer her questions, the young researcher devised some twenty experimental situations in which children aged 8 to 18 months were exposed to pre-recorded sound clips and to images, in Professor Shi's laboratory. She first checked whether children recognized simple verb forms, like mange or chante. "Children do not recognize these verb forms at 8 months, but those aged 11 months are able to," she notes.
The second stage of her experiments: associating a simple form with a conjugated form. For the frequent -é ending, as in mangé, the results were conclusive. "Children have already heard this conjugation paradigm around them, for example mange-mangé, trouve-trouvé, etc.," she explains. Could children be making the association solely from the shared initial phonemes, without recognizing the endings and the conjugation relation? "I judged that possibility unlikely, because children do not make the error of interpreting château as containing chat; they treat château as a whole word. But we had to push further to be sure." For that, she had to create artificial words (like glute) and redo the experiment with a verbal ending that is impossible in French, namely the sound -ou.
The results were spectacular: the children reacted to the simple verb form (glute) and to the form conjugated with the -é ending (gluté), but not to the form "conjugated" with the sound -ou (glutou), since it made no sense to them. "It's not a fluke," Alexandra Marquis points out proudly, "because we repeated our experiments with children of 11 months, 14 months and 18 months, and they all react the same way."
The children were even put through another interesting experiment: they were played several different invented words all ending in -ou, in order to make that ending "real" for them. "It takes a child two minutes to make a new acquisition, thanks to great neuronal flexibility," explains Professor Shi. The researcher then repeated the -ou-ending experiment with the new verb forms. This time, the children made the association between the two forms. "That means that at eleven months, before the age of one, before speaking, children are able to learn a new verbal ending and extend the knowledge acquired during their development to new forms," she notes.

More research to come

If children do not yet know the other, more complex verb endings, it is purely a matter of frequency of occurrence in the language, Alexandra Marquis and Rushen Shi believe. "Once they have heard the different endings with varied roots often enough in the language, they will be able to apply them, well before school age," says Alexandra Marquis, who has no intention of stopping there. A lecturer at several Quebec universities, the young researcher is pursuing postdoctoral studies at the École d'orthophonie et d'audiologie of the Université de Montréal. "I want to establish a timeline of the acquisition of verbal paradigms from birth to school age in Quebec children, before the rules taught at school are learned," she concludes enthusiastically.
Source: Journal L'UQAM, vol. XXXVIII, no. 5 (October 31, 2011)

Researchers with Open Data or Open Source are more likely to be cited

At the ETAP2 (Experimental and Theoretical Advances in Prosody) conference a strange thing happened while I was presenting my poster. A guy came over and spent about a half an hour talking to me about Open Data and Open Source. I got the sense that he was recruiting for something, but I assumed he was probably a professor looking for grad students.

After looking him up on the internet I discovered Heather Piwowar, a postdoc with DataONE, a project sponsored at NASA to encourage researchers to keep their data (and their research) open and available. From what I can see they have some publications which show that if you keep your data open and your source open, you're far more likely to be cited. That makes sense: people can open your data and look at it. By opening your data, you bring interest to your data and your research.

I'm trying to put my finger on why we as linguists are not completely confident in opening our data. I think one part of it is that we think someone else will publish our results as their own. For example, a friend of mine recently discovered that a rather famous researcher on their topic, who was also a reviewer of their NSF dissertation grant, submitted an NSF grant proposal one year later, coincidentally to do fieldwork in the exact same city, on the same dialect, and on the same phenomena as their as-yet-unpublished dissertation.

Contrary to what we might think, putting our data online is actually one way to prove that we "discovered" it first. There will always be a server with a time stamp that shows we published the information first. We feel like the only way we own our data and our work is by publishing it in a peer-reviewed journal, but when it comes down to it, putting it online in a reputable open source repository like GitHub, or an open science repository like OpenWetWare, works like dating a copyright. For sure the data and findings need to be published in a peer-reviewed journal so others can cite you, but web links to your data can spread the word pretty fast, sometimes faster than a peer-reviewed journal.
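As a concrete illustration of the time-stamp point (the file name and commit message below are made up), every Git commit carries an author timestamp that travels with the data wherever it is pushed:

```shell
# Create a throwaway repo and commit a data file: the commit itself is a
# dated, hash-chained record of when the data existed.
repo=$(mktemp -d)
cd "$repo"
git init -q .
echo "elicitation session, 2011-09" > data.txt
git add data.txt
git -c user.name="cesine" -c user.email="cesine@example.com" \
    commit -q -m "initial data drop"
git log -1 --format='committed at %aI'    # ISO-8601 author timestamp
```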

It was really exciting and validating to find out that there are projects and people out there helping us and encouraging us to keep our research as available as possible; in fact, those who engage in open research have a better chance of getting tenure, because they will indeed get citations and publications resulting from their open data.

Wednesday, September 28, 2011

I'm a just a dude like any other programmer

After listening to "Should Google+ require you to use your real name?" on one fine sunny bike ride, I was left wondering whether my justification for anonymity might be more common than the authors think. My wondering stopped there, until this evening, when I giggled at one of my GitHub messages.

There are many nefarious reasons to use a handle. Some people hide behind anonymity to post nasty comments on YouTube, troll in general, say abusive things, or start mass riots in countries where freedom of speech isn't common. But not all reasons for anonymity are nefarious; some are just about having a level playing field. As "cesine" I've quietly listened on user groups while others suggest new Barbie wallpapers and flirty penguins to bring some of the female persuasion over to Linux, etc. I once made an eye-fluttering Tux and put it on my website, wondering if they might catch on that there was something different about the operator of that server.

Anonymity for me meant having people treat me like any other programmer/geek. My real name isn't at all gender-neutral like Tony, Alex or Jesse. But my handle, "cesine", which has been my web identity since 1996, is everything neutral. And apparently, it's working. I think this speaks for itself.

Now that I'm officially "out of the closet" in the blogosphere, it's only a question of 10 minutes' research to find out I'm not your typical dude, but still, that's 10 minutes most people won't take. As long as I don't have to use my real name, that's a 10-minute cushion of unbiased respect.

Monday, September 26, 2011

Precision vs. Recall defined for linguists

Precision and recall are two pieces of terminology from computer science which help linguists divide tasks best done by a linguist from those best done by a script or some other automation; in other words, they tell you when it needs to be perfect, and when good enough is good enough.

Recall means getting back all the examples in your data which display the factor you're studying. You can get high recall by writing a script which returns a lot of results; there is always a second step, going through the examples yourself as a human to filter out the extraneous ones. Getting high recall is generally a good first step when you start your research (think Google web search: you really want to know all the authors who have written on your topic...).

Precision means getting data that you can run stats on and get statistical significance. High precision means all the results are what you were looking for. Getting high precision is important for making any claims or generalizations; it's generally the last step (and highly valued) in your research.

How do you get high recall or high precision?

You get high precision by using rules or by having a human check your data (or multiple humans if the factor is hard to detect or the classification is tricky). You get high recall by writing a simple script, or by using statistics and setting your statistical threshold to be more permissive.

How do you know which one you need in a given context?

When you are working on theory, ideally you want both high recall and high precision (basically the equivalent of necessary and sufficient conditions for defining a set). Having high recall but low precision is okay as long as your goal is to share your research and data and get feedback on the categorization of your data.
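To make the two measures concrete, here is a toy calculation (the counts are invented): suppose your script returns 60 hits, of which 40 are genuine examples, while the corpus actually contains 50 genuine examples in total.

```shell
# precision = true positives / everything the script returned
# recall    = true positives / everything it should have returned
tp=40; returned=60; relevant=50
precision=$(awk -v tp="$tp" -v n="$returned" 'BEGIN{printf "%.2f", tp/n}')
recall=$(awk -v tp="$tp" -v n="$relevant" 'BEGIN{printf "%.2f", tp/n}')
echo "precision=$precision recall=$recall"
# prints: precision=0.67 recall=0.80
```

So this hypothetical script is a decent first pass (high recall) but a human would still need to weed out the 20 false hits before running stats.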

Wednesday, September 7, 2011

Watchmes for AuBlog

I made some quick-n-dirty Watchmes

How to use AuBlog for blogging via typing

How to use AuBlog for blogging via dictation

The machine transcriptions are hilarious, and not very useful. AuBlog uses open source machine transcription software (Sphinx). It needs to be trained on your "iLanguage" (vocabulary) to return quality results...

Feature Algebra in a Nutshell

Feature Algebra is an "algebraic form of representation that allows the use of variables and indices for the purposes of identity checking" (Reiss 2002).

Posted using my Android

Wednesday, August 31, 2011

App Stats 20 days - Conclusion: Calling for Malay localizers :)

Let's take a look at my Android Market stats during open beta testing for Iteration I (Aug 9-29: User Interface).

I have a total of 15 active users according to the Android Market, most of the users have either Android 2.2 (Froyo) or 2.3.3 (Gingerbread).

Total active installs : 15
Android versions:

1. 2.2        2,084    78.17%
2. 2.3.3        381    14.29%
3. 3.1           67     2.51%
4. 2.2.1         50     1.88%
5. 3.0.1         43     1.61%
6. 2.2.2         20     0.75%
7. 2.3.4         13     0.49%
8. 3.2            7     0.26%
9. (not set)      1     0.04%

Naturally I have had installs in Canada and the US, but 50% of my user base is in Indonesia and Malaysia (probably my friends from grad school), although I'm not sure how they found out about it, since I only told a couple of people I knew who had Androids. There were also some users from Israel, Austria, Sweden, India, Lebanon and the UK. How they found AuBlog and why they downloaded it, I can't guess :)

Installs by country: 26.7% (4), 26.7% (4), United States 13.3% (2), 13.3% (2), United Kingdom 6.7% (1), 6.7% (1), 6.7% (1)

The most important thing for me was to find out what kinds of devices I should target. I have tested my app with an HTC Desire (2.2), a ViewSonic GTab (2.3.3) and a Motorola Xoom (3.1), all of which provide a different user experience with my app.

  • Samsung Galaxy Mini
  • Samsung Galaxy Tab 10.1
  • Samsung Galaxy S2
  • HTC Thunderbolt
  • HTC Wildfire S
  • LG Optimus One
  • Samsung Galaxy Fit
  • HTC Desire
  • Samsung Galaxy S
  • HTC Sensation 4G

Device Testing

The HTC Desire : my everyday device.

Bluetooth works seamlessly with AuBlog. This is fortunate, as this is the device I go biking with. The audio is also much better quality using Bluetooth, although it is still only an 8000 Hz sample rate, yielding only a 4000 Hz bandwidth, which is not usable for phonetic analysis. What a 4000 Hz bandwidth means is that high-frequency noise, like the white noise in the bursts and frication of stops, fricatives and affricates, is not captured.
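The arithmetic behind that claim is just the Nyquist limit: a recording can only capture frequencies up to half its sample rate.

```shell
# Usable audio bandwidth is half the sample rate (the Nyquist limit).
for sr in 8000 16000 44100; do
  awk -v sr="$sr" 'BEGIN{printf "%5d Hz sampling -> %5d Hz bandwidth\n", sr, sr/2}'
done
```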

On the visual side of things, the JavaScript in the Drafts Tree consistently renders the tree, and the JavaScript in the Edit Entry page consistently saves state when I rotate the device. On the other hand, the buttons in the WebView are quite ugly by default.

The ViewSonic GTab

It doesn't work with Bluetooth, despite having Bluetooth and successfully pairing with my Bluetooth headset. This is perhaps due to the lack of a telephony system: in earlier versions of Android, to play and record audio through Bluetooth an app must actually route the audio through the telephony system. So the fact that Bluetooth works on the HTC Desire may be because it runs 2.2, or because it is a phone. The Bluetooth API underwent a lot of improvements in 3.0 and beyond, so hopefully it will work more consistently on the other devices. On the other hand, Android 2.3 and above allows for wide-band recordings at a 16,000 Hz sample rate, which yields much better quality and is usable for phonetic analysis, as the range needed for human speech sounds is below 8000 Hz.

On the visual side of things, the large screen size is fun for using the Drafts Tree! While testing, I realized I had started unconsciously using the titles in the nodes to document the branching test scenarios for each feature and sequence of user actions in AuBlog. It serves as a pretty quick tree generator, especially for nodes with small labels; I was planning on repackaging it into a tree-drawing app for linguists, or tree thinkers. The WebView buttons are also quite visually pleasing by default.

A question for the tablet users: can you record and hear audio with your paired bluetooth headset when you enable it in AuBlog?

The Motorola Xoom

The Bluetooth and the audio on the Xoom have the same story as the GTab: no Bluetooth, but surprisingly good quality with the built-in mic.

The Xoom was an interesting visual surprise for AuBlog. First of all, after testing heavily on the HTC, and once in a while on the GTab, I didn't expect the Xoom to present such an unusable user experience. There are no hardware buttons on the Xoom; the buttons are created by the software. Moreover, there is no Menu button by default. Instead, menu items are rendered in "the Action Bar" if the app targets SDK 11, using the same underlying code. However, despite targeting SDK 11 and using best practices for menu coding, the menu didn't show up on the Xoom (probably because the menu items are just words, not icons). To have the menu show up I changed the target to SDK 10, and now the software menu appears. The next problem with the Xoom has to do with the Drafts Tree. The drafts tree is generated when the user clicks on Drafts: the Activity calls the WebView with the static HTML page, and the HTML page loads the JSON (the tree). By initializing the tree only after the page has loaded and the JSON has finished exporting, the tree loads properly on the HTC Desire, and also on the GTab, but not on the Xoom. At first I thought it might be due to having two cores (one core might still be writing the JSON to file while the other is loading the JSON into the WebView), but the GTab also has two cores. It may also have to do with having a newer WebView version. A question for the testers with Android 3.0 and up: does the Drafts tree load?

In future iterations (once I'm out of the prototyping stage and the architecture has stabilized) I was also planning on localizing AuBlog for another language to put all the mechanisms in place. My candidate languages were French (since I live in Montreal) and Inuktitut (since my informants might tell their friends about it), but it seems like Malay could be fun too, since it's 50% of my current user base :) That is not for a few more iterations; I will wait until the architecture has stabilized further.

Architecture Growth
Iteration I Graphical User Interface: v1.0 - 4 Months
Standard Activities + WebViews for a graphically interesting user interface + Settings for persistent variables and user preferences

Most of the development time so far has been devoted to reconciling the interest of a visually pleasing user interface with maintaining state and not losing users' entries while they type and rotate the screen. It seems like it should not be such a complex operation, but rotating the screen "destroys" and recreates the activity. Since the user is typing in a WebView, the JavaScript interface has to save the state down to the activity in order to use the built-in state-saving mechanisms. Nothing is more frustrating than losing what you have just entered because you shifted in your chair and the screen rotated. On the other hand, rotating the screen is crucial for a good user experience: sometimes you want to read the entry in wide screen, and then rotate to edit in portrait with the keyboard at the bottom of the screen. The user interface uses a couple of open source JavaScript libraries to make the drafts tree interactive (JIT) and the blog editing WYSIWYG (MarkItUp).

Iteration II Background Services: v1.1 - 20 days
Standard Activities + WebViews for a graphically interesting user interface + Settings for persistent variables and user preferences + recording and transcription-server interfacing refactored into Services + broadcasts between modules

The Audio Management Services and the transcription server architecture (Node.js and PocketSphinx) were surprisingly straightforward to set up, and are also the most important core of AuBlog's mission. In this iteration AuBlog supports 3 audio configurations: Bluetooth for wire-free dictation, earphones + mic for private dictation, and built-in mic and speakers for default dictation. AuBlog also supports a variety of internet connection preferences, from transcription only over WiFi, to transfers over 3G for files up to x in size. In my experience dictations are between 2-4 minutes long and result in really small files (10-15 KB), so it's worthwhile to send them for transcription over a 3G connection while I'm biking. I made it completely configurable so users can decide what they want to automatically transcribe. I generally only send the lowest daughter on a branch for transcription, since the others are drafts; I prefer to re-listen to them and reformulate them until I'm satisfied.
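AuBlog itself is Android/Java, but the 3G-versus-WiFi decision described above can be sketched in shell; the 20 KB threshold here is a made-up stand-in for the user's configured limit.

```shell
# Hypothetical size gate: send small dictations over 3G, defer big ones.
max_3g_bytes=20480                      # invented threshold (~20 KB)
f=$(mktemp)
head -c 12288 /dev/zero > "$f"          # stand-in for a ~12 KB dictation
size=$(wc -c < "$f")
if [ "$size" -le "$max_3g_bytes" ]; then
  echo "send for transcription over 3G"
else
  echo "queue until WiFi is available"
fi
```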

Iteration III Voice User Interface & Widget: v1.2 - 3 Months
A homescreen widget for "eyes-free" bike blogging. This will encourage refactoring the code into a central "brain" which issues intents to the other components. It will also be an interesting challenge to make it as "eyes-free" as possible and to set up/adopt an architecture that uses Google Voice to recognize app commands:
"New Entry"
"Play/Read Mother Node"
"Play/Read/Record Sister Node"
"Delete Entry"
"New Daughter"
"Stop recording"

After spending so much precious dev time making sure the graphical user interface's JavaScript libraries are functional on an Android device, I remembered that I never really wanted a visual user interface; I wanted an "eyes-free" (meaning language-only) interface. So, at this point the GUI is functional, and it will stay as it is unless there are bugs which affect its functionality. I would love to auto-resize the nodes to fit the text, but that would require a lot more dev time. I'm excited to start Iteration III, the part of the app that really interests me and the part I can contribute to most: the VUI (voice user interface). I expect this will take me more time than Iteration II, which only took 20 days. It all depends on the existing open source modules I find and how well they integrate with Android.