Sunday, November 6, 2011

Fighting the Unicode Fight

Almost every time I have to build a new corpus, the Unicode Fight returns. I lived many Unicode-Fight-free years once Linux became 100% Unicode, but now I'm using Mac OS X.

The default file.encoding on the Mac is MacRoman. I've tried all sorts of Googling to find the right keywords for the proper way (using the System Preferences) to set the default to UTF-8, to no avail. I really hate Google's new (~6 months ago) search algorithm, which tries to guess what we mean to ask and doesn't include all the keywords we query. It makes it nearly impossible to find anything long-tail-ish.

This is when it started working in Java/Groovy:

  • created a file /etc/launchd.conf and put this into it:

 setenv JAVA_TOOL_OPTIONS -Dfile.encoding=UTF-8
For general purposes:
  • added this to my ~/.vimrc
set encoding=utf-8
set fileencoding=utf-8
  • added this to my /etc/bashrc
export LC_CTYPE=en_CA.UTF-8
export JAVA_TOOL_OPTIONS=-Dfile.encoding=UTF-8
  • Changed Terminal > Preferences > Encodings to UTF-8 only, and Terminal > Preferences > Settings > Advanced > International to UTF-8 (I also set the font to Menlo)
  • For good measure I changed all my text input to Inuktitut (that ought to force Unicode for good :)

I'm pretty sure this will have dastardly side-effects for any of my Java programs (I'm most curious about Eclipse and GATE)... we'll see.
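
One way to keep code I control independent of the flag is to name the charset explicitly for file I/O, so the result doesn't depend on file.encoding at all. Here is a minimal Java sketch of that idea, assuming Java 8 or later. The class name ExplicitUtf8 and its helper methods are made up for illustration, and the sample string is "Inuktitut" written in syllabics using Unicode escapes so the source file's own encoding doesn't matter.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class ExplicitUtf8 {
    // Writes text to the file as UTF-8, whatever -Dfile.encoding says.
    public static void write(Path file, String text) throws IOException {
        try (BufferedWriter out = Files.newBufferedWriter(file, StandardCharsets.UTF_8)) {
            out.write(text);
        }
    }

    // Reads the whole file back, decoding the bytes as UTF-8.
    public static String read(Path file) throws IOException {
        return new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
    }

    // Hypothetical helper: writes text to a temp file and reads it back.
    public static String roundTrip(String text) {
        try {
            Path file = Files.createTempFile("corpus", ".txt");
            try {
                write(file, text);
                return read(file);
            } finally {
                Files.delete(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String inuktitut = "\u1403\u14C4\u1483\u144E\u1450\u1466";
        System.out.println("round trip intact: " + roundTrip(inuktitut).equals(inuktitut));
    }
}
```

Code written this way should behave the same whether or not JAVA_TOOL_OPTIONS is set; it is the programs that rely on the platform default that the flag can change.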


Now groovy picks up the flag and sets the encoding to UTF-8:
$ groovy -e "println System.properties.'file.encoding'"
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
UTF-8
And Inuktitut now prints out in the GroovyConsole instead of ?????. GroovyConsole no longer saves my Groovy code containing Inuktitut as MacRoman, replacing all the Inuktitut with ?. Ah... finally.
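
The ? characters come from encoding text into a charset that has no slot for it. Here is a small Java sketch of that effect, plus a check of what the JVM thinks its default is. The class name EncodingCheck is made up for illustration, and ISO-8859-1 stands in for MacRoman here only because every JVM is guaranteed to ship it; neither charset can represent syllabics.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingCheck {
    // Encodes text to bytes in the given charset and decodes it again.
    // Characters the charset cannot represent come back as '?'.
    public static String roundTrip(String text, Charset charset) {
        return new String(text.getBytes(charset), charset);
    }

    public static void main(String[] args) {
        // "Inuktitut" in syllabics, as Unicode escapes.
        String inuktitut = "\u1403\u14C4\u1483\u144E\u1450\u1466";
        System.out.println("file.encoding   = " + System.getProperty("file.encoding"));
        System.out.println("default charset = " + Charset.defaultCharset());
        System.out.println("via UTF-8       = " + roundTrip(inuktitut, StandardCharsets.UTF_8));
        System.out.println("via ISO-8859-1  = " + roundTrip(inuktitut, StandardCharsets.ISO_8859_1));
    }
}
```

The UTF-8 round trip returns the original six syllabics, while the ISO-8859-1 one returns six question marks, which is the same kind of damage GroovyConsole was doing on save.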
