cesine: Fighting the Unicode Fight

Sunday, November 6, 2011

Fighting the Unicode Fight

Almost any time I have to build a new corpus, the Unicode Fight returns. I lived many Unicode-Fight-free years once Linux became 100% Unicode, but now I'm using Mac OS X.

The default file.encoding on the Mac is MacRoman. I tried all sorts of Google searches to find the proper way (through System Preferences) to set the default to UTF-8, to no avail. I really hate Google's new (~6 months ago) search algorithm, which tries to guess what we mean to ask and doesn't include all the keywords in the query. It makes it nearly impossible to find anything long-tail-ish.

This is what finally got it working in Java/Groovy:

I created the file /etc/launchd.conf and put this into it:

setenv JAVA_TOOL_OPTIONS -Dfile.encoding=UTF-8

For general purposes:

added this to my ~/.vimrc

set encoding=utf-8
set fileencoding=utf-8

added this to my /etc/bashrc

export LC_CTYPE=en_CA.UTF-8
export JAVA_TOOL_OPTIONS=-Dfile.encoding=UTF-8

I changed Terminal > Preferences > Encodings to only UTF-8, and Terminal > Preferences > Settings > Advanced > International to UTF-8. (I also set the font to Menlo.)

For good measure I changed all my text input to Inuktitut (that ought to force Unicode for good :)

I'm pretty sure this will have dastardly side-effects for any of my Java programs (I'm most curious about Eclipse and GATE)... we'll see.

Now Groovy picks up the flag and sets the encoding to UTF-8:

$ groovy -e "println System.properties.'file.encoding'"
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
UTF-8
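
The same check can be made from plain Java, which might be handy for seeing whether something like Eclipse or GATE picked up the flag too. This is only a minimal sketch: the class name `EncodingCheck` and its `report` method are made up for illustration, and the values it prints depend on your JVM and environment.

```java
import java.nio.charset.Charset;

public class EncodingCheck {
    // Builds a one-line report of the encoding settings the JVM is using.
    // (Hypothetical helper, not part of any library.)
    static String report() {
        return "file.encoding=" + System.getProperty("file.encoding")
             + " defaultCharset=" + Charset.defaultCharset().name();
    }

    public static void main(String[] args) {
        System.out.println(report());
    }
}
```

If JAVA_TOOL_OPTIONS reached the JVM, the "Picked up JAVA_TOOL_OPTIONS" line should also appear before the report.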

And Inuktitut now prints in the GroovyConsole instead of ?????. GroovyConsole also no longer saves my Groovy code containing Inuktitut as MacRoman, replacing all the Inuktitut with ?. Ah... finally.
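
For the curious, here is roughly where the ? characters come from. When Java encodes text into a charset that has no mapping for a character, `String.getBytes` substitutes the charset's replacement byte, which is ? for the single-byte Latin charsets. A small sketch, with some assumptions: the class `LossyEncoding` and its `roundTrip` helper are made up for illustration, and ISO-8859-1 stands in for MacRoman because MacRoman is not guaranteed to be available on every JVM (neither one contains Inuktitut syllabics). The string is written with Unicode escapes so the source compiles the same way whatever the compiler's own encoding is.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class LossyEncoding {
    // Encodes text to bytes in the given charset, then decodes it back.
    // Characters the charset cannot represent are lost in the encoding step.
    static String roundTrip(String text, Charset cs) {
        return new String(text.getBytes(cs), cs);
    }

    public static void main(String[] args) {
        // "Inuktitut" in syllabics, six characters, as Unicode escapes.
        String inuktitut = "\u1403\u14c4\u1483\u144e\u1450\u1466";

        // UTF-8 can represent every character, so the round trip is lossless.
        System.out.println(roundTrip(inuktitut, StandardCharsets.UTF_8).equals(inuktitut));

        // ISO-8859-1 (stand-in for MacRoman) has no syllabics: each becomes '?'.
        System.out.println(roundTrip(inuktitut, StandardCharsets.ISO_8859_1));
    }
}
```

The damage happens when the bytes are written, so a file saved this way cannot be repaired by reopening it as UTF-8 later.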
