
Fighting the Unicode Fight

Almost any time I have to build a new corpus, the Unicode Fight returns. I lived many Unicode-Fight-free years once Linux became 100% Unicode, but now I'm using Mac OS X.

On the Mac, Java's default file.encoding is MacRoman. I've Googled every combination of keywords I could think of to find the proper way (via System Preferences) to set the default to UTF-8, to no avail. I really hate Google's new (~6 months ago) search algorithm, which tries to guess what we mean to ask and doesn't include all the keywords we query. It makes it nearly impossible to find anything long-tail-ish.

This is when it started working in Java/Groovy:

  • created a file /etc/launchd.conf and put this into it (launchd.conf lines use launchctl's "setenv NAME value" form, with a space rather than an "="):

    setenv JAVA_TOOL_OPTIONS -Dfile.encoding=UTF-8

For general purposes:

  • added this to my ~/.vimrc:

    set encoding=utf-8
    set fileencoding=utf-8

  • added this to my /etc/bashrc:

    export LC_CTYPE=en_CA.UTF-8
    export JAVA_TOOL_OPTIONS=-Dfile.encoding=UTF-8

  • changed Terminal > Preferences > Encodings to only UTF-8, and Terminal > Preferences > Settings > Advanced > International to UTF-8 (I also set the font to Menlo)
  • for good measure, changed all my text input to Inuktitut (that ought to force Unicode for good :)

I'm pretty sure this will have dastardly side-effects for any of my Java programs (I'm most curious about Eclipse and GATE)... we'll see.

Now groovy picks up the flag and sets the encoding to UTF-8:

    $ groovy -e "println System.getProperty('file.encoding')"
    Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
    UTF-8
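
The same check can be made from plain Java. This is only a minimal sketch: the class name EncodingCheck and its describe() method are made up for illustration, and what it prints depends on the JVM version and on whether the flag was picked up.

```java
import java.nio.charset.Charset;

// Hypothetical helper (not from the original post): reports which
// encoding the running JVM treats as its default.
public class EncodingCheck {

    // Combines the file.encoding system property with the charset
    // the JVM actually uses when no charset is given explicitly.
    public static String describe() {
        return "file.encoding=" + System.getProperty("file.encoding")
                + " defaultCharset=" + Charset.defaultCharset().name();
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```

If the JAVA_TOOL_OPTIONS setting is in effect, the JVM should also print its "Picked up JAVA_TOOL_OPTIONS" notice before this program's own output.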
And Inuktitut now prints in the GroovyConsole instead of ?????. GroovyConsole no longer saves my Groovy code as MacRoman, replacing all the Inuktitut with ?. Ah... finally.
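
For the curious, here is a small sketch of where those question marks come from. It assumes ISO-8859-1 as a stand-in for MacRoman (it ships with every JVM and, like MacRoman, has no syllabics), and the class and method names are made up for illustration. The text is written with Unicode escapes so the example doesn't depend on the source file's own encoding.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo (not from the original post): shows how encoding
// text into a charset that lacks its characters silently loses them.
public class LossyEncodingDemo {

    // Encode the text to bytes in the given charset, then decode it back.
    // String.getBytes(Charset) substitutes the charset's replacement byte
    // for every character it cannot represent.
    public static String roundTrip(String text, Charset charset) {
        byte[] bytes = text.getBytes(charset);
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // Six characters from the Canadian Aboriginal Syllabics block.
        String syllabics = "\u1403\u14C4\u1483\u144E\u1450\u1466";

        // ISO-8859-1 stands in for MacRoman here (an assumption).
        System.out.println(roundTrip(syllabics, StandardCharsets.ISO_8859_1)); // ??????

        // UTF-8 can represent every character, so nothing is lost.
        System.out.println(roundTrip(syllabics, StandardCharsets.UTF_8).equals(syllabics)); // true
    }
}
```

Once a file has been saved this way, the original characters are gone for good. Changing the encoding afterwards can't bring them back.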

