Tuesday, May 17, 2011

Soon all linguists will be first generation

At a Machine Learning talk last year the speaker was presenting his neural net which, if given Wikipedia data, could learn to tag parts of speech, disambiguate words and even give pretty OK parse trees for any language. My thought (as a linguist who deals with lots of data in many languages) was "Cool! Bet it's not great, but at least it's less bootstrapping that I have to do when I start navigating a new language's data." But the speaker said in question period that "linguists" disapprove of his approach, generally out of concern that we cannot determine what his model is doing. Yeah, that's certainly true, since the nodes don't correspond to explicit models of human knowledge. But I don't need the computer to model the data; I just want some rough clusters and classification so that I can go through the data and do the fine-grained analysis using my human brain. We should also be careful with our human brains. We often over-generalize: we see patterns that we think account for the data, but once we test them we discover that they account for only 10% of it. To really account for the data we need a balanced approach, with both large-scale statistics and some theoretical modeling...

When the speaker asked for a sample English sentence to test his model, someone offered "The cat the dog bit run" in an (incorrect) attempt to offer a center embedding example. As the audience member tried to reformulate his answer I realized this was the "linguist" that Machine Learning folks had met: someone who had memorized sentences that linguists find cool, in order to seem like a linguist at intellectual debates. I would also note that the other sentence the guy offered, after failing the center embedding example, was "Colorless green ideas sleep furiously." Natural Language does have really interesting data; people should just go on the web and grab a sentence. Very few sentences are SVO, and even fewer use center embedding or are garden paths like "The horse raced past the barn fell." Most ordinary sentences are already challenging for a model designed to recognize them. There's no need to test the hypothetical sentences that were a challenge for our Transformational Grammar framework back when the textbooks were written.

That's when I realized that maybe that old quote about "hire a linguist and productivity drops" might be due to hiring "linguists" back when most departments didn't emphasize comparative linguistics (pre-2000), or to hiring a computational linguist who was raised by a linguist who was raised by a linguist back in the days of Transformational Grammar. In those days comparative linguistics either didn't exist or meant working on Native American languages and inventing "ad hoc" grammatical categories because Latin grammar wasn't good enough to account for the data. It's not that linguists are useless; it's just that, as in every other field, you need one with real experience and training, not just someone who did their homework and read their textbook. Generally you'll only get true experience during your PhD, but that is changing. More and more departments are putting fieldwork on their course offerings and teaching modern linguistics in their Intro classes, rather than teaching a history of linguistics and only getting to the modern part in the 400-level courses or, worse, in grad school.

As a rough generalization, any "linguist" raised in a computer science department, or in many undergrad linguistics departments, is better described as a 3rd or 2nd generation linguist. They are memorizing generalizations and simplifications that the profs of their profs taught in the 1960s or 1980s! Linguists who do fieldwork or comparative linguistics, on the other hand, have first-hand access to data. They don't just apply the generalizations they learned in school; instead they apply fundamental principles of linguistics to discover new patterns and generalizations. The more Natural Language Data we work on, the more useful principles we discover, if we let ourselves be human and not logicians. In my computer science courses I discovered a similar tug of war between Computer Scientists who wanted to formalize everything in terms of logic and others who wanted to make it work. Not all software can be written as formal algebras, and not all linguistic generalizations can be either, especially if you don't know what the primitive operations are or what variables the algebra operates over. Creating a chess-playing program is a challenge, but it's only the first step in artificial intelligence. Most games aren't played like chess, just like most sentences aren't SVO.

Since that Machine Learning talk I have stopped introducing myself as a linguist and introduce myself as a field linguist instead. At least then I can discover what the other person thinks a "linguist" is. There are lots of linguists out there. As budding linguists in Ling 101 we learn to clarify that we aren't "translators," "polyglots" or "philologists." I could also use comparative linguist, or descriptive linguist, but that might be too close to something they think they understand. Field linguist is also technically true; I am a field linguist. I use theoretical linguistics to inform my work, but it's not the be-all and end-all of my research. Theoretical linguistics is still very young, and we don't have good enough tools for Pragmatics, Discourse Analysis, Prosody or even Morphosyntax. I've spent the past 10 years in the field doing classic field linguist things like immersing myself in Natural Language Data in context, with hours of recording, transcribing and working with informants for a variety of language types and language families (Urdu, Korean, Romanian, Czech, Turkish, Inuktitut, etc.).

Times are changing: many of my friends are doing a unit on fieldwork in their Ling 101 or Language and Society classes. It's fun for the students to "discover" that they can figure out a new language. They realize that they are human, and learning languages is what humans do naturally when exposed to data.
