Monday, September 26, 2011

Precision vs. Recall defined for linguists

Precision and recall are two terms from computer science that can help linguists decide which tasks are best done by a linguist and which are best done by a script or some other sort of automation; in other words, when the result needs to be perfect, and when good enough is good enough.

Recall means getting back all the examples in your data that display the factor you are studying. You can get high recall by writing a script that returns a lot of results, but there is always a second step: going through the results yourself, as a human, to filter out the extraneous ones. Getting high recall is generally a good first step when you start your research (think of a Google web search: you really want to know all the authors who have written on your topic...).

Precision means that the results you get back really are what you were looking for; high precision means all the results are genuine examples. That is what gives you data you can run stats on and get statistical significance. Getting high precision is important for making any claims or generalizations; it's generally the last step (and a highly valued one) in your research.
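To make the two definitions concrete, here is a minimal sketch in Python. The sentence IDs and the idea that a script is "looking for passives" are made up for illustration, and precision_recall is a hypothetical helper name, not something from a particular library.

```python
def precision_recall(returned, relevant):
    """Compare what a script returned against what a human judged relevant."""
    returned, relevant = set(returned), set(relevant)
    hits = returned & relevant  # results that really are what we wanted
    # Precision: what share of the returned results are real examples?
    precision = len(hits) / len(returned) if returned else 0.0
    # Recall: what share of the real examples did we get back?
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical data: IDs of sentences in a small corpus.
relevant = {1, 4, 7, 9}          # sentences a linguist judged to be real examples
returned = {1, 2, 4, 5, 7, 8}    # sentences a simple script pulled out

precision, recall = precision_recall(returned, relevant)
print(f"precision = {precision:.2f}, recall = {recall:.2f}")
```

Here the script found three of the four real examples (recall 0.75), but only half of what it returned was real (precision 0.50), so a human would still need to filter the output.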

How do you get high recall or high precision?

You get high precision by using rules or by having a human check your data (or multiple humans, if the phenomenon is hard to detect or the classification is tricky). You get high recall by writing a simple script, or by using statistics and setting your statistical threshold to be more permissive.
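The effect of a more permissive threshold can be sketched like this. The scores, the sentence labels, and the two threshold values are all invented for illustration; a real statistical tool would produce its own scores.

```python
# Hypothetical confidence scores a statistical tool assigned to five sentences.
scored = {"s1": 0.95, "s2": 0.80, "s3": 0.60, "s4": 0.40, "s5": 0.20}
# Hypothetical human judgement: which sentences are real examples.
gold = {"s1", "s2", "s4"}


def evaluate(threshold):
    """Keep every sentence scoring at or above the threshold, then score the result."""
    returned = {item for item, score in scored.items() if score >= threshold}
    hits = returned & gold
    precision = len(hits) / len(returned) if returned else 0.0
    recall = len(hits) / len(gold) if gold else 0.0
    return precision, recall


strict = evaluate(0.75)      # keeps only s1 and s2
permissive = evaluate(0.30)  # keeps s1, s2, s3 and s4

print("strict:     precision = %.2f, recall = %.2f" % strict)
print("permissive: precision = %.2f, recall = %.2f" % permissive)
```

With the strict threshold everything returned is a real example, but one real example is missed; with the permissive threshold all the real examples come back, along with one extraneous result for a human to filter out.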

How do you know which one you need in a given context?

When you are working on theory, ideally you want both high recall and high precision (it's basically the equivalent of necessary and sufficient conditions for defining a set). Having high recall but low precision is okay as long as your goal is to share your research and data and get feedback on the categorization of your data.
