In this tutorial, we're going to look at how Python can be put to work in the manipulation and analysis of corpora. Corpora (the plural of corpus) are collections of written texts or spoken language, usually structured in some way to facilitate their automatic processing.

Copyright notice: All reader-contributed material on freshmeat.net is the property and responsibility of its author; for reprint rights, please contact the author directly.

The Corpus

A large corpus can provide a wide variety of useful information, provided that there are decent tools to extract it. In Natural Language Processing (NLP), for example, statistical information obtained from large corpora (consisting of tens of millions of words) is used to inform many different tasks, ranging from guessing the most likely parsing for a sentence to determining the likelihood that a document matches key terms in a search.

In this tutorial, we will look at one particular English corpus, the Wall Street Journal (WSJ) corpus, which is a component of the Penn Treebank, and show how it can be manipulated using Python. (The article assumes at least basic familiarity with Python. If Python is new to you, try the Python-related links at the end of the article.) We will first build some homegrown tools for parsing and manipulating the WSJ corpus, and then discuss how the Natural Language Toolkit (NLTK) for Python can be used to accomplish some of the same tasks.

Penn Treebank

The full WSJ corpus comes with the Penn Treebank, which is available from the Linguistic Data Consortium (LDC). The full corpus is only available to members of the LDC, but a small part of it can be found in one of the NLTK's modules. Currently, there are three NLTK modules: the NLTK itself, nltk-data, and nltk-contrib.
(The latest version of the NLTK, at the time of writing, is 1.4. If you install another version, there's no guarantee that all of the code here will work.)

Obtaining the Penn Treebank
Full installation instructions for the NLTK can be found here. For now, you
only need to download and install nltk-data, instructions for the
installation of which are available for both Unix and
Windows.
We will assume here that the reader is working in a Unix environment and
that nltk-data is installed under

The Wall Street Journal (WSJ) Corpus

Our corpus of choice for this tutorial is the WSJ corpus, which consists of WSJ articles that have been tagged for their part-of-speech and annotated for their grammatical structure. For each article, there are three files: the raw text, the tagged text, and the annotated text. (We'll ignore the annotated texts here and focus on the raw and tagged ones.)

Let's have a look at a sample file from the corpus, which is a short article about Zenith obtaining a lucrative contract with the American Navy. The plain text (raw) version of the article looks like this (wsj_0099):
The tagged version of the same article looks like this (wsj_0099.pos):
In the tagged version, each sentence in the article has been broken down into words, and each word has been associated with a tag that describes how the word functions in the sentence. These tags refer to what is traditionally known as a part-of-speech, such as noun, verb, adjective, or adverb. (And if you ever watched Grammar Rock, you may remember others, like the conjunction: "Conjunction junction, what's your function? Hookin' up words and phrases and clauses.") The main tags used in the WSJ corpus are listed below (see this overview of the project from Computational Linguistics for a more complete description):
Corpus Scripting: Manipulating Corpora with Python

Why Python?

When writing programs to analyze corpora, we often want quick-and-dirty tools for the rapid extraction of information. However, we also sometimes want to build larger systems. The ideal would be to have general-purpose tools that can be reused, either in full-scale applications or in short one-off scripts. Scripting languages fit the bill quite well, especially those with very good string processing capabilities, such as Perl and Python. Since Python has the Natural Language Toolkit (NLTK), which provides various tools for natural language processing and comes with a sample of the WSJ corpus, it is our language of choice.

Extracting Tags

One question we might immediately ask ourselves is: How often do the different tags occur in the WSJ corpus? We can answer this question by extracting all of the tags from the corpus and counting the number of times they occur using a Python script written to do the job, such as count_tags.py. In broad strokes, the script does the following:
The script would be run on the command line as follows:
The output consists of two tab-separated columns. The first column lists the tags, and the second column has the number of times each occurs in the corpus. After you've run the script, see what the least and most frequent tags are. The default order is alphabetical by tag, but the output can be piped to Unix utilities to be sorted by value. We'll leave that as an exercise for the reader...
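As a rough illustration of this kind of tag counting (not the actual count_tags.py), here is a minimal sketch. It assumes the tagged files consist of whitespace-separated word/TAG tokens, with standalone square brackets and rows of equals signs as structural markup. The names split_token and count_tags are hypothetical, and splitting at the rightmost slash is an assumption made here for illustration.

```python
from collections import Counter


def split_token(token):
    """Split 'wordform/TAG' at the rightmost slash (hypothetical helper).

    Splitting from the right keeps any slashes inside the wordform intact.
    """
    wordform, _, tag = token.rpartition("/")
    return wordform, tag


def count_tags(lines):
    """Count part-of-speech tags in an iterable of tagged-text lines."""
    counts = Counter()
    for line in lines:
        for token in line.split():
            # Skip assumed structural markup: brackets and '=====' rows.
            if token in ("[", "]") or set(token) == {"="}:
                continue
            # Skip anything that is not a word/TAG pair.
            if "/" not in token:
                continue
            _, tag = split_token(token)
            counts[tag] += 1
    return counts


# Made-up sample lines in the assumed format (not taken from the corpus).
sample = [
    "======================================",
    "[ The/DT yield/NN ]",
    "rose/VBD 1\\/2/CD point/NN ./.",
]

tag_counts = count_tags(sample)

# Two tab-separated columns, most frequent tag first.
for tag, n in tag_counts.most_common():
    print(f"{tag}\t{n}")
```

Using Counter.most_common() orders the output by frequency directly, which is one alternative to piping alphabetical output through Unix sorting utilities.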
Since we assume basic familiarity with Python, we don't need to go
through count_tags.py in detail. The only part of the script that is
not straightforward is the function
Let's see how it works by looking at how an actual line from the corpus would be processed. We'll look at a line from wsj_0049.pos which poses some special challenges:
To make discussion easier, let's first establish some terminology. We
will use the term token for a particular pairing of a wordform
with a part-of-speech. In other words,
Using this terminology, we can say that
The square brackets are ignored during the next step, which is to split
a token into a wordform and a part-of-speech using a slash. However,
some wordforms contain slashes in the original article
(e.g.,

Extracting a Word List from the Penn Treebank

As another exercise in corpus manipulation, let's take our corpus and analyze the frequency of words by part-of-speech. In other words, we want to produce a list of wordforms that tells us which parts-of-speech they function as, and how frequently. The Python script make_wordlist.py accomplishes this task. In broad strokes, it does the following:
This script is run in the same manner as the last one, although the output is obviously different, consisting of three columns (wordform, tag, frequency count):
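A three-column word list of this kind can be sketched in a few lines of Python. The example below is not make_wordlist.py itself. It assumes the corpus has already been reduced to (wordform, tag) pairs, and the names make_wordlist and tag_distribution, as well as the sample data, are hypothetical.

```python
from collections import Counter


def make_wordlist(pairs):
    """Count how often each (wordform, tag) pair occurs."""
    return Counter(pairs)


def tag_distribution(wordlist, wordform):
    """Return the tag counts for a single wordform."""
    return Counter(
        {tag: n for (word, tag), n in wordlist.items() if word == wordform}
    )


# Made-up pairs standing in for tokens extracted from the corpus.
pairs = [
    ("yield", "NN"),
    ("the", "DT"),
    ("yield", "NN"),
    ("yield", "VB"),
]

wordlist = make_wordlist(pairs)

# Three tab-separated columns: wordform, tag, frequency count.
for (wordform, tag), n in sorted(wordlist.items()):
    print(f"{wordform}\t{tag}\t{n}")

# Which part-of-speech does a given wordform most often function as?
dominant_tag, dominant_count = tag_distribution(wordlist, "yield").most_common(1)[0]
```

Keying the counter on the (wordform, tag) pair keeps the structure flat and makes sorted output trivial; a nested dictionary keyed first on wordform would make per-word lookups cheaper at the cost of slightly more code.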
As before, you may want to sort the output differently using Unix utilities, but even without any custom sorting, it should be obvious that all sorts of interesting information about word usage can be obtained from this kind of word list. The sample of the WSJ corpus available in the NLTK consists of only about 40,000 words, however, which limits its utility.

As mentioned in the beginning, statistical information obtained from word lists can inform a variety of natural language processing tasks. For example, search technology can take advantage of this data to second-guess the intentions of users performing searches. To take one case, we find that the word yield functions primarily as a noun in the portion of the WSJ corpus available here:
On the basis of this type of information, we can assume that, all things being equal, if a user searches on the word yield, documents in which the word functions as a noun (e.g., wsj_0090: "They are keeping a close watch on the yield on the S&P 500.") are better matches than documents in which the word functions as a verb (e.g., wsj_0099: "There are no signs, however, of China's yielding on key issues."). The important proviso here is the qualification all things being equal. The genre of a text, the immediate local environment of a word, and a variety of other factors influence these statistics, and more sophisticated statistical models enable more sensitive fine-tuning of searches. For more information about the use of word statistics in natural language processing, see Manning and Schütze's book Foundations of Statistical Natural Language Processing.

Using the Natural Language Toolkit (NLTK)

So far, we have written our own Python code to break the corpus down into tokens, but ideally, we shouldn't have to reinvent the wheel and write all of this low-level logic. There should be pre-existing tools that know about tags and tokens and the like, which could simply be used in whatever script we write. Fortunately, the world sometimes lives up to our ideals.

Enter the Natural Language Toolkit (NLTK), which is, according to its authors, "a suite of program modules, data sets, tutorials, and exercises, covering symbolic and statistical natural language processing". In other words, the NLTK provides functionality in Python for language processing, and since it's Open Source, it's free, in every sense of the term, meaning that you can peek under the hood, tinker with it, and contribute to its development. You can learn more about what the NLTK has to offer by consulting the NLTK documentation, which is reasonably good. In addition, there are also two academic articles on the NLTK and a few tutorials.
But if you're feeling impatient and want to get your hands dirty, there is a mini NLTK tutorial by David Mertz (author of the Charming Python column).

Before we can use the NLTK, we need to install it. The first step is to download the required files for the NLTK. As you will recall, the NLTK is divided into three modules. The module nltk-data should already be installed, and the module nltk-contrib can be ignored. It's the NLTK itself that you should be installing now. After you follow the installation instructions for the NLTK, you should familiarize yourself with its contents. As a step in that direction, we'll use the NLTK's functionality to perform the same two tasks handled by the scripts discussed above.

Extracting a Tag List with the NLTK
The NLTK is organized into multiple packages which handle different
domains in natural language processing: tagging, parsing, probability,
text classification, etc. Since we are only doing fairly basic corpus
work, the only package we need is the
To illustrate the NLTK in action, let's tackle an earlier task, that of
counting the number of tags in a corpus. The script nltk_count_tags.py
should produce output identical to that of count_tags.py.
The main difference is that the parsing of corpus files and their
breakdown into sentences, words, tags, etc. is handled by the NLTK's
functionality! The script imports the
The program is run as follows:
You may have noticed that, unlike the previous scripts, this one does not take command-line arguments telling the script where the WSJ corpus files can be found. This is because the NLTK knows the location of the corpus in the filesystem. To find the path to these files and get a listing of them, you can query the NLTK using the following code (from nltk_wsj_filepaths.py):
Extracting a Word List with the NLTK

The script nltk_make_wordlist.py is very similar to make_wordlist.py. Again, the main difference is that the parsing of corpus files and their breakdown into sentences, words, tags, etc. is handled by the NLTK. The script uses the NLTK's treebank parser to read each file and tokenize it, and all of the tokens are parsed and entered into a dictionary along with their relative frequency. The program is run as follows:
Where to Go From Here

As they say, the journey of a thousand miles begins with a single step. Now that you have the NLTK installed and have used a small part of its functionality to perform a few simple tasks, you're ready to dig more deeply into corpus linguistics. The first step is to learn about some of the other parts of the NLTK, for tagging or parsing or text classification. Of course, the best programming skills in the world won't make up for bad theory and/or poor algorithms, so you might try reading more widely in the fields of linguistics and computational linguistics.

Author's bio: Stuart Robinson is a Ph.D. student at the Max Planck Institute for Psycholinguistics in Nijmegen, The Netherlands, where he is conducting fieldwork-based research on Rotokas (a non-Austronesian language spoken in Bougainville, Papua New Guinea) as part of the Pioneers of Island Melanesia Project. He is currently co-authoring (with Harald Baayen) an introductory Python programming textbook for language researchers.
NLTK is interesting, but only as a starting point :-)

While this article is an interesting read, I think that some more alternatives should be presented. Yes, NLTK is a very interesting toolkit, especially when it comes to parsing, as a large number of parsers are included. First of all, when it comes to language processing, the Tcl scripting language should also be considered. It has the most mature Unicode support of all scripting languages (and I think that Python's Unicode support was initially based on Tcl's) and is, to my knowledge, the only language that has full Unicode support in regular expressions.

But as the choice of language is also a matter of personal taste, I want to point out that there is a platform specialised for NLP called Ellogon (http://www.ellogon.org), which offers the basis for processing components that can scale to really large corpora, and allows component development in C++, Tcl, Java, Python, and Perl. This means that you can have components in various languages that can cooperate and communicate with each other.