TreeTagger - a language independent part-of-speech tagger
The TreeTagger is a tool for annotating text with part-of-speech and lemma
information. It was developed within the TC project at the
Institute for Computational Linguistics of the University of Stuttgart. The
TreeTagger has been successfully used to tag German, English, French, Italian,
Spanish, Bulgarian, Greek and Old French texts and is easily adaptable to other
languages if a lexicon and a manually tagged training corpus are available.
Sample output:
word       | pos  | lemma
The        | DT   | the
TreeTagger | NP   | TreeTagger
is         | VBZ  | be
easy       | JJ   | easy
to         | TO   | to
use        | VB   | use
.          | SENT | .
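The sample output above can be read into a program as (word, pos, lemma) records. The following is a minimal sketch, assuming the tagger writes one token per line with the three fields separated by tab characters; the names `Token` and `parse_tagger_output` are hypothetical helpers for illustration and are not part of the TreeTagger distribution.

```python
from typing import List, NamedTuple


class Token(NamedTuple):
    """One tagged token: surface form, part-of-speech tag, lemma."""
    word: str
    pos: str
    lemma: str


def parse_tagger_output(text: str) -> List[Token]:
    """Parse one-token-per-line, tab-separated output into Token records.

    Assumption: each token line has exactly three tab-separated fields.
    Blank lines and lines with a different field count are skipped.
    """
    tokens = []
    for line in text.splitlines():
        fields = line.split("\t")
        if len(fields) != 3:
            continue  # not a word/pos/lemma triple
        tokens.append(Token(*fields))
    return tokens


# The sample sentence from this page, written as tab-separated lines.
sample = (
    "The\tDT\tthe\n"
    "TreeTagger\tNP\tTreeTagger\n"
    "is\tVBZ\tbe\n"
    "easy\tJJ\teasy\n"
    "to\tTO\tto\n"
    "use\tVB\tuse\n"
    ".\tSENT\t.\n"
)
tokens = parse_tagger_output(sample)
print([t.lemma for t in tokens])
```

Skipping malformed lines rather than raising an error is a design choice for this sketch; a stricter parser may be preferable when the input format is known exactly.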
The tagger is described in the following two papers:
Download
Executable code for Sparc workstations, Linux and Windows PCs,
and Macs, as well as parameter files for English, German, Italian, Spanish,
Bulgarian, French and Old French, can be downloaded via the links below.
The French and the Italian parameter files are provided by Achim Stein.
The English parameter file was trained on the Penn Treebank and uses the English morphological database created
by Karp, Schabes, Zaidel and Egedi.
The Spanish parameter file was trained on the Spanish CRATER
corpus and uses the Spanish lexicon of the CALLHOME corpus of the LDC.
The Bulgarian parameter file was trained by Julien Nioche on the Bulgarian Treebank. It uses
UTF-8 encoding.
This software is freely available for research, education and evaluation. For
commercial licenses and for licenses for the C programming interface, please contact
Helmut Schmid (at FirstName.LastName@ims.uni-stuttgart.de).
Please read the license terms before you download the software! By downloading
the software, you agree to the terms stated there.
The following steps are necessary to install the TreeTagger (see below for
the Windows version):
- Download the tagger package for your system (Sparc-Solaris, PC-Linux, Mac OS-X).
- Download the tagging scripts into the same directory.
- Download the parameter files for your system (Sparc-Solaris, PC, Mac).
- Download the installation script install-tagger.sh.
- Open a terminal window and run the installation script in the directory
  where you downloaded the files:
    sh install-tagger.sh
- Run a test, e.g.
    echo 'Hello world!' | cmd/tree-tagger-english
  or
    echo 'Das ist ein Test.' | cmd/tagger-chunker-german
If you have difficulties with the installation, have a look at the
installation hints (kindly provided by Joachim Wagner).
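The test commands in the installation steps pipe text into a tagging script on standard input. The same call can be made from a program. The following is a rough sketch, assuming the installed scripts under `cmd/` are executable and read plain text from standard input; `run_tagger` is a hypothetical helper, and its `encoding` default reflects the Latin1 parameter files listed on this page (the Bulgarian file uses UTF-8 instead).

```python
import subprocess
from typing import Sequence


def run_tagger(text: str, command: Sequence[str], encoding: str = "latin-1") -> str:
    """Pipe `text` into `command` on stdin and return its decoded stdout.

    Assumption: `command` reads raw text from stdin and writes its result
    to stdout, as the `echo ... | cmd/...` test commands suggest.
    """
    result = subprocess.run(
        list(command),
        input=text.encode(encoding),  # bytes in the parameter file's character set
        stdout=subprocess.PIPE,
        check=True,  # raise CalledProcessError if the command exits non-zero
    )
    return result.stdout.decode(encoding)


# Hypothetical usage, run from the installation directory:
# print(run_tagger("Hello world!\n", ["cmd/tree-tagger-english"]))
```

Passing the command as a list avoids shell quoting problems, and taking the encoding as a parameter keeps the helper usable with both the Latin1 and the UTF-8 parameter files.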
Parameter files for Sparc-Solaris and Mac OS-X (Latin1 character set)
- English parameter file (3045 kByte, gzip compressed)
- German parameter file (7012 kByte, gzip compressed)
- small German parameter file (2415 kByte, gzip compressed)
- French parameter file (2375 kByte, gzip compressed)
- Italian parameter file (5484 kByte, gzip compressed)
- Spanish parameter file (918 kByte, gzip compressed)
- Bulgarian parameter file (603 kByte, gzip compressed)
- German chunker parameter file (52 kByte, gzip compressed)
  Note: The German tagger parameter file is needed as well.
- English chunker parameter file (82 kByte, gzip compressed)
  Note: The English tagger parameter file is needed as well.
Parameter files for PC (Linux and Windows, Latin1 character set)
- English parameter file (2945 kByte, gzip compressed)
- German parameter file (6642 kByte, gzip compressed)
- small German parameter file (2340 kByte, gzip compressed)
- French parameter file (2336 kByte, gzip compressed, information about this file)
- Italian parameter file (3238 kByte, gzip compressed, information about this file)
- Spanish parameter file (899 kByte, gzip compressed)
- Bulgarian parameter file (579 kByte, gzip compressed)
- German chunker parameter file (52 kByte, gzip compressed)
  Note: The German tagger parameter file is needed as well.
- English chunker parameter file (82 kByte, gzip compressed)
  Note: The English tagger parameter file is needed as well.
A Windows version of the TreeTagger is also available. The parameter files
have to be downloaded separately.
Tagsets
Here is some information about the tagsets used in the parameter files:
- English (Penn-Treebank tagset)
  The tagset used by the TreeTagger is a refinement of this tagset where the
  second letter of the verb part-of-speech tags distinguishes between "be"
  verbs (B), "have" verbs (H) and other verbs (V).
- German (in German)
- French (in French)
- Italian
- Spanish
- Bulgarian
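The refinement of the English verb tags described above can be handled in code by inspecting the second letter of the tag. The following is a minimal sketch, assuming verb tags start with "V" and otherwise keep their Penn-Treebank suffix (for example VHZ or VVD, which are inferred from the rule above rather than quoted from the tagset documentation); `verb_class` and `to_penn_tag` are hypothetical helper names.

```python
from typing import Optional

# Second letter of a refined verb tag -> verb class, as described in the text.
VERB_CLASSES = {"B": "be", "H": "have", "V": "other"}


def verb_class(tag: str) -> Optional[str]:
    """Return 'be', 'have' or 'other' for a refined verb tag, else None."""
    if len(tag) >= 2 and tag[0] == "V" and tag[1] in VERB_CLASSES:
        return VERB_CLASSES[tag[1]]
    return None


def to_penn_tag(tag: str) -> str:
    """Collapse a refined verb tag to its Penn-Treebank form.

    Assumption: the Penn-Treebank verb tags all begin with 'VB', so the
    refinement is undone by setting the second letter back to 'B'.
    Non-verb tags are returned unchanged.
    """
    if verb_class(tag) is None:
        return tag
    return "VB" + tag[2:]


print(verb_class("VHZ"), to_penn_tag("VHZ"))
```

Mapping back to the unrefined form can be useful when tagger output is compared against resources annotated with the original Penn-Treebank tags.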
Links