Secteur TAL Informatique,
Université Sorbonne nouvelle, Paris 3
19 rue des Bernardins, 75005 Paris
Statistical Natural Language Processing (using Perl)
Below you will find tools, texts, and links related to the field of "Statistical Natural Language Processing". This information was gathered in particular from the course developed by J. Goldsmith at the following address: http://humanities.uchicago.edu/faculty/goldsmith/
See also the TP7 page
The LEXICOMETRICA site
Reference books
Foundations of Statistical Natural Language Processing by Christopher D. Manning and Hinrich Schütze (MIT Press, 1999).
- Manning and Schütze have some great materials and links on the Web that you must explore. Link: http://nlp.stanford.edu/fsnlp/.
Online courses built around this book
- USA
- UPenn CIS530, UPenn CIS639, Berkeley SIMS 296a-4, Johns Hopkins: current (Eisner) and previous [lots of great slides by Jan Hajic!], Brown CS241, CMU 11-682, CMU 11-761, Stanford CS224N, MIT 6.863, Oregon Graduate Institute CSE580, Ohio State, U Chicago, Tufts, Minnesota, SUNY Albany, San Diego SU, Mississippi State.
- Canada
- U Toronto CSC401, Dalhousie U CSCI4152 [lots of slides!], U Alberta.
- Europe
- Edinburgh Cogsci MSc, Helsinki University of Technology TIK-61.182, Munich, Bochum, Göteborg.
- Asia
- Pohang University of Science and Technology (POSTECH), National Cheng-chi University, Taiwan
Speech and Language Processing by Daniel Jurafsky and James H. Martin (Prentice-Hall, 2000).
Texts available in this distribution (PDF)
Texts by J. Goldsmith (PDF or PowerPoint)
Perl
Source for Perl: http://www.perl.com/pub/language/info/software.html
Eric Brill's short guide to Perl.
Readings
- Eric Brill :
- Transformation-Based Error-Driven Learning and Natural Language Processing: A Case Study in Part-of-Speech Tagging. Computational Linguistics 21(4), 1995, pp. 543-565.
- Unsupervised Learning of Disambiguation Rules for Part of Speech Tagging. Proc. of 3rd Workshop on Very Large Corpora, MIT, June. Also appears in Natural Language Processing Using Very Large Corpora, 1997.
Readings on the web
NLP topics
Probability and information
- A good introduction to information theory by Thomas D. Schneider (postscript format), and more advanced material by David J.C. MacKay. Don't miss Chris Hillman's entropy page.
- Notes by John Goldsmith: a document on probability (and the older version of it), and one on information theory.
Collocations and mutual information.
Use the various biword frequency Perl scripts linked below.
- Biwords1.pl basic program for exploring biwords composed of neighboring words.
- Biwords2.pl program with more command-line options.
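The Biwords scripts themselves are not reproduced on this page. As a rough illustration of what "biwords composed of neighboring words" means, here is a minimal sketch in Python (not the course's Perl). The function name `count_biwords` and its `strip_punctuation` option are hypothetical and are not taken from Biwords1.pl or Biwords2.pl.

```python
import re
from collections import Counter

def count_biwords(text, strip_punctuation=True):
    """Count adjacent word pairs (biwords) in a text."""
    if strip_punctuation:
        # Replace everything except word characters, whitespace and
        # apostrophes with a space.
        text = re.sub(r"[^\w\s']", " ", text)
    words = text.lower().split()
    # zip pairs each word with its right-hand neighbor.
    return Counter(zip(words, words[1:]))

sample = "The cat sat on the mat. The cat slept."
for (left, right), n in count_biwords(sample).most_common(3):
    print(left, right, n)
```

Note that once punctuation is stripped, pairs are also counted across sentence boundaries ("mat the" in the sample); whether that is desirable depends on the experiment.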
N-gram models over sparse data.
- Mutual Information: Biwords4.pl. The command line format is much like what you're accustomed to: perl Biwords4.pl, followed by the name of the data file, the name of the output file, and the parameter p (for "punctuation"). This eliminates punctuation, which is essential: if you leave punctuation in, opening the output file in a spreadsheet program will confuse it endlessly (it's the quotation marks that kill it). The output is a matrix of the top 36 words by frequency, with the frequencies of the biwords made up of these words. The results are quite striking! Look at how much blank space there is: this is the sparseness of language. Look also at the joint entropy (the spread-out-ness of the biwords as a data sample in biword space) and at the mutual information. Notice that the joint entropy plus the mutual information equals the sum of the entropy of the set of words on the left and the entropy of the set of words on the right.
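The identity mentioned above, H(left, right) + I(left; right) = H(left) + H(right), can be checked numerically on any table of biword counts. The sketch below is in Python and is not a port of Biwords4.pl; the toy counts and all function names are made up for illustration. Mutual information is computed directly from its definition, so the final check is not circular.

```python
import math
from collections import Counter

def entropy(counts):
    """Shannon entropy (in bits) of the distribution given by a table of counts."""
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total)
                for n in counts.values() if n > 0)

def marginals(biword_counts):
    """Counts of left-hand words and of right-hand words."""
    left, right = Counter(), Counter()
    for (w1, w2), n in biword_counts.items():
        left[w1] += n
        right[w2] += n
    return left, right

def mutual_information(biword_counts):
    """I(left; right) = sum over pairs of p(x,y) * log2(p(x,y) / (p(x) p(y)))."""
    total = sum(biword_counts.values())
    left, right = marginals(biword_counts)
    mi = 0.0
    for (w1, w2), n in biword_counts.items():
        p_xy = n / total
        mi += p_xy * math.log2(p_xy / ((left[w1] / total) * (right[w2] / total)))
    return mi

# Hypothetical toy counts, standing in for a real biword matrix.
toy = Counter({("of", "the"): 6, ("in", "the"): 3, ("of", "a"): 1, ("in", "a"): 2})
left, right = marginals(toy)
lhs = entropy(toy) + mutual_information(toy)
rhs = entropy(left) + entropy(right)
print(abs(lhs - rhs) < 1e-9)
```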
Inferring morphological structure from a corpus
- Download Linguistica from http://humanities.uchicago.edu/faculty/goldsmith/ , and learn how to use it.
- Download a Zellig perl script to implement Zellig Harris's algorithm.
- Learning morphology from a pre-analyzed corpus: van den Bosch, Daelemans, Weijters 1996
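The Zellig script is not shown here. The idea usually attributed to Harris is to count, for each prefix of a word, how many distinct letters follow that prefix anywhere in the corpus; a peak in this "successor variety" suggests a morpheme boundary. The following Python sketch illustrates that counting step only, under the simplifying assumptions that the corpus is a plain list of words and that end-of-word is not counted as a successor. The names `build_successors` and `successor_variety` are hypothetical.

```python
from collections import defaultdict

def build_successors(vocabulary):
    """Map each proper prefix to the set of letters that follow it in the vocabulary."""
    followers = defaultdict(set)
    for w in vocabulary:
        for i in range(1, len(w)):
            followers[w[:i]].add(w[i])  # prefix w[:i] is followed by letter w[i]
    return followers

def successor_variety(word, followers):
    """List (prefix, number of distinct following letters) for each proper prefix of word."""
    return [(word[:i], len(followers.get(word[:i], ())))
            for i in range(1, len(word))]

# Hypothetical toy vocabulary.
vocabulary = ["walk", "walks", "walked", "walking",
              "talk", "talks", "talked", "wall"]
followers = build_successors(vocabulary)
for prefix, variety in successor_variety("walked", followers):
    print(prefix, variety)
```

On this toy vocabulary the count is highest after "walk", which is where one would place the stem boundary; real implementations add thresholds and often a matching predecessor count from the right.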
Phonology
- Are there more monosyllables or four-syllable words in English (in the dictionary, that is)? Answer: about 15% of the words in the CMU dictionary are four syllables long, and only 13.5% are monosyllables. Run CountSyllables.pl on our (not perfectly syllabified) English word-list (unavailable).
- What are the most common syllable codas in English? (Null, n, l, and r, in that order.) What is the most common two-letter coda? (nt.) The most common three-letter coda? Look it up, or run WordCoda.pl.
- What are the onsets in English, and how do they compare to word-initial onsets? Run Onsets.pl.
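Since the word-list and the scripts are not available here, the following Python sketch only illustrates the kind of counting these questions call for. It assumes a hypothetical input format in which each word is written with hyphens between syllables (for example "el-e-phant"), and it approximates the coda orthographically as the letters after the last vowel letter of a syllable. Neither assumption comes from CountSyllables.pl or WordCoda.pl.

```python
from collections import Counter

VOWELS = set("aeiouy")  # crude orthographic vowel set, an assumption of this sketch

def coda(syllable):
    """Letters after the last vowel letter of a syllable ('' for a null coda)."""
    for i in range(len(syllable) - 1, -1, -1):
        if syllable[i] in VOWELS:
            return syllable[i + 1:]
    return ""  # no vowel letter found; treated as a null coda here

def syllable_stats(syllabified_words):
    """Count words by number of syllables, and codas over all syllables."""
    length_counts = Counter()
    coda_counts = Counter()
    for entry in syllabified_words:
        syllables = entry.lower().split("-")
        length_counts[len(syllables)] += 1
        for s in syllables:
            coda_counts[coda(s)] += 1
    return length_counts, coda_counts

# Hypothetical toy word-list.
words = ["cat", "win-dow", "el-e-phant", "a-mer-i-can"]
lengths, codas = syllable_stats(words)
print(lengths.most_common())
print(codas.most_common(3))
```

An onset count would work the same way, taking the letters before the first vowel letter instead.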
Statistical Natural Language Processing LINKS
Tools
Important statistical tools (system called "R"): http://www.r-project.org/
Part of Speech Taggers
Freely downloadable
- fnTBL
- A fast and flexible implementation of Transformation-Based Learning. Includes a tagger, but also NP chunking, etc.
- mu-TBL
- A Prolog implementation of a Transformation-based Learner (a la Brill), usable for POS tagging and other things by Torbjörn Lager. Web demo also available.
- Original Xerox Tagger
- A common lisp HMM tagger available by ftp. [Or only used to be?]
- Brill's Transformation-based learning Tagger
- A C symbolic tagger. Also available by ftp, and as a Windows version, with stuff for French.
- TreeTagger
- A decision tree based tagger from the University of Stuttgart (Helmut Schmid). It's language independent, but comes complete with parameter files for English, German, French, and Italian. (Solaris and Linux versions.) Used at visl.hum.ou.dk.
- Maximum Entropy part of speech tagger
- By Adwait Ratnaparkhi. JAVA version now downloadable. A sentence boundary detector is also available. [Helpful hint: This only works with JDK1.1. It doesn't work with JDK1.2+.]
- QTAG Part of speech tagger
- An HMM-based Java POS tagger from Birmingham U. (Oliver Mason).
- The TOSCA/LOB tagger.
- Currently available for MS-DOS only. But the decision to make this famous system available is very interesting from an historical perspective, and for software sharing in academia more generally. LOB tag set.
Free, but require registration
- ICOPOST
- C taggers by Ingo Schröder that implement maximum entropy, HMM trigram, and transformation-based learning. C source available under GNU public license.
- LT POS and LT TTT
- Edinburgh Language Technology Group tagger and text tokenizer (and sentence splitter). Binary only for Solaris. Doesn't allow you to train your own taggers.
- TATOO, The ISSCO tagger.
- HMM tagger. Need to register to download.
- PoSTech Korean morphological analyzer and tagger. Follow the links Open Resources - DownLoad.
- TnT - A Statistical Part-of-Speech Tagger
- Trainable for various languages, comes with English and German pre-compiled models. Runs on Solaris and Linux.
Usable by email or on the web, but not distributed freely
- Memory-based tagger
- From ILK group, Catholic University Brabant (Jakub Zavrel/Walter Daelemans). Does Dutch, English, Spanish, Swedish, Slovene. Other MBL demos are also available.
- Birmingham tagger by email
- Accepts only plain ASCII email message contents. The tagset used is similar to the Brown/LOB/Penn set.
- CLAWS tagger
- The UCREL CLAWS tagger is available for trial use on the web. (It's limited to 300 words though -- this site is more of an advertisement for licensing the real thing -- available as software for Suns or as a paid service.) You can also find info on CLAWS tagsets, though that page doesn't seem to link to the C7 tagset.
- The AMALGAM tagger
- The AMALGAM Project also has various other useful resources, in particular a web guide to different tag sets in common use. The tagging is actually done by a (retrained) version of the Brill tagger (q.v.).
- Xerox XRCE MLTT Part Of Speech Taggers
- Tags any of English, Spanish, French, Italian, Portuguese, German, and Dutch, online on the web.
- Portuguese tagger (Projecto Natura) on the web
Not free
- Lingsoft
- Lingsoft in Finland has (symbolic) analysis tools for many European languages. More information can be obtained by emailing info@lingsoft.fi. There is an online demo.
- Conexor
- Conexor in Finland has demonstrations of EngCG-style taggers and parsers, for English, Swedish, and Spanish.
- Xerox
- Xerox has morphological analyzers and taggers for many languages. There are demos of some of their tools on the web. More information can be obtained by contacting Daniella Russo.
Parsers
Information on available probabilistic parsers can be found on the FSNLP: probabilistic parsing links page.
Language modeling toolkits
Downloadable
Downloadable, but requires registration
- The SRI Language Modeling toolkit
- by Andreas Stolcke is another good system for building language models, freely available for research purposes.
Not yet classified
- Lextools is a package of tools for creating weighted finite-state transducers (WFST) from high-level linguistic descriptions. Lextools binaries are available free for non-commercial use at: http://www.research.att.com/sw/tools/lextools/. Supported platforms are: linux (i686), sgi (mips2) and sun4. Lextools is built on top of, and requires, the AT&T WFST toolkit (version 3.6), available free for non-commercial use from: http://www.research.att.com/sw/tools/fsm/
Friendly concordancing and text analysis tools
- Wordsmith Tools (Mike Scott)
- The thing to get if you are working in the Windows world.
Other
Downloadable
- Bigram Statistics Package
- Perl code that implements: Fisher's exact test, the likelihood ratio, Pearson's chi squared test, the Dice Coefficient, and Mutual Information.
- ISIP tools
- The main aim is a publicly available speech recognition system (alpha release available), but along the way there are also toolkits for discrete HMMs and statistical decision trees, and for various aspects of signal processing.
- Mem. A Perl implementation of Generalized and Improved Iterative Scaling
- by Hugo WL ter Doest.
- Automorphology
- A system (for Windows) for automatically learning the morphological forms of words in a corpus by John Goldsmith.
- Wordnet
- Wordnet is available by ftp, compiled for a variety of machine types. For money, one can also get EuroWordNet for various European languages.
- Penn XTAG project
- A wide-coverage tree-adjoining grammar written in a mixture of C and Common Lisp. Also includes a large coverage morphological analyzer. Now includes more tools such as TCL/Tk tree viewer.
- Dan Melamed's Tools
- A collection of tools including a simulated annealing program, a post-processor for English stemming for the Penn XTAG morphology system, Good-Turing smoothing software, general text processing tools, text statistics tools and bitext geometry tools (mainly written in Perl 5).
- MULTEX
- Constructing corpora and tools for processing multilingual corpora. Contact: Jean Veronis, veronis@univ-aix.fr. Some stuff including a multilingual text editor is downloadable.
- Naive Bayes algorithm
- Software from the Rainbow/Libbow software package that implements several algorithms for text categorization, including naive Bayes, TF.IDF, and probabilistic algorithms. Accompanies Tom Mitchell's ML text.
- A prototype Java Summarisation applet (System Quirk)
- Emdros: a text database engine for linguistic analysis and research
Free, but require registration
- Stuttgart's IMS Corpus Workbench (CWB)
- A workbench for full-text retrieval from large corpora (with a query language and corpus indexing). Includes the Corpus Query Processor (CQP) and xkwic. Available free for research groups (currently only as Solaris 1/2 or Linux binaries), on signing a license agreement.
- Gate
- University of Sheffield's General Architecture for Text Engineering. Primarily an Information Extraction system.
- MITRE's Alembic Workbench
- A workbench for the development of tagged corpora. Includes a tagger based on Brill's TBL approach.
- SNoW
- SNoW is a learning program that can be used as a general purpose multi-class classifier and is specifically tailored for learning in the presence of a very large number of features. The learning architecture is a sparse network of linear units over a pre-defined or incrementally acquired feature space (Dan Roth).
- Tilburg University's TiMBL
- Tilburg's Memory Based Learner. A general near-neighbour-based machine learning package, but optimized for statistical NLP applications. Follow the "Software" link.
Unsure
- INTEX
- A finite-state transducer analysis system for English, French, and Italian that runs under NextStep. Contact: Max Silberztein, silberz@ladl.jussieu.fr
The PennTools page collects information on a variety of NLP systems, many of which are available externally.
Corpora
On-line resources
The best English dictionary resources: ftp://svr-ftp.eng.cam.ac.uk/pub/pub/pub/comp.speech/dictionaries
Mike Barlow's corpus linguistics resources: http://www.ruf.rice.edu/~barlow/corpus.html
European language links: http://www.lib.virginia.edu/wess/etexts.html
One of many multi-lingual Bible sites: http://www.godonthe.net/evidence/language.htm. I'm sure there are better; send me one if you find an especially good one.
Many dictionaries from many languages: http://www.yourdictionary.com/
Moby Project: http://www.dcs.shef.ac.uk/research/ilash/Moby/
the Penn resources list: http://www.cis.upenn.edu/~adwait/penntools.html
Large collections aimed at the NLP community
- LDC (Linguistic Data Consortium)
- Email: ldc@ldc.upenn.edu. Provides the largest range of corpora on CD-ROM. Cost ranges from cheap (e.g., ACL-DCI disk) to pricey. CDs can be purchased individually; institutions can become members and receive discounts on CDs. Their catalog and some other info are available by ftp. There's an LDC Online service for searches over the web (mainly intended for members, but there are samplers available).
- ACL/DCI (Association for Computational Linguistics Data Collection Initiative)
- Email: fel@unagi.cis.upenn.edu. Results are obtainable through LDC.
- European Language Resources Association
- Rapidly growing collection of materials in European languages. The RELATOR homepage, which was the first attempt, still exists, but it's largely moribund, and you should go straight to ELRA.
- ICAME (International Computer Archive of Modern English)
- Sells various corpora (including Brown and London-Lund). Information on corpora is available on the web, by sending the message help to fileserv@nora.hd.uib.no, or by ftp to nora.hd.uib.no. Also, manuals for these corpora.
- TRACTOR
- TELRI Research Archive of Computational Tools and Resources. Corpora, many multilingual, in European community languages. Small fee for joining in order to be able to get corpora (unless you have contributed corpora).
- CLR (Consortium for Lexical Research)
- Email: lexical@nmsu.edu. Focuses more on language processing tools and lexicons, but does have some corpora. As of Feb 1996, you can get most of their stuff by anonymous ftp to clr.nmsu.edu. Their catalog is available as a postscript file.
- OTA (Oxford Text Archive)
- Provides mainly literary texts. Has a bright new web site. Email: info@ota.ahds.ac.uk. Most materials are available on the web or by anonymous ftp to ota.ox.ac.uk. Some require negotiations with the providers.
- BNC (British National Corpus)
- A 100 million word corpus of British English. Now available to people outside the European Union! You can search it online from a simple web interface and there is an index to genres by David Lee and others.
- European Corpus Initiative Multilingual Corpus I (ECI/MCI)
- A 98 million word corpus, covering most of the major European languages, as well as Turkish, Japanese, Russian, Chinese, and Malay. Cheap. You need to sign a license agreement, available at the WWW site. Also available from the LDC.
- Survey of English Usage
- At the Department of English Language and Literature at University College London. Includes the British part of ICE, the International Corpus of English project. Now available tagged, and parsed for function. 83,419 sentences. Includes ICECUP, dedicated retrieval software. ICE-NZ. ICE-HK. ICE-East Africa is available on the ICAME-2 CD.
- Corpora held by Lancaster University
- This link provides its own annotations.
- The European Language Activity Network
- Promises a uniform query language for accessing corpora in all EU languages -- but isn't quite there yet.
Particular languages
English
English language corpora available from the sites above are not repeated here.
- Corpora by Geoffrey Sampson's team
- The SUSANNE corpus and the CHRISTINE corpus (SUSANNE markup of a speech corpus).
- Penn-Helsinki Parsed Corpus of Middle English
- A syntactically annotated corpus of the Middle English prose samples in the Helsinki Corpus of Historical English, with additions. 1.3 million words. $200.
- Corpus of Professional, Spoken American-English (CPSA)
- 2 million words from faculty and committee meetings and White House press conferences (50K word sample free on the internet).
- Lancaster Parsed Corpus
Multilingual
- World Health Organization Computer Assisted Translation page.
- Also includes a good selection of links on Computer Assisted Translation. (See also the copyright page.)
- Searchable Canadian Hansard French-English parallel texts (1986-1993)
- From the Laboratoire de Recherche Appliquée en Linguistique Informatique, Université de Montréal
- European Union web server
- Parallel text in all EU languages.
- TELRI CD-ROMs
- Parallel and other text in Central and Eastern European languages.
Bosnian
Czech
- Parallel Czech-English
- Literature translations in Czech and English
- Czech National Corpus project: SYN2000
- 100 million words of contemporary Czech.
- The Prague Dependency Treebank.
- Contains half a million words of Czech, analyzed at the levels of parts of speech and syntactic functions (and, in the future, semantic roles) in a dependency framework. Free on completion of license agreement.
- CKIP Chinese Treebank (Taiwan). Based on Academia Sinica corpus.
- LDC Chinese Treebank. 100,000 words. More info
- LDC Korean Treebank.
French
- Association des Bibliophiles Universels
- Various French literary works.
- American and French Research on the Treasury of the French Language (ARTFL)
- 150 million word corpus of various genres of French. You have to be a member to use it (but membership is fairly cheap).
German
- COSMAS Corpus
- Large online-searchable German corpus
- NEGRA Corpus
- Saarland University Syntactically Annotated Corpus of German Newspaper Texts. Available free of charge to academics. 20,000 sentences, tagged, and with syntactic structures.
Russian
- Library of Russian Internet Libraries
- Various literary works.
Slovene
- Slovene-English parallel corpus
- 1 M words, free to download + on-line concordances.
- Coming soon: Slovene reference corpus of 100 M words
Spanish and Portuguese
- TychoBrahe Parsed Corpus of Historical Portuguese
- Over a million words of Portuguese from different historical periods, some of it morphologically analyzed/tagged. Free.
- Information about Mark Davies' collection of (mainly historical) Spanish and Portuguese texts.
- It's not clear what their availability is.
- The CUMBRE corpus. Contact Professor Aquilino Sánchez.
- The CRATER Spanish corpus
- Morphosyntactically tagged telecommunication manuals. Available by ftp.
- NLP resources for Portuguese
- Lists corpora, dictionaries, terminological databases, tools and other possible pointers of interest.
- Folha de S. Paulo newspaper
- 4 annual CDROMs with full text.
- COMPARA
- Portuguese-English parallel corpus
- See also under ELRA, above.
Swedish
- Spraakdata, Department of Swedish, Göteborg University.
- Has various searchable part of speech tagged Swedish corpora (Parole, Bank of Swedish, etc.), and some material in Zimbabwean languages.
Treebanks
- Penn Treebank
- Distributed by the LDC.
- BLLIP WSJ corpus
- Automatically parsed WSJ newswire, distributed by the LDC.
- ICE-GB: the British part of ICE, the International Corpus of English project. Tagged and parsed for function. 83,419 sentences.
- NEGRA Corpus
- Saarland University Syntactically Annotated Corpus of German Newspaper Texts. Available free of charge to academics on completion of license agreement. 20,000 sentences, tagged, and with syntactic structures.
- TIGER project. A large collection of parsed German newswire, under construction.
- Verbmobil Tübingen: under construction treebanked corpus of German, English, and Japanese sentences from Verbmobil (appointment scheduling) data
- The Prague Dependency Treebank.
- Contains half a million words of Czech, analyzed at the levels of parts of speech and syntactic functions (and, in the future, semantic roles) in a dependency framework. Free on completion of license agreement.
- Syntactic Spanish Database (SDB), University of Santiago de Compostela. 160,000 clauses / 1.5 million words.
- Bulgarian Treebank. An under construction Bulgarian HPSG treebank. Currently POS-tagged texts are available.
- Floresta Sintá(c)tica project: under construction Portuguese treebank.
- Dublin-Essex Treebank project
- Deriving Linguistic Resources from Treebanks
Literature
There are now quite large collections of online literature, available in various languages (though the majority are in English, of course). Below are pointers to some of the main collections:
Entirely or mainly English
- Alex: A Catalogue of Electronic Texts on the Internet
- Seems to have one of the largest collections. Searching and browsing facilities through gopher menus. Many languages.
- Wiretap Electronic Text Archive
- Extensive and good quality. Still in the gopher age, though.
- The On-line Books Page
- The index here only covers books in English, but there are lots of links to other collections of material in all languages.
- Project Gutenberg
- The oldest and largest project to get out-of-copyright literature online, freely available. (Or see the mirror, Sailor's Project Gutenberg site.)
- The Electronic Text Center of the University of Virginia
- Large collection of SGML text, mainly in English, but also in other major languages.
- Center for Electronic Texts in the Humanities
- Princeton/Rutgers collaboration. They didn't have it together with their web site when I stopped by, but they may soon.
- Oxford Electronic Text Library Editions
- Available from Oxford University Press, 200 Madison Ave, NY, NY 10016 212-679-7300. The Complete Works of Jane Austen is $95.00, and is reviewed in Computers and the Humanities, 28:4-5 (Aug/Oct, 1994), 317-321.
Acquisition data
- CHILDES database.
- Database of child language transcriptions in English and many other languages. Texts are also available by ftp. Certain usage requirements. Manuals and programs for accessing the data (the CLAN concordancer) are also available online.
SGML/XML
- Robin Cover's SGML/XML Web Page
- This is a wonderful compendium of information on SGML and XML, including information on the Text Encoding Initiative (TEI). This document is also a guide to many text collections (ones using SGML).
- Information about the Text Encoding Initiative (TEI). (The Pizza Chef acts as a TEI tag set selector.)
- Microsoft's XML page
- W3C XML page.
- The Corpus Encoding Standard.
- An SGML instance designed for language engineering applications. Also the XML version.
Dictionaries
Dictionaries of subcategorization frames
The following dictionaries all list surface subcategorization frames (each with a different annotation scheme). They are also all available in electronic form from the publishers (not free).
- COBUILD
- Collins Cobuild English Language Dictionary. London: Collins, 1987. The COBUILD web site lets you search their Bank of English corpus (but you need to pay to get more than a trial).
- LDOCE
- Longman Dictionary of Contemporary English. Burnt Mill, Essex: Longman, 1978.
- OALD
- Oxford Advanced Learner's Dictionary of Current English. Oxford: Oxford University Press, Fourth Edition, 1989. The third edition also had information on subcategorization frames, although in a different incompatible format. However, a partial version of the third edition (with this information) is available free online from the Oxford Text Archive.
Not exactly a dictionary, but another popular source is:
- Levin (1993)
- Beth Levin. 1993. English Verb Classes and Alternations: A Preliminary Investigation. Chicago: University of Chicago Press. Discusses linguistic distinctions (like unergative/unaccusative verbs, dative shift, etc.) not made by the above dictionaries. The index of verbs is online.
See also COMLEX and CELEX available from the LDC.
Dictionaries of assorted languages on the web
- The old version of Robert Beard's Web of Online Dictionaries long ago mutated into YourDictionary.com. I'm told the IPO has been delayed. Nevertheless, it's the most comprehensive index of dictionaries available on the web.
Names
U.S. names with frequency information are available from the Census Bureau.
SGML structured dictionaries
- Cambridge International Dictionary of English and other products in SGML.
Lexical/morphological resources
- English SENSEVAL Resources
- Dictionary entries and tagged examples for 35 words.
- ARIES Natural Language Tools
- Lexicons and morphological analysis for Spanish. There is a free Prolog demonstrator, but the real lexicons and C/C++ access tools cost money.
Courses, Syllabi, and other Educational Resources
"Techie"
- Foundations of Statistical Natural Language Processing
- Some information about, and sample chapters from, Christopher Manning and Hinrich Schütze's new textbook, published in June 1999 by MIT Press. Read about courses using this book.
- Corpus-based Linguistics
- Christopher Manning's Fall 1994 CMU course syllabus (a postscript file).
- Statistical NLP: Theory and Practice
- Christopher Manning's Spring 1996 CMU course materials.
- John Lafferty and Roni Rosenfeld's Spring 1997 CMU course Language and Statistics.
- Boston University (John D. Burger and Lynette Hirschman)
- A good course and web site, by the looks!
- Draft of Data-Intensive Linguistics
- By Chris Brew and Marc Moens.
- Statistical Natural Language Processing course
- By Joakim Nivre. Elsnet supported.
- Short Course: Statistical Methods in NLP
- By Philip Resnik
- Linguist's Guide to Statistics by Brigitte Krenn and Christer Samuelsson.
- Statistical and Corpora Based Methods for Processing Natural Languages
- By Alon Itai, Technion Computer Science Department. (Don't read those old drafts of mine though ... get the real thing!)
- CS 241 Statistical Models in Natural-Language Processing
- Eugene Charniak, Brown University.
- Michael Littman, Duke: 1997, 1998.
"Corpus Linguistics"
- A tutorial on concordances and corpora by Cathy Ball
- Web material accompanying McEnery and Wilson's book on Corpus Linguistics
- Tony Berber Sardinha's Corpus Linguistics course
- Powerpoint slides in an interesting mixture of English and Portuguese (plus the rest of his homepage!)
- Concordancing and corpus linguistics
- Notes prepared by Phil Benson, Hong Kong University.
Mailing lists
Mailing lists that have information on these topics include:
- Corpora
- The main mailing list for info on corpus-based linguistics. Subscribe by sending the message: subscribe corpora to listserv@uib.no. Or if you want to subscribe with a different email address, send: subscribe corpora email-address (Note that you're now speaking to a Majordomo server, not a listserv, so you don't send your name!). Or you can subscribe on the web.
- Empiricist
- The empiricist list appears to be defunct now. You used to send a "subscribe" message to empiricists-request@unagi.cis.upenn.edu.
Other stuff on the Web
General resources
- Linguistic annotation
- A description of formats for linguistic annotation by Steven Bird.
- CTI Textual Studies, University of Oxford, Guide to Digital Resources
- Lists text analysis tools, corpora, and other stuff.
- U. Essex W3-Corpora
- Lots of teaching material, links, and online corpora.
- Computational Linguistics and NLP (Kenji Kita, Tokushima U.)
- A good well organized list of CL references, concentrating on corpus-based and statistical NLP methods. See also Software tools for NLP.
- HLT Central
- European Human Language Technology site
- Survey of the State of the Art in Human Language Technology
- ACL SIGLEX list of Lexical Resources
- Online materials for a course on Learning Dynamical Systems at Brown University.
- Lots of neat info.
- Expert Advisory Group for Language Engineering Standards (EAGLES) home page
- European standards organization.
- Materials prepared for Michael Barlow's Corpus Linguistics course
- Corpus Linguistics University of Birmingham
- Chris Brew's Teaching Materials for statistical NLP
- Not much there last time I looked; you might also try his home page.
- Edinburgh LTG HelpDesk's FAQ
- Many of the questions in the FAQ concern issues related to corpora and tagging.
- Content Analysis Resources
- Qualitative Text Analysis, Concordances, etc.
Information Retrieval
- The SMART IR system
- ACM SIGIR
- Managing Gigabytes
- TREC conference
- Text-based Intelligent Systems (Bruce Croft)
Information Extraction/Wrapper Induction
- Introduction to Information Extraction Technology. A tutorial by Douglas E. Appelt and David Israel.
- Web -> KB. CMU World Wide Knowledge Base project (Tom Mitchell). Has a lot of the best recent probabilistic model IE work, and links to data sets.
- RISE: Repository of Online Information Sources Used in Information Extraction Tasks, including links to people, papers, and many widely used data sets, etc. (Ion Muslea)
- Message Understanding Conference (MUC) information. A US government funded information extraction exercise (from the 1990s).
- Web IR and IE (Einat Amitay). Various links on IR and IE on the web.
People's homepages
Home pages with something useful on them.
- University of Texas at Austin Machine Learning Research Group
- Steven Abney (until 1997)
- Adam Berger
- Various stuff on statistical MT and maximum entropy models
- Alex Chengyu Fang
- Provides a lot of info on the kinds of things they get up to at UCL, without actually giving you anything to play with yourself.