Wednesday, January 15, 2014

Stemming and Lemmatisation

Lemmatisation is the process of grouping together the different inflected forms of a word so they can be analysed as a single item. It's the algorithmic process of determining the lemma for a given word.

The base form of a word, the form that one might look up in a dictionary, is called the lemma (词条) of the word. The combination of the base form with the part of speech is often called the lexeme (词位) of the word.

A stemmer operates on a single word without knowledge of the context, and therefore cannot discriminate between words that have different meanings depending on part of speech. However, stemmers are typically easier to implement and run faster, and the reduced accuracy may not matter for some applications.
The purpose of stemming is not to produce the appropriate lemma; that is a more challenging task that requires knowledge of context. The main purpose of stemming is to map different forms of a word to a single form.
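
To make the point about mapping different forms to a single form concrete, here is a minimal sketch of a suffix-stripping stemmer. The suffix list and the name toy_stem are made up for illustration; this is not the Porter algorithm or any library's stemmer, only a toy that shows the general idea and its limits.

```python
# Toy suffix-stripping stemmer (hypothetical; not the Porter algorithm).
SUFFIXES = ("ing", "ed", "es", "s")  # checked in this order


def toy_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


# All inflected forms of "walk" collapse to one stem.
forms = ["walk", "walks", "walked", "walking"]
stems = {toy_stem(w) for w in forms}
print(stems)

# The stem need not be a dictionary word, and context is ignored:
print(toy_stem("ponies"))   # a non-word stem, not the lemma "pony"
print(toy_stem("meeting"))  # same result whether "meeting" is a noun or a verb
```

The last two calls show why the result is a stem rather than a lemma: the function only looks at the characters of one word, so it can neither consult a dictionary nor use the part of speech.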

Lemmatisation is closely related to stemming. Unlike stemming, which operates on a single word at a time, lemmatisation operates on the full text and can therefore discriminate between words that have different meanings depending on part of speech.
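
As a rough illustration of how the part of speech changes the answer, the sketch below uses a tiny hand-written lookup table keyed by word and part of speech. The table, the tag names and the function toy_lemmatize are all assumptions made for this example; a real lemmatiser such as the one in Stanford CoreNLP would derive the part of speech from the surrounding sentence instead of taking it as an argument.

```python
# Hypothetical lookup table keyed by (word, part of speech).
LEMMAS = {
    ("meeting", "NOUN"): "meeting",
    ("meeting", "VERB"): "meet",
    ("better", "ADJ"): "good",
}


def toy_lemmatize(word: str, pos: str) -> str:
    """Return the lemma for (word, pos); fall back to the lowercased word."""
    key = (word.lower(), pos)
    return LEMMAS.get(key, word.lower())


# "We have a meeting today."  -> noun
print(toy_lemmatize("meeting", "NOUN"))
# "We are meeting tomorrow."  -> verb
print(toy_lemmatize("meeting", "VERB"))
```

The same surface form gets two different lemmas, which a stemmer working on the isolated word cannot do.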

Stanford CoreNLP
http://nlp.stanford.edu/software/corenlp.shtml

Lemmatizer
https://wiki.searchtechnologies.com/index.php/Lemmatizer

LemmaGen
http://lemmatise.ijs.si/

KeywordRepeatFilter and RemoveDuplicatesTokenFilterFactory
Solr 4.3 added the KeywordRepeatFilterFactory to support indexing both the original and the stemmed form of a word. This filter emits two tokens for each input token, one of which is marked with the Keyword attribute. Stemmers that respect the keyword attribute pass the marked token through unchanged, so the net effect is that both the original word and its stemmed version are indexed. The four stemmers referred to in the Solr documentation (not reproduced here) all respect the keyword attribute.

For terms that are not changed by stemming, this will result in duplicate, identical tokens in the document. This can be alleviated by adding the RemoveDuplicatesTokenFilterFactory.
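
The behaviour described above can be simulated in a few lines of Python. This is only a sketch of the idea, not Solr or Lucene code: the Token class, the three filter functions and the toy stemmer are hypothetical stand-ins for KeywordRepeatFilterFactory, a keyword-aware stemmer and RemoveDuplicatesTokenFilterFactory.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    text: str
    position: int
    keyword: bool = False  # stands in for the Keyword attribute


def keyword_repeat(tokens):
    """Emit each token twice: a keyword-marked copy, then an unmarked copy."""
    for tok in tokens:
        yield Token(tok.text, tok.position, keyword=True)
        yield Token(tok.text, tok.position, keyword=False)


def stem_filter(tokens, stem):
    """Stem unmarked tokens; pass keyword-marked tokens through unchanged."""
    for tok in tokens:
        if tok.keyword:
            yield tok
        else:
            yield Token(stem(tok.text), tok.position, tok.keyword)


def remove_duplicates(tokens):
    """Drop a token whose text was already seen at the same position."""
    seen = set()
    for tok in tokens:
        key = (tok.position, tok.text)
        if key not in seen:
            seen.add(key)
            yield tok


def toy_stem(word):
    """Toy stemmer: strip a trailing 'ing' from longer words only."""
    return word[:-3] if word.endswith("ing") and len(word) > 5 else word


def analyze(words):
    tokens = (Token(w, i) for i, w in enumerate(words))
    chain = remove_duplicates(stem_filter(keyword_repeat(tokens), toy_stem))
    return [(t.position, t.text) for t in chain]


print(analyze(["walking", "dog"]))
```

In this simulation "walking" is indexed as both "walking" and "walk" at the same position, while "dog", which the stemmer leaves unchanged, appears only once because the duplicate is removed.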
