Monday, March 3, 2014

Notes on Solr Entity Extraction

http://info.basistech.com/blog/bid/235459/Enhancing-Apache-Solr-with-Entity-Extraction
http://searchhub.org/2013/06/27/poor-mans-entity-extraction-with-solr/
Acronym Extraction
<fieldType name="caps" class="solr.TextField" sortMissingLast="true" omitNorms="true">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.PatternCaptureGroupFilterFactory"
            pattern="((?:[A-Z]\.?){3,})" preserve_original="false"
    />
  </analyzer>
</fieldType>
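To see what the pattern in the "caps" field type picks up, here is a minimal sketch that applies the same regular expression with plain java.util.regex. It only approximates what PatternCaptureGroupFilterFactory does inside Solr; the class name AcronymSketch and the helper acronyms are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration, not part of Solr.
final class AcronymSketch {
    // Same pattern as the "caps" field type: three or more capital letters,
    // each optionally followed by a period.
    static final Pattern CAPS = Pattern.compile("((?:[A-Z]\\.?){3,})");

    // Collects every match of capture group 1 in the text, in order.
    static List<String> acronyms(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = CAPS.matcher(text);
        while (m.find()) {
            found.add(m.group(1));
        }
        return found;
    }

    public static void main(String[] args) {
        // "UN." is skipped because the pattern needs at least three capitals.
        System.out.println(acronyms("The U.S.A. and NATO met at the UN."));
    }
}
```

Note that two-letter acronyms are ignored because of the {3,} quantifier; lowering it to {2,} would also catch ordinary capitalized initials.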
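Link Extraction
With a regular expression in an update script (getMatches is a helper that is not shown in these notes):
var links_regexp =
    /(https?:\/\/[a-zA-Z\-_0-9.]+(?:\/[a-zA-Z\-_0-9.]+)*\/?)/g;
doc.setField("links", getMatches(links_regexp, content));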
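The getMatches helper is not included in these notes. As a rough sketch of what such a helper could look like, here is a Java version using the same links regular expression. The class name LinkSketch and both method signatures are assumptions for illustration, not the code from the original script.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration of a getMatches-style helper.
final class LinkSketch {
    // Same expression as links_regexp, written as a Java string literal.
    static final Pattern LINKS =
        Pattern.compile("(https?://[a-zA-Z\\-_0-9.]+(?:/[a-zA-Z\\-_0-9.]+)*/?)");

    // Returns every non-overlapping match of the pattern, in document order.
    static List<String> getMatches(Pattern pattern, String content) {
        List<String> matches = new ArrayList<>();
        Matcher m = pattern.matcher(content);
        while (m.find()) {
            matches.add(m.group(1));
        }
        return matches;
    }

    // Convenience wrapper for the links pattern.
    static List<String> links(String content) {
        return getMatches(LINKS, content);
    }

    public static void main(String[] args) {
        System.out.println(links("See http://example.com/a/b and https://solr.apache.org/ for more"));
    }
}
```

Because the character class includes ".", a URL that ends a sentence will pick up the trailing period, so the result may need trimming. The pattern also stops at query strings and fragments.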
<fieldType name="urls" class="solr.TextField" sortMissingLast="true" omitNorms="true">
  <analyzer>
    <tokenizer class="solr.UAX29URLEmailTokenizerFactory"/>
    <filter class="solr.TypeTokenFilterFactory"
            types="url_type.txt" useWhitelist="true"/>
  </analyzer>
</fieldType>
Here url_type.txt contains the literal text <URL>. The same approach works for email addresses, with an email_type.txt that contains the literal text <EMAIL>.

Use the Stanford NER CRF classifier: http://nlp.stanford.edu/software/CRF-NER.shtml

*Named Entity Recognition (NER) and Solr*
http://www.searchbox.com/named-entity-recognition-ner-in-solr/

http://nlp.stanford.edu/software/jenny-ner-2007.pdf
http://blog.csdn.net/limisky/article/details/17025861
http://limisky.0fees.net/?page_id=2
https://code.google.com/p/uima-nerc/
NERDemo.java
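AbstractSequenceClassifier<CoreLabel> classifier = CRFClassifier.getClassifierNoExceptions(serializedClassifier);
List<List<CoreLabel>> out = classifier.classify(fileContents);  // labeled tokens, grouped by sentence
classifier.classifyToString(s1);               // word/LABEL output
classifier.classifyToString(s2, "xml", true);  // XML output, preserving spacing
classifier.classifyWithInlineXML(s2);          // entities wrapped in inline tags such as <PERSON>...</PERSON>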
annotatedText = ner.getAnnotatedText(text, useXML)
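
To get entity values into separate Solr fields, the inline-XML output of classifyWithInlineXML has to be split by tag. A minimal sketch of that step, using only the JDK: the class InlineXmlSketch and the method entities are hypothetical, and the sketch assumes tags are upper-case names such as PERSON, ORGANIZATION and LOCATION that are never nested.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration, not part of Stanford NER.
final class InlineXmlSketch {
    // Matches <TAG>text</TAG>, where the closing tag repeats the opening one.
    static final Pattern ENTITY = Pattern.compile("<([A-Z]+)>(.*?)</\\1>");

    // Groups entity strings by tag, keeping first-seen order of tags.
    static Map<String, List<String>> entities(String inlineXml) {
        Map<String, List<String>> byType = new LinkedHashMap<>();
        Matcher m = ENTITY.matcher(inlineXml);
        while (m.find()) {
            byType.computeIfAbsent(m.group(1), k -> new ArrayList<>()).add(m.group(2));
        }
        return byType;
    }

    public static void main(String[] args) {
        String tagged = "<PERSON>Rajat Raina</PERSON> studies at "
            + "<ORGANIZATION>Stanford University</ORGANIZATION> in <LOCATION>California</LOCATION>.";
        System.out.println(entities(tagged));
    }
}
```

Each map entry could then be written to a multi-valued field, for example one field per entity type.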

A typical implementation also implements the ResourceLoaderAware interface, so that its internal components can load the resources they need.
http://nlp.stanford.edu/software/crf-faq.shtml#extend
Can an existing model be extended?

*Unfortunately, no.*
