JYuan Learning Log: Learning Solr in Action

Hierarchical matching

The More Like This Handler in Solr is able to take in any document, extract the interesting terms from the document, and automatically use those terms as a keyword search to find similar documents. It internally extracts the interesting terms from a document by treating the document as a term vector and extracting the highest matching terms based upon a tf-idf similarity calculation. It can then use those top-ranking terms as a query for other similar documents.
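To make the "interesting terms" idea concrete, here is a minimal sketch of tf-idf term ranking in plain Python. It is not Solr's or Lucene's actual scoring formula; the function name `interesting_terms`, the toy corpus, and the plain `tf * log(N / df)` weighting are all assumptions chosen for illustration.

```python
import math
from collections import Counter

def interesting_terms(doc_tokens, corpus, top_n=3):
    """Rank the terms of one document by a simple tf-idf score."""
    num_docs = len(corpus)
    # Document frequency: number of corpus documents containing each term.
    doc_freq = Counter(term for doc in corpus for term in set(doc))
    term_freq = Counter(doc_tokens)
    scores = {
        term: tf * math.log(num_docs / doc_freq[term])
        for term, tf in term_freq.items()
        if doc_freq[term] > 0
    }
    # Highest score first; ties are broken alphabetically for stable output.
    return sorted(scores, key=lambda t: (-scores[t], t))[:top_n]

# Toy corpus (hypothetical data): each document is a list of tokens.
corpus = [
    ["java", "developer", "j2ee", "spring"],
    ["java", "developer", "android"],
    ["nurse", "hospital", "developer"],
]
print(interesting_terms(corpus[0], corpus, top_n=2))
```

Terms that appear in every document (here "developer") score zero, which is why rare terms such as "j2ee" rise to the top.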

http://localhost:8983/solr/jobs/mlt/?
df=jobdescription&
q=J2EE&
mlt.fl=jobtitle,jobdescription
This query will run a search for documents containing the keyword J2EE, find the top matching document, and then perform statistical analysis on the text in the jobtitle and jobdescription fields. It is important to note that the More Like This functionality requires that any fields used for statistical analysis (specified in the mlt.fl parameter) either have termVectors="true" or stored="true" set in the schema.xml. Enabling term vectors is faster, because otherwise the More Like This implementation has to process the stored content to build term vectors at query time.
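The request above can also be assembled programmatically. This is a small sketch using only the Python standard library; `mlt_url` is a hypothetical helper name, and the host, core name, and field names are simply the ones from the example request.

```python
from urllib.parse import urlencode

def mlt_url(base, query, default_field, mlt_fields):
    """Build a More Like This request URL from its parts."""
    params = {
        "df": default_field,             # default field for the seed query
        "q": query,                      # seed query; the top hit is the source document
        "mlt.fl": ",".join(mlt_fields),  # fields analyzed for interesting terms
    }
    # safe="," keeps the comma in mlt.fl readable instead of encoding it as %2C.
    return base + "?" + urlencode(params, safe=",")

url = mlt_url(
    "http://localhost:8983/solr/jobs/mlt",
    "J2EE",
    "jobdescription",
    ["jobtitle", "jobdescription"],
)
print(url)
```

Building the URL with `urlencode` avoids hand-escaping problems once the seed query contains spaces or quotes.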

Set mlt.interestingTerms=details to bring back information about which terms were used for the recommendations query.

It’s also possible to use the tf-idf calculation to represent per-term boosts by specifying the mlt.boost=true parameter.
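As a rough illustration of what per-term boosts mean, the sketch below turns a set of term weights into a boosted Lucene-style query string. This is not Solr's internal boost computation; `boosted_query`, the sample weights, and the choice to normalize against the strongest term are assumptions made for the example.

```python
def boosted_query(field, term_weights):
    """Join weighted terms into a query such as field:term^0.50."""
    # Normalize so the strongest term gets boost 1.00 (an illustrative choice).
    top = max(term_weights.values())
    ordered = sorted(term_weights.items(), key=lambda kv: (-kv[1], kv[0]))
    clauses = [f"{field}:{term}^{weight / top:.2f}" for term, weight in ordered]
    return " ".join(clauses)

# Hypothetical tf-idf weights for three extracted terms.
print(boosted_query("jobdescription", {"j2ee": 2.0, "java": 1.0, "developer": 0.5}))
```

With boosts in place, documents matching the most distinctive terms rank above documents that only match the weaker ones.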

The extracted terms can be noisy. You can often overcome some of this noise by putting good stop word lists in place. An alternative approach might be to build an analyzer that does part-of-speech analysis and only makes use of nouns from within your text.

It is also possible to make use of the More Like This functionality as a search component along with a typical search query, as opposed to hitting a separate request handler.

Solr’s clustering component can enable you to find similarities between documents that can ultimately be used to find related concepts not necessarily present in your initial query or document.

q=.Net Jobs OR ("software engineer" OR "c#" OR ".net developer" OR "developer")^0.25
q=.Net Jobs AND (*:* OR "software engineer" OR "c#" OR ".net developer" OR "developer")
q=.Net Jobs AND ("software engineer" OR "c#" OR ".net developer" OR "developer")
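The three query shapes above differ only in how the related phrases are attached to the original query. The sketch below generates each variant; `expand_query` and the mode names `boost`, `rerank`, and `filter` are hypothetical labels, and the comments give one reading of the query syntax rather than the book's own wording.

```python
def expand_query(original, related, mode="boost", weight=0.25):
    """Attach related phrases to a query in one of three ways."""
    alternatives = " OR ".join(f'"{phrase}"' for phrase in related)
    if mode == "boost":
        # Related phrases widen the match set but count at reduced weight.
        return f"{original} OR ({alternatives})^{weight}"
    if mode == "rerank":
        # *:* matches everything, so the phrases only influence ranking.
        return f"{original} AND (*:* OR {alternatives})"
    if mode == "filter":
        # Results must match the original query and at least one phrase.
        return f"{original} AND ({alternatives})"
    raise ValueError(f"unknown mode: {mode}")

related = ["software engineer", "c#", ".net developer", "developer"]
for mode in ("boost", "rerank", "filter"):
    print("q=" + expand_query(".Net Jobs", related, mode))
```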

http://localhost:8983/solr/jobs/clustering?
q=content:(solr OR lucene)&
rows=100&
carrot.title=jobtitle&
carrot.snippet=jobtitle&
LingoClusteringAlgorithm.desiredClusterCountBase=25

Collaborative filtering
Collaborative filtering makes use of collective intelligence, or the wisdom of the crowd, to enable your users to effectively tune the algorithm themselves based upon their behavior. In practice, the algorithm is outsourcing the similarity ranking of documents to your users, allowing their actions to adjust the relevancy weighting on a per-item basis.

If you did want to pull the users back with weights based upon how many documents they appear in, you could facet on the user field (&facet=true&facet.field=user&facet.mincount=1).
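The counting behind this kind of recommendation can be sketched outside Solr. The example below ranks items by how many users acted on both them and a target item, which is loosely analogous to reading facet counts; `recommend` and the user/job data are hypothetical, and a real setup would run this as Solr queries rather than in memory.

```python
from collections import Counter, defaultdict

def recommend(target_item, actions, top_n=2):
    """Rank other items by the number of users shared with target_item."""
    items_by_user = defaultdict(set)
    for user, item in actions:
        items_by_user[user].add(item)
    overlap = Counter()
    for items in items_by_user.values():
        if target_item in items:
            # Every other item this user touched gets one vote.
            overlap.update(items - {target_item})
    return sorted(overlap.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]

# Hypothetical (user, item) behavior log.
actions = [
    ("alice", "job1"), ("alice", "job2"), ("alice", "job3"),
    ("bob", "job1"), ("bob", "job2"),
    ("carol", "job2"), ("carol", "job4"),
]
print(recommend("job1", actions))
```

Here job2 outranks job3 because two users who acted on job1 also acted on job2, while only one also acted on job3.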

Lemmatization is the process of determining the root form of a word. In contrast with stemming, which tries to algorithmically find a common base for a word, lemmatization typically makes use of dictionaries to find the root form of a word.
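A toy comparison shows the difference. Neither function below is a real stemmer or lemmatizer: `naive_stem` strips a few suffixes by rule, `lemmatize` looks words up in a tiny hand-made dictionary, and both the names and the dictionary contents are assumptions for illustration only.

```python
# Hypothetical lemma dictionary; real lemmatizers use large linguistic resources.
LEMMAS = {"ran": "run", "running": "run", "mice": "mouse", "better": "good"}

def naive_stem(word):
    """Rule-based: strip a known suffix if enough of the word remains."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def lemmatize(word):
    """Dictionary-based: return the known root form, else the word itself."""
    return LEMMAS.get(word, word)

for word in ("running", "mice", "jobs"):
    print(word, naive_stem(word), lemmatize(word))
```

The rule-based approach handles regular forms like "jobs" without any dictionary but misses irregular ones like "mice", while the dictionary approach gets irregular forms right only for words it has entries for.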

Morphological analysis tools, using statistical NLP techniques to learn about a language structure from a large corpus of text from that language, can perform quite well.

Thursday, November 6, 2014
