JYuan Learning Log: Notes on Solr Query and Relevancy Tuning

How do I give a negative (or very low) boost to documents that match a query?
http://wiki.apache.org/solr/SolrRelevancyFAQ
Correct way:
q = foo^100 bar^100 (*:* -xxx)^999

==> The following query doesn't work, because it still boosts matching documents a little bit.
q = foo^100 bar^100 xxx^0.00001 # NOT WHAT YOU WANT
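
A minimal sketch of how the "boost everything that does not match" pattern could be assembled in client code. The helper name demote_query and its arguments are hypothetical; only the shape of the resulting query string comes from the FAQ.

```python
def demote_query(boosted_terms, demoted_term, term_boost=100, demote_boost=999):
    """Build a query string that demotes documents matching demoted_term.

    Hypothetical helper for illustration; not part of Solr or any client library.
    """
    boosted = " ".join(f"{term}^{term_boost}" for term in boosted_terms)
    # (*:* -xxx) matches every document except those containing xxx, so
    # documents WITHOUT the term get the large boost and the rest fall behind.
    return f"{boosted} (*:* -{demoted_term})^{demote_boost}"


print(demote_query(["foo", "bar"], "xxx"))
```

Real user input would also need query-syntax escaping, which this sketch leaves out.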

Scoring
Term frequency—tf: The more times a term is found in a document's field, the higher the score it gets. This concept is most intuitive.
Co-ordination factor—coord: The greater the number of query clauses that
match a document, the greater the score will be. Any mandatory clauses
must match and the prohibited ones must not match, leaving the relevance of
this piece of the score to situations where there are optional clauses.
Field length—fieldNorm: The shorter the matching field is, measured in
number of indexed terms, the greater the matching document's score will
be. Norms for a field can be marked as omitted in the
schema with the omitNorms attribute, effectively neutralizing this component
of the score.
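
To make the three factors concrete, here is a toy sketch that assumes the formulas of Lucene's classic TF-IDF similarity (square-root term frequency, overlap ratio for coord, inverse square root of field length). The function names are my own, and real Lucene stores norms in a lossy compressed form and multiplies in other factors such as idf and boosts, so treat the numbers as illustrative only.

```python
import math


def tf(freq):
    # More occurrences of the term in the field -> higher score (dampened by sqrt).
    return math.sqrt(freq)


def coord(matched_clauses, total_clauses):
    # Fraction of the optional query clauses that the document matched.
    return matched_clauses / total_clauses


def field_norm(num_indexed_terms):
    # Shorter fields score higher. With omitNorms this factor is neutralized.
    return 1.0 / math.sqrt(num_indexed_terms)


# A 4-term field outscores a 16-term field for the same match.
print(field_norm(4), field_norm(16))
```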

Alternative Scoring Models
The scoring factors described above relate to Lucene's default scoring model. It's
known as the Vector Space Model, also referred to as simply TF-IDF due to its
most prominent components.

In Lucene the relevance model is implemented by a Similarity subclass, and
Solr provides a SimilarityFactory for each one.

Query-time and index-time boosting
fuzzy query
a_name:Smashing~
For the fuzzy query case seen here, you could use dismax's bq parameter and give it a non-fuzzy version of the user's query. That has the effect of boosting exact matches above fuzzy ones.
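
A rough sketch of what such a request's parameters might look like. It assumes the edismax parser so that the fuzzy operator in q is actually parsed, and the boost value of 2 is an arbitrary placeholder. The helper name is hypothetical.

```python
from urllib.parse import urlencode


def fuzzy_with_exact_boost(field, text, exact_boost=2):
    """Hypothetical helper: fuzzy main query plus a non-fuzzy boost query."""
    return {
        "defType": "edismax",                   # assumption: edismax parses the ~ operator
        "q": f"{field}:{text}~",                # fuzzy version of the user's query
        "bq": f"{field}:{text}^{exact_boost}",  # exact version, used only for boosting
    }


params = fuzzy_with_exact_boost("a_name", "Smashing")
print(urlencode(params))
```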

Lucene's DisjunctionMaxQuery
fieldA:rock^2 OR fieldB:rock^1.2 OR fieldC:rock^0.5

The difference between that boolean OR query and DisjunctionMaxQuery is only in the scoring.
If the intention is to search for the same text across multiple fields, then it's better to
use the maximum sub-clause score rather than the sum. Dismax takes the max,
whereas a boolean query uses the sum.

The dismax query parser has a tie parameter, which ranges from zero (the default)
to one. Raising this value above zero makes it a tie-breaker that gives an edge to
a document that matched a term in multiple fields versus one. At the highest value
of 1, dismax scores very similarly to a boolean query.
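
The max-versus-sum difference and the effect of tie can be sketched with plain arithmetic. This follows the documented behavior of DisjunctionMaxQuery (best sub-score plus tie times the sum of the other sub-scores). The sub-clause scores below are made-up numbers, not real Solr output.

```python
def boolean_or_score(clause_scores):
    # A boolean OR query adds up the scores of all matching clauses.
    return sum(clause_scores)


def dismax_score(clause_scores, tie=0.0):
    # Best clause wins; the remaining clauses contribute only tie * their sum.
    if not clause_scores:
        return 0.0
    best = max(clause_scores)
    return best + tie * (sum(clause_scores) - best)


scores = [2.0, 1.2, 0.5]  # hypothetical scores for fieldA, fieldB, fieldC
print(dismax_score(scores))           # tie=0: only the best field counts
print(dismax_score(scores, tie=0.1))  # small edge for multi-field matches
print(dismax_score(scores, tie=1.0))  # same arithmetic as the boolean sum
```

With tie=1 the arithmetic equals the plain sum; the real boolean query can still differ slightly because other factors such as coord also apply to it.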

Boosting: Automatic phrase boosting
+(Billy Joel) "Billy Joel"

Configuring automatic phrase boosting
Automatic phrase boosting is not enabled by default. In order to use this feature, you
must use the pf parameter, which is an abbreviation of "phrase fields".
You should start with the same value as qf and then make adjustments.
Common reasons to vary pf from qf:
• To use different (typically lower) boost factors so that the impact of phrase
boosting isn't overpowering.

Start with the same value used as qf, but with boosts cut in half. Remove fields that are always one
term, such as an identifier. Use common-grams or shingling, as described in Chapter 10, Scaling Solr, to
increase performance.
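
The "start from qf, halve the boosts, drop single-term fields" rule of thumb can be sketched as a small helper. The function name and the field names a_alias and a_id are hypothetical and only serve the illustration.

```python
def derive_pf(qf, single_term_fields=()):
    """Derive a starting pf value from a qf string such as 'a_name^2 a_alias'."""
    parts = []
    for spec in qf.split():
        field, _, boost = spec.partition("^")
        if field in single_term_fields:
            continue  # phrase matching is pointless on always-one-term fields
        halved = float(boost or 1.0) / 2  # a field without a boost counts as 1.0
        parts.append(f"{field}^{halved:g}")
    return " ".join(parts)


print(derive_pf("a_name^2 a_alias^1 a_id^4", single_term_fields={"a_id"}))
# prints: a_name^1 a_alias^0.5
```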

Phrase slop configuration
"Billy Joel"~1
dismax adds two parameters to automatically set the slop: qs for any explicit phrase
queries that the user entered and ps for the phrase boosting mentioned previously.
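
As a sketch, a request that sets both slop parameters might be put together like this. The helper name, the qf/pf values, and the slop of 1 are placeholders chosen for the example.

```python
def phrase_slop_params(user_query, qf, pf, qs=1, ps=1):
    """Hypothetical helper that bundles dismax phrase-slop settings."""
    return {
        "defType": "dismax",
        "q": user_query,
        "qf": qf,
        "pf": pf,
        "qs": str(qs),  # slop for phrases the user typed explicitly in q
        "ps": str(ps),  # slop for the automatic phrase boost driven by pf
    }


params = phrase_slop_params('"Billy Joel"', "a_name^2", "a_name^1")
print(params)
```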

Partial phrase boosting

Boosting: Boost queries
bq=r_type:Album^2 (*:* -r_type:Compilation)^2 r_official:Official^2

Boosting: Boost functions

boost=recip(map(rord(r_event_date_earliest),0,0,99000),1,95000,95000)
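
To see what that boost function computes, here is a plain-Python sketch of the two documented function-query formulas: map(x,min,max,target) replaces values inside [min,max] with target, and recip(x,m,a,b) is a/(m*x+b). The rord value (reverse ordinal, so the most recent dates get the smallest numbers) is passed in directly, and reading 0 as "no date" is my assumption about why the map step is there.

```python
def solr_map(x, lo, hi, target):
    # map(x,min,max,target): values within [lo, hi] become target, others pass through.
    return target if lo <= x <= hi else x


def solr_recip(x, m, a, b):
    # recip(x,m,a,b) = a / (m*x + b)
    return a / (m * x + b)


def date_boost(rord_value):
    # Mirrors recip(map(rord(...),0,0,99000),1,95000,95000).
    return solr_recip(solr_map(rord_value, 0, 0, 99000), 1, 95000, 95000)


for r in (1, 1000, 95000, 0):
    print(r, date_boost(r))
```

The boost stays close to 1 for the newest documents, falls to 0.5 at an ordinal of 95000, and a value of 0 is treated like an ordinal of 99000.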

Wednesday, February 26, 2014