http://discovery-grindstone.blogspot.com/2014/01/cjk-with-solr-for-libraries-part-7.html
Equate Traditional Characters With Simplified Characters
The CJKBigramFilter, new with Solr 3.6, allows us to generate both the unigrams and the bigrams for CJK scripts only.
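To make the unigram-plus-bigram output concrete, here is a minimal Python sketch of the idea. It is not the CJKBigramFilter itself: the function name is made up for illustration, and it assumes the input is already a single run of CJK characters, whereas the real filter depends on the token types assigned by the tokenizer and leaves non-CJK tokens alone.

```python
def cjk_unigrams_and_bigrams(text):
    """Return (unigrams, bigrams) for a run of CJK characters.

    Hypothetical helper for illustration only; it does not check scripts
    or token types the way the real filter does.
    """
    chars = list(text)
    # Every character is its own unigram.
    unigrams = chars
    # Every pair of adjacent characters forms an overlapping bigram.
    bigrams = [a + b for a, b in zip(chars, chars[1:])]
    return unigrams, bigrams


unigrams, bigrams = cjk_unigrams_and_bigrams("中国哲学")
print(unigrams)  # ['中', '国', '哲', '学']
print(bigrams)   # ['中国', '国哲', '哲学']
```

Indexing both means a single-character query can still match through the unigrams, while longer queries can match on the overlapping bigrams.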
The CJKBigramFilter must be fed appropriate values for the token type.
a. ICUTokenizer
b. StandardTokenizer
Note that the StandardTokenizer output does not separate the Hangul character from the Latin characters at the end.
c. ClassicTokenizer
I believe ClassicTokenizer will also assign token types, probably the same way StandardTokenizer does.
2. ICU Script Translations
Solr makes the following script translations available via the solr.ICUTransformFilterFactory:
Han Traditional <--> Simplified
Katakana <--> Hiragana
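As a rough illustration of what equating Katakana with Hiragana means, the Python sketch below shifts the main Katakana block onto the corresponding Hiragana code points. This is only an approximation under my own assumptions: the function name is hypothetical, it handles just the range U+30A1 to U+30F6, and it is not what the ICU transform does internally (ICU handles more cases, such as the prolonged sound mark).

```python
KATAKANA_FIRST = 0x30A1  # small katakana A
KATAKANA_LAST = 0x30F6   # small katakana KE
KANA_OFFSET = 0x60       # distance between the katakana and hiragana blocks


def katakana_to_hiragana(text):
    """Map katakana in U+30A1..U+30F6 to hiragana; leave everything else alone."""
    return "".join(
        chr(ord(ch) - KANA_OFFSET)
        if KATAKANA_FIRST <= ord(ch) <= KATAKANA_LAST
        else ch
        for ch in text
    )


print(katakana_to_hiragana("カタカナ"))  # かたかな
```

Applying the same mapping at index time and query time is what lets a query typed in one kana script match text written in the other.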
3. ICU Folding Filter
We have already been using solr.ICUFoldingFilterFactory for case folding, e.g. normalizing "A" to "a".
It's an efficient single pass through the string. For practical purposes this means you can use this factory as a better substitute for the combined behavior of ASCIIFoldingFilter, LowerCaseFilter, and ICUNormalizer2Filter.
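For a feel of what "combined behavior" means here, the following Python sketch chains the three kinds of steps using only the standard library. It is an approximation I am assuming for illustration, not ICU's folding rules: the function name is hypothetical, it makes several passes rather than one, and ICU's folding covers many more characters.

```python
import unicodedata


def approximate_fold(text):
    """Rough stand-in for normalization + accent folding + lowercasing."""
    # Compatibility-decompose so accents become separate combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop the combining marks (roughly what ASCII folding does for accents).
    without_marks = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    # Case-fold, e.g. "A" becomes "a".
    return without_marks.casefold()


print(approximate_fold("A"))     # a
print(approximate_fold("Café"))  # cafe
```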
Solr Fieldtype Definition
positionIncrementGap attribute on fieldType
This setting is all about trying to keep your matches within a single field value for a multivalued field.
The values in a multivalued field are stored adjacently, and this setting is the number of pretend tokens between the field values. Thus, a large value keeps phrase queries from matching some words at the end of one field value and some words at the beginning of the following field value.
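The effect of the gap can be simulated in a few lines of Python. This is a toy model under my own assumptions (whitespace tokenization, exact phrase matching, hypothetical function names and sample values), not how Lucene stores positions internally, but it shows why a large gap stops a phrase from matching across two values.

```python
def index_positions(values, gap):
    """Assign token positions across the values of a multivalued field."""
    positions = {}
    next_pos = 0
    for value in values:
        for token in value.split():
            positions.setdefault(token, []).append(next_pos)
            next_pos += 1
        # Leave `gap` pretend positions before the next value starts.
        next_pos += gap
    return positions


def phrase_matches(positions, phrase):
    """True if the phrase tokens occur at consecutive positions."""
    tokens = phrase.split()
    return any(
        all(start + i in positions.get(tok, []) for i, tok in enumerate(tokens))
        for start in positions.get(tokens[0], [])
    )


values = ["art history", "china trade"]  # two values of one field
print(phrase_matches(index_positions(values, 0), "history china"))      # True
print(phrase_matches(index_positions(values, 10000), "history china"))  # False
print(phrase_matches(index_positions(values, 10000), "china trade"))    # True
```

With no gap, "history" (end of the first value) and "china" (start of the second) sit at adjacent positions, so the phrase matches even though no single value contains it.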
autoGeneratePhraseQueries attribute on fieldType
As LUCENE-2458 describes, prior to Solr 3.1, if more than one token was created for whitespace delimited text, then a phrase query was automatically generated. While this behavior is generally desired for European languages, it is not desired for CJK.
ICUTokenizerFactory
The section above on CJKBigram analysis shows that the ICUTokenizer is likely to be better than the StandardTokenizer for tokenizing CJK characters into typed unigrams, as needed by the CJKBigramFilter.
CJKWidthFilterFactory
It may be that this is completely unnecessary, but on the off chance that the script translations don't accommodate half-width characters, I go ahead and normalize them here.
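To see what width normalization does to the characters, Python's standard library can approximate it with Unicode NFKC normalization. This is an assumption for illustration only: NFKC also rewrites other compatibility characters, so it is broader than what the CJKWidthFilter does, and the function name is hypothetical.

```python
import unicodedata


def normalize_width(text):
    """Approximate width folding with NFKC (broader than CJKWidthFilter)."""
    return unicodedata.normalize("NFKC", text)


print(normalize_width("ｶ"))     # カ  (halfwidth katakana to fullwidth)
print(normalize_width("Ｓｏｌｒ"))  # Solr  (fullwidth Latin to basic Latin)
```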
CJK with Solr for Libraries, part 8
http://discovery-grindstone.blogspot.com/2014/01/cjk-with-solr-for-libraries-part-8.html
I coded the above using rspec-solr (http://rubydoc.info/github/sul-dlss/rspec-solr)
http://explain.solr.pl/
mm setting
Our mm setting was 6<-1 6<90%, which translates to: for 1 to 6 clauses, all are required; for more than 6 clauses, 90% (rounded down) are required.
I chose this mm setting for CJK:
3<86%
Per the mm spec, this says for three or fewer "clauses" (tokens), all are required, but for four or more tokens, only 86% (rounded down) are required. This is perfect for CJK queries of 6 or fewer characters, a tad high for 7 characters, and perfect again for 8 characters, and seems to be the best fit available.
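The arithmetic behind these mm settings can be checked with a small Python sketch. This is my own reading of the mm spec, not Solr's code: the function names are hypothetical, percentages are rounded down, and when several conditions apply the last one wins. Note that it counts clauses, and how many clauses a CJK query of a given length produces depends on the analysis chain.

```python
def _apply(clause_count, value):
    """Evaluate a simple mm value such as '86%', '-1', or '2'."""
    if value.endswith("%"):
        percent = int(value[:-1])
        amount = clause_count * abs(percent) // 100  # rounded down
        # A negative percentage says how many clauses may be missing.
        return amount if percent >= 0 else clause_count - amount
    number = int(value)
    # A negative number says how many clauses may be missing.
    return number if number >= 0 else clause_count + number


def min_should_match(clause_count, spec):
    """Number of clauses that must match for a given mm spec."""
    if "<" not in spec:
        return _apply(clause_count, spec)
    required = clause_count  # at or below the first bound, all are required
    for condition in spec.split():
        bound, _, value = condition.partition("<")
        if clause_count <= int(bound):
            return required
        required = _apply(clause_count, value)
    return required


for n in range(1, 9):
    print(n, min_should_match(n, "3<86%"))
# 1 1, 2 2, 3 3, 4 3, 5 4, 6 5, 7 6, 8 6
```

Under this reading, 3<86% lets exactly one clause go unmatched for 4 to 7 clauses and two for 8 clauses, while the earlier 6<-1 6<90% still requires all clauses up to 6.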
http://discovery-grindstone.blogspot.com/2014/01/cjk-with-solr-for-libraries-part-9.html
qs setting for phrase searches
Dismax and edismax have a "query phrase slop" parameter, qs, which is the distance allowed between tokens when the query has explicitly indicated a phrase search with quotation marks. Probably from back in our stopword days, we use a setting of qs=1, meaning a query of "women's literature", with the quotes, is allowed to match results containing 'women and literature' as well as 'women in literature' in addition to 'women's literature'. Because of the magic of pf sorting the best matches first, this has worked just fine for our users up until now. However, with CJK queries, this is undesirable -- an explicit phrase query in CJK should only match the exact characters entered, with nothing inserted between them: qs=0
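A toy Python model shows the difference between qs=1 and qs=0. It rests on simplifying assumptions of mine: tokens are already analyzed, only in-order matches are considered (Lucene's slop can also allow reordering at higher values), and the function name is hypothetical.

```python
def sloppy_phrase_match(doc_tokens, phrase_tokens, slop):
    """True if the phrase occurs in order with at most `slop` extra
    positions in total between its tokens."""

    def extend(prev_pos, idx, remaining):
        if idx == len(phrase_tokens):
            return True
        # The next phrase token may sit up to `remaining` positions further on.
        last = min(len(doc_tokens), prev_pos + 2 + remaining)
        for pos in range(prev_pos + 1, last):
            if doc_tokens[pos] == phrase_tokens[idx]:
                skipped = pos - prev_pos - 1
                if extend(pos, idx + 1, remaining - skipped):
                    return True
        return False

    return any(
        tok == phrase_tokens[0] and extend(start, 1, slop)
        for start, tok in enumerate(doc_tokens)
    )


doc = ["women", "and", "literature"]
query = ["women", "literature"]
print(sloppy_phrase_match(doc, query, slop=1))  # True
print(sloppy_phrase_match(doc, query, slop=0))  # False
```

With slop 0 the phrase tokens must be directly adjacent, which is the behavior wanted for quoted CJK queries.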
catch-all field
Halfwidth and fullwidth forms
http://en.wikipedia.org/wiki/Halfwidth_and_fullwidth_forms