|
xapian-core
1.5.1
|
Language Model weighting with Jelinek-Mercer smoothing.
#include <weight.h>
Public Member Functions | |
| LMJMWeight (double lambda=0.0) | |
| Construct a LMJMWeight. | |
| double | get_sumpart (Xapian::termcount wdf, Xapian::termcount doclen, Xapian::termcount uniqterm, Xapian::termcount wdfdocmax) const |
| Calculate the weight contribution for this object's term to a document. | |
| double | get_maxpart () const |
| Return an upper bound on what get_sumpart() can return for any document. | |
| std::string | name () const |
| Return the name of this weighting scheme, e.g. "bm25+". | |
| std::string | serialise () const |
| Return this object's parameters serialised as a single string. | |
| LMJMWeight * | unserialise (const std::string &serialised) const |
| Unserialise parameters. | |
| LMJMWeight * | create_from_parameters (const char *params) const |
| Create from a human-readable parameter string. | |
| Public Member Functions inherited from Xapian::Weight | |
| Weight () | |
| Default constructor, needed by subclass constructors. | |
| virtual | ~Weight () |
| Virtual destructor, because we have virtual methods. | |
| virtual double | get_sumextra (Xapian::termcount doclen, Xapian::termcount uniqterms, Xapian::termcount wdfdocmax) const |
| Calculate the term-independent weight component for a document. | |
| virtual double | get_maxextra () const |
| Return an upper bound on what get_sumextra() can return for any document. | |
Additional Inherited Members | |
| Static Public Member Functions inherited from Xapian::Weight | |
| static const Weight * | create (const std::string &scheme, const Registry &reg=Registry()) |
| Return the appropriate weighting scheme object. | |
| Protected Types inherited from Xapian::Weight | |
| enum | stat_flags { COLLECTION_SIZE = 0 , RSET_SIZE = 0 , AVERAGE_LENGTH = 4 , TERMFREQ = 1 , RELTERMFREQ = 1 , QUERY_LENGTH = 0 , WQF = 0 , WDF = 2 , DOC_LENGTH = 8 , DOC_LENGTH_MIN = 16 , DOC_LENGTH_MAX = 32 , WDF_MAX = 64 , COLLECTION_FREQ = 1 , UNIQUE_TERMS = 128 , TOTAL_LENGTH = 256 , WDF_DOC_MAX = 512 , UNIQUE_TERMS_MIN = 1024 , UNIQUE_TERMS_MAX = 2048 , DB_DOC_LENGTH_MIN = 4096 , DB_DOC_LENGTH_MAX = 8192 , DB_UNIQUE_TERMS_MIN = 16384 , DB_UNIQUE_TERMS_MAX = 32768 , DB_WDF_MAX = 65536 , IS_BOOLWEIGHT_ = static_cast<int>(0x80000000) } |
| Stats which the weighting scheme can use (see need_stat()). | |
| Protected Member Functions inherited from Xapian::Weight | |
| void | need_stat (stat_flags flag) |
| Tell Xapian that your subclass will want a particular statistic. | |
| Weight (const Weight &) | |
| Don't allow copying. | |
| Xapian::doccount | get_collection_size () const |
| The number of documents in the collection. | |
| Xapian::doccount | get_rset_size () const |
| The number of documents marked as relevant. | |
| Xapian::doclength | get_average_length () const |
| The average length of a document in the collection. | |
| Xapian::doccount | get_termfreq () const |
| The number of documents which this term indexes. | |
| Xapian::doccount | get_reltermfreq () const |
| The number of relevant documents which this term indexes. | |
| Xapian::termcount | get_collection_freq () const |
| The collection frequency of the term. | |
| Xapian::termcount | get_query_length () const |
| The length of the query. | |
| Xapian::termcount | get_wqf () const |
| The within-query-frequency of this term. | |
| Xapian::termcount | get_doclength_upper_bound () const |
| An upper bound on the maximum length of any document in the shard. | |
| Xapian::termcount | get_doclength_lower_bound () const |
| A lower bound on the minimum length of any document in the shard. | |
| Xapian::termcount | get_wdf_upper_bound () const |
| An upper bound on the wdf of this term in the shard. | |
| Xapian::totallength | get_total_length () const |
| Total length of all documents in the collection. | |
| Xapian::termcount | get_unique_terms_upper_bound () const |
| An upper bound on the number of unique terms in any document in the shard. | |
| Xapian::termcount | get_unique_terms_lower_bound () const |
| A lower bound on the number of unique terms in any document in the shard. | |
| Xapian::termcount | get_db_doclength_upper_bound () const |
| An upper bound on the maximum length of any document in the database. | |
| Xapian::termcount | get_db_doclength_lower_bound () const |
| A lower bound on the minimum length of any document in the database. | |
| Xapian::termcount | get_db_unique_terms_upper_bound () const |
| An upper bound on the number of unique terms in any document in the database. | |
| Xapian::termcount | get_db_unique_terms_lower_bound () const |
| A lower bound on the number of unique terms in any document in the database. | |
| Xapian::termcount | get_db_wdf_upper_bound () const |
| An upper bound on the wdf of this term in the database. | |
Language Model weighting with Jelinek-Mercer smoothing.
As described in:
Zhai, C., & Lafferty, J.D. (2004). A study of smoothing methods for language models applied to information retrieval. ACM Trans. Inf. Syst., 22, 179-214.
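For orientation, the Jelinek-Mercer estimate from the cited paper can be written as below. The notation is the paper's, not Xapian's: c(w; d) is the count of term w in document d, |d| is the document length, and p(w | C) is the collection language model. None of these symbols are Xapian identifiers.

```latex
p_\lambda(w \mid d) = (1 - \lambda)\, p_{\mathrm{ml}}(w \mid d) + \lambda\, p(w \mid C),
\qquad
p_{\mathrm{ml}}(w \mid d) = \frac{c(w; d)}{|d|}
```

With λ close to 0 the estimate approaches the maximum likelihood model, and with λ close to 1 it approaches the collection model.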
|
inline explicit |
Construct a LMJMWeight.
| lambda | A parameter strictly between 0 and 1 which linearly interpolates between the maximum likelihood model (the limit as λ→0) and the collection model (the limit as λ→1). |
Values of λ around 0.1 are apparently optimal for short queries and around 0.7 for long queries. If lambda is out of range (i.e. <= 0 or >= 1) then the λ value used is chosen dynamically based on the query length using the formula:
(query_length - 1) / 10.0
The result is clamped to 0.1 for query_length <= 2, and to 0.7 for query_length >= 8.
References Xapian::Weight::COLLECTION_FREQ, Xapian::Weight::DOC_LENGTH, Xapian::Weight::DOC_LENGTH_MIN, Xapian::Weight::need_stat(), Xapian::Weight::QUERY_LENGTH, Xapian::Weight::TOTAL_LENGTH, Xapian::Weight::WDF, Xapian::Weight::WDF_MAX, and Xapian::Weight::WQF.
Referenced by create_from_parameters(), and unserialise().
|
virtual |
Create from a human-readable parameter string.
| params | string containing weighting scheme parameter values. |
Reimplemented from Xapian::Weight.
References LMJMWeight().
|
virtual |
Return an upper bound on what get_sumpart() can return for any document.
This information is used by the matcher to perform various optimisations, so strive to make the bound as tight as possible.
Implements Xapian::Weight.
|
virtual |
Calculate the weight contribution for this object's term to a document.
The parameters give information about the document which may be used in the calculations:
| wdf | The within document frequency of the term in the document. You need to call need_stat(WDF) if you use this value. |
| doclen | The document's length (unnormalised). You need to call need_stat(DOC_LENGTH) if you use this value. |
| uniqterms | Number of unique terms in the document. You need to call need_stat(UNIQUE_TERMS) if you use this value. |
| wdfdocmax | Maximum wdf value in the document. You need to call need_stat(WDF_DOC_MAX) if you use this value. |
You can rely on wdf <= doclen if you call both need_stat(WDF) and need_stat(DOC_LENGTH) - this is trivially true for terms, but Xapian also ensures it's true for OP_SYNONYM, where the wdf is approximated.
Implements Xapian::Weight.
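As a rough illustration of how a per-term contribution and its upper bound relate, the sketch below uses the rank-equivalent Jelinek-Mercer form from Zhai and Lafferty. It is an assumption that this matches what LMJMWeight computes; the real implementation may scale or arrange the terms differently. TermStats, jm_sumpart and jm_maxpart are hypothetical names, and the collection frequency is assumed to be non-zero.

```cpp
#include <algorithm>
#include <cmath>

// Hypothetical container for per-term statistics; not a Xapian type.
struct TermStats {
    double lambda;           // smoothing parameter, strictly between 0 and 1
    double wqf;              // within-query frequency of the term
    double collection_freq;  // occurrences of the term in the collection (> 0)
    double total_length;     // total length of all documents in the collection
};

// Textbook Jelinek-Mercer contribution of one term to one document.
double jm_sumpart(const TermStats& s, double wdf, double doclen) {
    if (wdf == 0.0 || doclen == 0.0) return 0.0;
    double p_doc = wdf / doclen;                          // maximum likelihood model
    double p_coll = s.collection_freq / s.total_length;   // collection model
    double ratio = ((1.0 - s.lambda) / s.lambda) * (p_doc / p_coll);
    return s.wqf * std::log(1.0 + ratio);
}

// Upper bound on jm_sumpart() over all documents. The weight grows with
// wdf / doclen, and since wdf <= wdf_max, wdf <= doclen and
// doclen >= doclen_min, that fraction is at most
// wdf_max / max(wdf_max, doclen_min).
double jm_maxpart(const TermStats& s, double wdf_max, double doclen_min) {
    return jm_sumpart(s, wdf_max, std::max(wdf_max, doclen_min));
}
```

The bound depends only on collection-level statistics, so it can be computed once per term rather than once per document.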
|
virtual |
Return the name of this weighting scheme, e.g. "bm25+".
This is the name that the weighting scheme gets registered under when passed to Xapian::Registry::register_weighting_scheme().
As a result:
For 1.4.x and earlier we recommended returning the full namespace-qualified name of your class here, but now we recommend returning just the name in lower case, e.g. "foo" instead of "FooWeight", "bm25+" instead of "Xapian::BM25PlusWeight".
If you don't want to support creation via Weight::create() or the remote backend, you can use the default implementation which simply returns an empty string.
Reimplemented from Xapian::Weight.
|
virtual |
Return this object's parameters serialised as a single string.
If you don't want to support the remote backend, you can use the default implementation which simply throws Xapian::UnimplementedError.
Reimplemented from Xapian::Weight.
|
virtual |
Unserialise parameters.
This method unserialises parameters serialised by the serialise() method and allocates and returns a new object initialised with them.
If you don't want to support the remote backend, you can use the default implementation which simply throws Xapian::UnimplementedError.
Note that the returned object will be deallocated by Xapian after use with "delete". If you want to handle the deletion in a special way (for example when wrapping the Xapian API for use from another language) then you can define a static operator delete method in your subclass as shown here: https://trac.xapian.org/ticket/554#comment:1
| serialised | A string containing the serialised parameters. |
Reimplemented from Xapian::Weight.
References LMJMWeight().