
I'm trying to use Lucene to calculate similarities between a number of documents. For the similarity calculation I'm using BM25 and the VSM (vector space model).

Besides Lucene, I'm using GATE, an open-source framework for language-processing tasks.

While calculating similarities between the 15 documents, I encountered some strange behavior.

With VSM my results look like:

Post-processing links before ranking
Ranking all links by similarities
3/54 links above similarity 0.15 threshold
54/54 top-most 1.0 similar links
Post-processing links after ranking
Traced 3 link(s) in 9x6 space:
Link = [12695.xml(0,58320)@Bug[15009] | 12713.xml(0,18247)@Feature[1974]]@[1.6188]
Link = [5822.xml(0,10098)@Bug[1434] | 12713.xml(0,18247)@Feature[1974]]@[1.5119]
Link = [12694.xml(0,1504)@Bug[188] | 12713.xml(0,18247)@Feature[1974]]@[0.2702]
Clearing previous runtime results...

Score breakdown:
6.860396E-7 = (MATCH) max of:
  0.0 = (MATCH) MatchAllDocsQuery, product of:
    0.0 = boost
    0.0032560423 = queryNorm
  6.860396E-7 = (MATCH) product of:
    0.0034322562 = (MATCH) sum of:
      0.0017054792 = (MATCH) weight(TERM:http in 1) [DefaultSimilarity], result of:
        0.0017054792 = score(doc=1,freq=2.0), product of:
          0.0045762537 = queryWeight, product of:
            1.4054651 = idf(docFreq=3, maxDocs=6)
            0.0032560423 = queryNorm
          0.37268022 = fieldWeight in 1, product of:
            1.4142135 = tf(freq=2.0), with freq of:
              2.0 = termFreq=2.0
            1.4054651 = idf(docFreq=3, maxDocs=6)
            0.1875 = fieldNorm(doc=1)
      8.6338853E-4 = (MATCH) weight(TERM:use in 1) [DefaultSimilarity], result of:
        8.6338853E-4 = score(doc=1,freq=2.0), product of:
          0.0032560423 = queryWeight, product of:
            1.0 = idf(docFreq=5, maxDocs=6)
            0.0032560423 = queryNorm
          0.26516503 = fieldWeight in 1, product of:
            1.4142135 = tf(freq=2.0), with freq of:
              2.0 = termFreq=2.0
            1.0 = idf(docFreq=5, maxDocs=6)
            0.1875 = fieldNorm(doc=1)
      8.6338853E-4 = (MATCH) weight(TERM:use in 1) [DefaultSimilarity], result of:
        8.6338853E-4 = score(doc=1,freq=2.0), product of:
          0.0032560423 = queryWeight, product of:
            1.0 = idf(docFreq=5, maxDocs=6)
            0.0032560423 = queryNorm
          0.26516503 = fieldWeight in 1, product of:
            1.4142135 = tf(freq=2.0), with freq of:
              2.0 = termFreq=2.0
            1.0 = idf(docFreq=5, maxDocs=6)
            0.1875 = fieldNorm(doc=1)
    1.9988007E-4 = coord(3/15009)
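
To make the breakdown above easier to follow, here is a minimal sketch that recomputes it by hand. It assumes the classic TF-IDF formulas (`tf = sqrt(freq)`, `idf = 1 + ln(maxDocs / (docFreq + 1))`) and takes `queryNorm`, `fieldNorm`, and the `coord` fraction straight from the explanation; the class and method names are made up for illustration and are not part of Lucene.

```java
final class VsmScoreCheck {
    // idf as it appears in the explanation: 1 + ln(maxDocs / (docFreq + 1))
    static double idf(int docFreq, int maxDocs) {
        return 1.0 + Math.log(maxDocs / (double) (docFreq + 1));
    }

    // tf as it appears in the explanation: sqrt(termFreq)
    static double tf(double freq) {
        return Math.sqrt(freq);
    }

    // Per-term score = queryWeight * fieldWeight
    static double termScore(double freq, int docFreq, int maxDocs,
                            double queryNorm, double fieldNorm) {
        double idf = idf(docFreq, maxDocs);
        double queryWeight = idf * queryNorm;
        double fieldWeight = tf(freq) * idf * fieldNorm;
        return queryWeight * fieldWeight;
    }

    // Sum of the three matching clauses, scaled by coord(3/15009)
    static double total() {
        double queryNorm = 0.0032560423; // copied from the explanation
        double fieldNorm = 0.1875;       // copied from the explanation
        double sum = termScore(2, 3, 6, queryNorm, fieldNorm)       // TERM:http
                   + 2 * termScore(2, 5, 6, queryNorm, fieldNorm);  // TERM:use, listed twice
        double coord = 3.0 / 15009.0;
        return sum * coord;
    }

    public static void main(String[] args) {
        System.out.println("VSM score recomputed by hand: " + total());
    }
}
```

The result lands close to the 6.860396E-7 shown above (Lucene works in floats, so the last digits can differ). Both `queryNorm` and `coord` are far below 1, which is what pushes the VSM score down so much.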

With BM25 I'm getting much larger scores:

Post-processing links before ranking
Ranking all links by similarities
40/54 links above similarity 0.15 threshold
54/54 top-most 1.0 similar links
Post-processing links after ranking
Traced 40 link(s) in 9x6 space:
Link = [12695.xml(0,58320)@Bug[15009] | 12713.xml(0,18247)@Feature[1974]]@[10768.2471]
Link = [5822.xml(0,10098)@Bug[1434] | 12713.xml(0,18247)@Feature[1974]]@[1798.1300]
Link = [12695.xml(0,58320)@Bug[15009] | 13091.xml(0,1721)@Feature[216]]@[965.0315]
Link = [5822.xml(0,10098)@Bug[1434] | 13091.xml(0,1721)@Feature[216]]@[372.0819]
Link = [12694.xml(0,1504)@Bug[188] | 12713.xml(0,18247)@Feature[1974]]@[174.2649]
Link = [12695.xml(0,58320)@Bug[15009] | 12700.xml(0,410)@Feature[36]]@[97.6378]
Link = [5822.xml(0,10098)@Bug[1434] | 1910.xml(0,237)@Feature[21]]@[46.4066]
Link = [12694.xml(0,1504)@Bug[188] | 13091.xml(0,1721)@Feature[216]]@[35.8532]
Link = [5822.xml(0,10098)@Bug[1434] | 12701.xml(0,137)@Feature[14]]@[29.6364]
Link = [12698.xml(0,362)@Bug[56] | 12713.xml(0,18247)@Feature[1974]]@[22.4652]
Link = [132.xml(0,409)@Bug[33] | 12713.xml(0,18247)@Feature[1974]]@[21.1697]
Link = [5822.xml(0,10098)@Bug[1434] | 12700.xml(0,410)@Feature[36]]@[16.7317]
Link = [132.xml(0,409)@Bug[33] | 13091.xml(0,1721)@Feature[216]]@[15.8749]
Link = [12697.xml(0,257)@Bug[34] | 12713.xml(0,18247)@Feature[1974]]@[15.5943]
Link = [12696.xml(0,272)@Bug[40] | 12713.xml(0,18247)@Feature[1974]]@[14.8670]
Link = [5822.xml(0,10098)@Bug[1434] | 12702.xml(0,88)@Feature[9]]@[14.8045]
Link = [12694.xml(0,1504)@Bug[188] | 1910.xml(0,237)@Feature[21]]@[13.8415]
Link = [12694.xml(0,1504)@Bug[188] | 12700.xml(0,410)@Feature[36]]@[11.7942]
Link = [12703.xml(0,331)@Bug[43] | 12713.xml(0,18247)@Feature[1974]]@[11.2949]
Link = [12699.xml(0,616)@Bug[67] | 12713.xml(0,18247)@Feature[1974]]@[9.4193]
Link = [12695.xml(0,58320)@Bug[15009] | 12701.xml(0,137)@Feature[14]]@[8.6146]
Link = [12699.xml(0,616)@Bug[67] | 13091.xml(0,1721)@Feature[216]]@[7.1386]
Link = [12695.xml(0,58320)@Bug[15009] | 1910.xml(0,237)@Feature[21]]@[5.9274]
Link = [12698.xml(0,362)@Bug[56] | 13091.xml(0,1721)@Feature[216]]@[4.4054]
Link = [12699.xml(0,616)@Bug[67] | 12700.xml(0,410)@Feature[36]]@[4.0292]
Link = [12703.xml(0,331)@Bug[43] | 13091.xml(0,1721)@Feature[216]]@[3.3257]
Link = [12696.xml(0,272)@Bug[40] | 13091.xml(0,1721)@Feature[216]]@[2.5366]
Link = [12695.xml(0,58320)@Bug[15009] | 12702.xml(0,88)@Feature[9]]@[2.2157]
Link = [12699.xml(0,616)@Bug[67] | 1910.xml(0,237)@Feature[21]]@[2.0420]
Link = [12697.xml(0,257)@Bug[34] | 13091.xml(0,1721)@Feature[216]]@[0.9461]
Link = [12694.xml(0,1504)@Bug[188] | 12702.xml(0,88)@Feature[9]]@[0.9092]
Link = [12694.xml(0,1504)@Bug[188] | 12701.xml(0,137)@Feature[14]]@[0.8928]
Link = [12697.xml(0,257)@Bug[34] | 12702.xml(0,88)@Feature[9]]@[0.8328]
Link = [12696.xml(0,272)@Bug[40] | 12702.xml(0,88)@Feature[9]]@[0.8328]
Link = [12703.xml(0,331)@Bug[43] | 12702.xml(0,88)@Feature[9]]@[0.8328]
Link = [12698.xml(0,362)@Bug[56] | 12702.xml(0,88)@Feature[9]]@[0.8328]
Link = [12703.xml(0,331)@Bug[43] | 12701.xml(0,137)@Feature[14]]@[0.8178]
Link = [12698.xml(0,362)@Bug[56] | 12701.xml(0,137)@Feature[14]]@[0.8178]
Link = [12696.xml(0,272)@Bug[40] | 12701.xml(0,137)@Feature[14]]@[0.8178]
Link = [12697.xml(0,257)@Bug[34] | 12701.xml(0,137)@Feature[14]]@[0.8178]

BM25 links almost everything (40 of 54 pairs) because of the "good", i.e. high, scores. The explanation looks as follows:

Score breakdown:
2.2157059 = (MATCH) max of:
  0.0 = (MATCH) MatchAllDocsQuery, product of:
    0.0 = boost
    1.0 = queryNorm
  2.2157059 = (MATCH) sum of:
    1.3065486 = (MATCH) weight(TERM:http in 1) [BM25Similarity], result of:
      1.3065486 = score(doc=1,freq=2.0 = termFreq=2.0), product of:
        0.6931472 = idf(docFreq=3, maxDocs=6)
        1.8849511 = tfNorm, computed from:
          2.0 = termFreq=2.0
          1.2 = parameter k1
          0.75 = parameter b
          746.8333 = avgFieldLength
          28.444445 = fieldLength
    0.4545787 = (MATCH) weight(TERM:use in 1) [BM25Similarity], result of:
      0.4545787 = score(doc=1,freq=2.0 = termFreq=2.0), product of:
        0.24116206 = idf(docFreq=5, maxDocs=6)
        1.8849511 = tfNorm, computed from:
          2.0 = termFreq=2.0
          1.2 = parameter k1
          0.75 = parameter b
          746.8333 = avgFieldLength
          28.444445 = fieldLength
    0.4545787 = (MATCH) weight(TERM:use in 1) [BM25Similarity], result of:
      0.4545787 = score(doc=1,freq=2.0 = termFreq=2.0), product of:
        0.24116206 = idf(docFreq=5, maxDocs=6)
        1.8849511 = tfNorm, computed from:
          2.0 = termFreq=2.0
          1.2 = parameter k1
          0.75 = parameter b
          746.8333 = avgFieldLength
          28.444445 = fieldLength
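
The BM25 breakdown can be recomputed the same way. This is a rough sketch, assuming `idf = ln(1 + (maxDocs - docFreq + 0.5) / (docFreq + 0.5))` and `tfNorm = freq * (k1 + 1) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength))`; all inputs are taken from the explanation, and the class and method names are made up for illustration.

```java
final class Bm25ScoreCheck {
    // idf as it appears in the explanation
    static double idf(int docFreq, int maxDocs) {
        return Math.log(1.0 + (maxDocs - docFreq + 0.5) / (docFreq + 0.5));
    }

    // Length-normalized term frequency with parameters k1 and b
    static double tfNorm(double freq, double k1, double b,
                         double fieldLength, double avgFieldLength) {
        double lengthFactor = 1.0 - b + b * fieldLength / avgFieldLength;
        return freq * (k1 + 1.0) / (freq + k1 * lengthFactor);
    }

    // Sum of the three matching clauses; there is no coord or queryNorm scaling here
    static double total() {
        double k1 = 1.2, b = 0.75;                              // defaults from the question
        double fieldLength = 28.444445, avgFieldLength = 746.8333; // copied from the explanation
        double norm = tfNorm(2.0, k1, b, fieldLength, avgFieldLength);
        return idf(3, 6) * norm        // TERM:http
             + 2 * idf(5, 6) * norm;   // TERM:use, listed twice
    }

    public static void main(String[] args) {
        System.out.println("BM25 score recomputed by hand: " + total());
    }
}
```

The result is close to the 2.2157059 shown above. Unlike the VSM breakdown, nothing here multiplies the sum by a small factor, so the per-term products are added up unscaled.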

For debugging I deactivated term boosts and other adjustments to see the raw results. Normally all values are clamped to the range [0, 1]: anything above 1 is set to 1 and anything below 0 is set to 0.
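
A minimal sketch of that clamping step, using a few of the raw scores listed above. The helper names and the way the 0.15 threshold is applied are assumptions for illustration, not the actual code:

```java
final class ScoreClamp {
    // Force a raw score into [0, 1]
    static double clamp(double score) {
        return Math.max(0.0, Math.min(1.0, score));
    }

    // Count how many clamped scores lie above the link threshold
    static int countAbove(double[] rawScores, double threshold) {
        int count = 0;
        for (double raw : rawScores) {
            if (clamp(raw) > threshold) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        double[] bm25 = {10768.2471, 174.2649, 2.2157, 0.8178}; // sample BM25 scores from the list
        double[] vsm = {1.6188, 0.2702, 6.860396E-7};           // sample VSM scores
        System.out.println("BM25 above 0.15: " + countAbove(bm25, 0.15));
        System.out.println("VSM above 0.15: " + countAbove(vsm, 0.15));
    }
}
```

With clamping, a BM25 score of 10768.2471 and one of 2.2157 both become 1.0, so the clamped values no longer distinguish strong matches from weak ones.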

I'm using Lucene 5.0.0. The documents are ordinary tickets that contain references to other tickets.

The similarities are implemented as:

new BM25Similarity(k1, b); where k1 = 1.2 and b = 0.75 (defaults). (BM25)
new DefaultSimilarity() (VSM)

How is it possible that the scores are so different? As far as I can see, everything computed by the VSM is much smaller.

Has anyone encountered this behavior before?

I'd appreciate any kind of help!

-- Edit

I'm also wondering why queryNorm is equal to 1.0 in every BM25 query, while in the VSM it is different for each query.

According to this: Lucene scoring: in what context is queryNorm used?

queryNorm(q) is a normalizing factor used to make scores between queries comparable. This factor does not affect document ranking (since all ranked documents are multiplied by the same factor), but rather just attempts to make scores from different queries (or even different indexes) comparable.

Shouldn't it always be the same, then?
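
To illustrate the quoted claim that queryNorm does not affect ranking, here is a small sketch. The raw scores are invented for the example; the two queryNorm values are the ones from the explanations above (0.0032560423 for the VSM, 1.0 for BM25), and the class name is hypothetical.

```java
import java.util.Arrays;
import java.util.Comparator;

final class QueryNormDemo {
    // Returns document indices ordered by descending (rawScore * queryNorm)
    static Integer[] rank(double[] rawScores, double queryNorm) {
        Integer[] order = new Integer[rawScores.length];
        for (int i = 0; i < order.length; i++) {
            order[i] = i;
        }
        Arrays.sort(order,
            Comparator.comparingDouble((Integer i) -> -rawScores[i] * queryNorm));
        return order;
    }

    public static void main(String[] args) {
        double[] raw = {0.52, 1.05, 0.08, 0.77}; // hypothetical unnormalized scores
        System.out.println(Arrays.toString(rank(raw, 0.0032560423)));
        System.out.println(Arrays.toString(rank(raw, 1.0)));
    }
}
```

Both calls print the same order, because every score is multiplied by the same positive constant. The absolute values change, though, which matters as soon as scores are compared against a fixed threshold such as 0.15.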
