I was reading about the BM25 algorithm and have a question, prompted by an image, about how its IDF is calculated.

The image below is presented as showing the difference between the IDF of BM25 and the IDF of TF-IDF.

[image: IDF of BM25 compared with IDF of TF-IDF]

The IDF formula for TF-IDF and the IDF formula for BM25 are shown below.

IDF = Math.log(N / df) // TF-IDF


IDF = Math.log(1 + (N - df + 0.5) / (df + 0.5)) // BM25
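
To make the two formulas concrete, here is a minimal Python sketch that evaluates both for a few document frequencies. The function names `tfidf_idf` and `bm25_idf` and the sample values of N and df are illustrative assumptions, not taken from any library.

```python
import math


def tfidf_idf(n_docs: int, df: int) -> float:
    """IDF as written above for TF-IDF: log(N / df)."""
    return math.log(n_docs / df)


def bm25_idf(n_docs: int, df: int) -> float:
    """IDF as written above for BM25: log(1 + (N - df + 0.5) / (df + 0.5))."""
    return math.log(1 + (n_docs - df + 0.5) / (df + 0.5))


if __name__ == "__main__":
    n_docs = 100  # assumed collection size, chosen only for illustration
    for df in (1, 10, 50, 100):
        print(df, round(tfidf_idf(n_docs, df), 3), round(bm25_idf(n_docs, df), 3))
```

Both values fall as df grows. With these formulas the TF-IDF value reaches 0 at df = N, while the BM25 value stays above 0.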

It seems that a graph like the one in the image above cannot be produced with BM25's IDF formula. Am I missing something?

I tried to plot both curves with Python:


import math
import matplotlib.pyplot as plt

N = 100  # total number of documents
dfs = list(range(1, 17))  # document frequencies to plot

# BM25 IDF: log(1 + (N - df + 0.5) / (df + 0.5))
bm25_idf = [math.log(1 + (N - df + 0.5) / (df + 0.5)) for df in dfs]
plt.plot(dfs, bm25_idf, label='BM25_IDF')

# TF-IDF IDF, here with df + 1 in the denominator: log(N / (df + 1))
tfidf_idf = [math.log(N / (df + 1)) for df in dfs]
plt.plot(dfs, tfidf_idf, label='idf')

plt.legend()
plt.show()

[image: plot produced by the code above]

SAXYCOW

1 Answer

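With my limited mathematical knowledge, I found that a graph like the one in the question can be produced using the probabilistic IDF, log((N - df + 0.5) / (df + 0.5)), which does not add 1 inside the logarithm.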

The code and graph used are shown below.

import math
import matplotlib.pyplot as plt

N = 300  # total number of documents
dfs = list(range(1, 300))  # document frequencies to plot

# TF-IDF IDF, with df + 1 in the denominator: log(N / (df + 1))
tfidf_idf = [math.log(N / (df + 1)) for df in dfs]
plt.plot(dfs, tfidf_idf, label='TF-IDF_IDF')

# Probabilistic IDF: log((N - df + 0.5) / (df + 0.5))
bm25_idf = [math.log((N - df + 0.5) / (df + 0.5)) for df in dfs]
plt.plot(dfs, bm25_idf, label='BM25_IDF')

plt.legend()
plt.show()

[image: plot produced by the code above]

However, this still looks a little different from the BM25 IDF formula I know, which adds 1 inside the logarithm. I will have to read more on the topic.
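
One way to see where the difference comes from is to compare the two BM25-style formulas numerically. The sketch below is only an illustration (the function names and the choice of N = 300 are assumptions). It checks that adding 1 inside the logarithm is algebraically the same as log((N + 1) / (df + 0.5)), which is always positive, whereas the probabilistic form without the 1 becomes negative once df exceeds N / 2.

```python
import math


def probabilistic_idf(n_docs, df):
    # log((N - df + 0.5) / (df + 0.5)), the form plotted above
    return math.log((n_docs - df + 0.5) / (df + 0.5))


def bm25_idf_plus_one(n_docs, df):
    # log(1 + (N - df + 0.5) / (df + 0.5)), the form from the question
    return math.log(1 + (n_docs - df + 0.5) / (df + 0.5))


n_docs = 300  # assumed collection size, matching the plot above
for df in (1, 100, 150, 151, 299):
    p = probabilistic_idf(n_docs, df)
    b = bm25_idf_plus_one(n_docs, df)
    # 1 + (N - df + 0.5) / (df + 0.5) simplifies to (N + 1) / (df + 0.5)
    simplified = math.log((n_docs + 1) / (df + 0.5))
    assert math.isclose(b, simplified)
    print(f"df={df:3d}  probabilistic={p:+.3f}  with_plus_one={b:+.3f}")
```

This would explain why a curve that dips below zero for common terms can come from the probabilistic form but not from the formula with the added 1.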

Thanks to all of you who read this question.

SAXYCOW