
I have two lists, one containing true values selected by humans and a second list with extracted values. I would like to measure how well the pipeline is performing based on how many true values are contained in the extracted list. Example:

extracted_value = ["value", "of", "words", "that", "were", "tracked"]
real_value = ["value", "words", "that"]

I need a metric that describes: 3 out of 3 real values were extracted

For multiple documents:

  • 5 out of 10 real values were extracted
  • 2 out of 3 real values were extracted
  • 1 out of 9 real values were extracted

Based on the individual comparison, can I get a score that describes how well the extracted keywords perform on average across all documents?

3 Answers


Will something simple like this work?

score = len([x for x in real_value if x in extracted_value])/len(extracted_value)
print(score)
>>> 0.5
svfat
  • Thanks, this helped: I just changed the lists: score = len([x for x in extracted_value if x in real_value])/len(real_value) print(score) and this works. Do you have an idea how to average over all documents? – eliza nyambu Nov 30 '22 at 08:12
  • How did you store the data for your document set? The one in the example is for a single document only, right? Then you can take the sum of all per-document scores and divide it by the number of documents (see the sketch below). – svfat Nov 30 '22 at 08:18
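A minimal sketch of the averaging described in the comments, assuming each document is stored as an (extracted, real) pair of lists; the documents structure and the second example pair are made up purely for illustration:

# One (extracted, real) pair per document; this structure is just an illustration.
documents = [
    (["value", "of", "words", "that", "were", "tracked"], ["value", "words", "that"]),
    (["other", "extracted", "keywords"], ["other", "missing", "keywords"]),
]

# Per-document score: fraction of real values found in the extracted list.
scores = [
    len([x for x in real if x in extracted]) / len(real)
    for extracted, real in documents
]

# Macro-average: sum of per-document scores divided by the number of documents.
average_score = sum(scores) / len(scores)
print(average_score)  # (1.0 + 2/3) / 2 ≈ 0.83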

To check how many values are shared between extracted_value and real_value, I believe you're looking for the recall of your model. You can use set operations, specifically & (intersection), divided by the size of your ground truth (real_value):

recall = len(set(real_value) & set(extracted_value))/len(real_value)

Or, if you want to know exactly which values are shared (you can always take the len of the result):

shared_vals = set(real_value) & set(extracted_value)

If you want to then calculate recall with shared_vals:

recall = len(shared_vals)/len(real_value)
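
For example, with the lists from the question this gives a recall of 1.0, since all three real values appear in the extracted list:

extracted_value = ["value", "of", "words", "that", "were", "tracked"]
real_value = ["value", "words", "that"]

shared_vals = set(real_value) & set(extracted_value)  # {"value", "words", "that"}
recall = len(shared_vals) / len(real_value)           # 3 / 3 = 1.0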
Cybergenik

The metric you're looking for is recall. @svfat's solution works well for a single document; you can then get the average over multiple documents by summing the per-document scores and dividing by the number of documents.

For more advanced scoring for your retrieval, check the F-Score section of the linked article.
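
As a rough sketch of what that looks like in code (this is the standard F1 formula, the harmonic mean of precision and recall; the variable names follow the earlier snippets):

# Precision: shared values divided by how many values were extracted.
precision = len(set(real_value) & set(extracted_value)) / len(extracted_value)
# Recall: shared values divided by how many real values exist.
recall = len(set(real_value) & set(extracted_value)) / len(real_value)
# F1: harmonic mean of precision and recall (guarding against division by zero).
f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0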

Lukas Schmid