I am trying to order a number of short paragraphs by their agreement with a list of keywords. This is used to provide a user with the text ordered by interest.

Let's assume I already have the list of keywords, hopefully reflecting the user's interests. I thought this was a fairly standard procedure and expected to find a Python package for it, but so far my Google search has not been very successful.

I can easily come up with a brute-force solution myself, but I was wondering whether somebody knows an efficient way to do this.

EDIT: OK, here is an example:

keywords = ['cats', 'food', 'Miau']

text1 = 'This is text about dogs'
text2 = 'This is text about food'
text3 = 'This is text about cat food'

I need a procedure that produces the order text3, text2, text1. Thanks.

carl
  • What does "agreement" mean here? Can you post an example? – Daniel Roseman Oct 22 '15 at 20:25
  • ordering by how many of the (let's say) 100 keywords are found in the text – carl Oct 22 '15 at 20:26
  • Could you provide example inputs and expected output? I'm afraid your question isn't very clear to me. Also, "please point me to a library" and "please write code for me" type questions are both disallowed by site policy, so you want to avoid sounding like either. Please [edit] the question to update it. – tripleee Oct 22 '15 at 20:27

2 Answers

This is the simplest thing I can think of:

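import string

# Keywords to look for (example values taken from the question).
list_of_keywords = ['cats', 'food', 'Miau']

# Read the whole document into memory.
with open('document.txt', 'r') as infile:
    text = infile.read()

# Strip punctuation so that "food." still matches "food" (Python 3 syntax).
text = text.translate(str.maketrans('', '', string.punctuation))

wordlist = text.split()
agreement_cnt = 0

# Add up how often each keyword occurs in the text.
for word in list_of_keywords:
    agreement_cnt += wordlist.count(word)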

I got the punctuation-removal bit from here: Best way to strip punctuation from a string in Python.
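
A possible refinement, offered only as a sketch: calling wordlist.count(word) once per keyword rescans the whole word list for every keyword. Putting the keywords into a set allows a single pass over the words instead, and gives the same total as long as the keyword list contains no duplicates. The helper name count_keyword_hits is made up for this illustration, and the code assumes Python 3.

```python
import string


def count_keyword_hits(text, keywords):
    """Return how many words in `text` are keywords (repeats are counted).

    `count_keyword_hits` is a hypothetical helper name, not a library function.
    """
    keyword_set = set(keywords)  # set membership tests are O(1) on average
    # Strip punctuation, then split on whitespace.
    cleaned = text.translate(str.maketrans('', '', string.punctuation))
    return sum(1 for word in cleaned.split() if word in keyword_set)


keywords = ['cats', 'food', 'Miau']
print(count_keyword_hits('Good food, more food; no cats.', keywords))  # prints 3
```

The work is then proportional to the number of words plus the number of keywords, rather than to their product.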

Casey P
  • While we await OP's edit: isn't this the brute force approach? – Jongware Oct 22 '15 at 20:36
  • oops I didn't read the original question that closely apparently haha. I wonder what wouldn't be considered "the brute force approach" then. I don't think there is a way around checking each word versus each keyword. I think I might be missing the point of his request. – Casey P Oct 22 '15 at 20:41
  • yep that solution will certainly work... do we all agree that this is also optimal? – carl Oct 22 '15 at 20:41
  • just realized punctuation would break my first crack at it. See the edit above. – Casey P Oct 22 '15 at 20:49

Something like this might be a good starting point:

>>> keywords = ['cats', 'food', 'Miau']
>>> text1 = 'This is a text about food fed to cats'
>>> matched_word_count = len(set(text1.split()).intersection(set(keywords)))
>>> print(matched_word_count)
2

If you want to correct for capitalization or capture word forms (e.g. 'cat' instead of 'cats'), there's obviously more to consider, though.
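
As a rough illustration of that point, here is a minimal sketch that lowercases every word and strips a single trailing 's' before comparing. The helpers normalize and matched_keywords are hypothetical names, and the plural handling is a deliberately crude stand-in for real stemming (it also mangles words such as 'This').

```python
def normalize(word):
    """Lowercase a word and strip one trailing 's' (crude, not real stemming)."""
    word = word.lower()
    if len(word) > 3 and word.endswith('s'):
        word = word[:-1]
    return word


def matched_keywords(text, keywords):
    """Return the set of normalized keywords that occur in `text`."""
    text_words = {normalize(w) for w in text.split()}
    keyword_forms = {normalize(k) for k in keywords}
    return text_words & keyword_forms


keywords = ['cats', 'food', 'Miau']
print(sorted(matched_keywords('This is text about cat food', keywords)))
# prints ['cat', 'food']
```

With exact matching, the same text would only match 'food', because 'cat' and 'cats' are different strings.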

Taking the above and capturing match counts for a list of different strings, and then sorting the results to find the "best" match, should be relatively simple.
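
One way this could look, as a sketch under stated assumptions rather than a definitive implementation: score each text by the number of distinct keywords it contains, then sort by that score. The names rank_texts and singular are hypothetical; the normalize parameter is an assumption added so that the caller can decide how words are compared, and singular is a crude demo normalizer, not a real stemmer.

```python
def rank_texts(texts, keywords, normalize=str.lower):
    """Return `texts` sorted by number of distinct matched keywords, best first."""
    keyword_forms = {normalize(k) for k in keywords}

    def score(text):
        # Count the distinct keywords that appear in this text.
        return len({normalize(w) for w in text.split()} & keyword_forms)

    # sorted() is stable, so texts with equal scores keep their input order.
    return sorted(texts, key=score, reverse=True)


def singular(word):
    """Crude demo normalizer: lowercase and drop trailing 's' characters."""
    return word.lower().rstrip('s')


keywords = ['cats', 'food', 'Miau']
texts = [
    'This is text about dogs',
    'This is text about food',
    'This is text about cat food',
]

for text in rank_texts(texts, keywords, normalize=singular):
    print(text)
```

With the singular normalizer this prints the 'cat food' text first, then the 'food' text, then the 'dogs' text. With the default str.lower, the 'food' and 'cat food' texts tie at one match each and stay in their input order.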

twalberg