How do I compare these two strings in python?

Question

While crawling an RSS feed, I do not want duplicate items added to my list. The problem is that some duplicates are not detected by my `if title not in mylist` check because the titles are slightly different. Nonetheless, the two news items are basically the same. Take a look at these two:

"Kom igjen, norsk ungdom, de eldre trenger oss!" and
"Kom igjen norsk ungdom, de eldre trenger oss"

As you can see, the first one has a comma after "Kom igjen" and an exclamation mark at the end, while the second one has neither.

Since the items have no other unique id, I do not know how to detect duplicates like the one above.

Have you tried using a `filter()` to remove the punctuation? — TigerhawkT3, Jun 15 '15 at 21:46
so, where do you draw the line... are you looking to find out how to compare two strings ignoring punctuation (and probably whitespace and case)? — Foon, Jun 15 '15 at 21:46

score 4 · Accepted Answer · answered Jun 15 '15 at 22:42

Python has a built-in SequenceMatcher in the standard-library difflib module:

>>> from difflib import SequenceMatcher
>>> SequenceMatcher(None, "Hello you!", "Hello you").ratio()
0.9473684210526315
>>> SequenceMatcher(None, "Apple", "Orange").ratio()
0.18181818181818182

So you can loop over all the titles and compare each ratio against some threshold.
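
A minimal sketch of such a loop, assuming the titles arrive as a list of strings. The helper name `dedupe_titles` and the 0.9 threshold are illustrative choices, not part of difflib; the threshold would need tuning against real feed data.

```python
from difflib import SequenceMatcher


def dedupe_titles(titles, threshold=0.9):
    """Keep the first title of each group of near-duplicates.

    `threshold` is a hypothetical cut-off: two titles whose similarity
    ratio is at or above it are treated as the same news item.
    """
    kept = []
    for title in titles:
        is_duplicate = any(
            SequenceMatcher(None, title, seen).ratio() >= threshold
            for seen in kept
        )
        if not is_duplicate:
            kept.append(title)
    return kept


titles = [
    "Kom igjen, norsk ungdom, de eldre trenger oss!",
    "Kom igjen norsk ungdom, de eldre trenger oss",
    "Apple",
]
unique_titles = dedupe_titles(titles)
print(unique_titles)
```

Each new title is compared against every title already kept, so the cost grows quadratically with the number of items. That is usually acceptable for a single feed but may be too slow for a large archive.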

Mazdak · Answer 2 · 2015-06-15T22:14:47.397

score 1

You can use the `str.translate` method to remove punctuation before you add a news item to your list (Python 2):

>>> import string
>>> s1 = "Kom igjen, norsk ungdom, de eldre trenger oss!"
>>> s1.translate(None, string.punctuation)
'Kom igjen norsk ungdom de eldre trenger oss'

That way your texts are compared with the punctuation ignored.

In Python 3 you can do:

>>> s1.translate(dict.fromkeys(map(ord, string.punctuation), None))
'Kom igjen norsk ungdom de eldre trenger oss'
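
Building on this, here is a hedged Python 3 sketch that uses the stripped title as a lookup key, so each new item costs one set lookup instead of a scan of the whole list. The `normalize` helper and the extra lowercasing and whitespace collapsing are assumptions added for illustration; `str.maketrans` with three arguments is a standard alternative to the `dict.fromkeys` table above.

```python
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def normalize(title):
    """Strip ASCII punctuation, fold case, and collapse runs of whitespace."""
    stripped = title.translate(_PUNCT_TABLE)
    return " ".join(stripped.casefold().split())


seen = set()          # normalized keys of titles already accepted
unique_titles = []    # original titles, first occurrence only
for title in [
    "Kom igjen, norsk ungdom, de eldre trenger oss!",
    "Kom igjen norsk ungdom, de eldre trenger oss",
]:
    key = normalize(title)
    if key not in seen:
        seen.add(key)
        unique_titles.append(title)

print(unique_titles)
```

Note that `string.punctuation` only covers ASCII punctuation, so typographic quotes or dashes in a feed would survive this normalization and might need separate handling.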

edited Jun 15 '15 at 22:14

answered Jun 15 '15 at 21:49

Mazdak


Seems like switching to all lower or upper case and removing any whitespace could also be a good idea? – Jeff B Jun 15 '15 at 21:53
I got an error in Python 3. I think `string.punctuation` is not valid in Python 3. – Zip Jun 15 '15 at 22:04
@JeffBridgman Yeah, indeed! Depending on the type of news, there could be other options. – Mazdak Jun 15 '15 at 22:04
