I am trying to remove stopwords from strings.

I get surprising results with some combinations of words. Below is the smallest example I could make that exhibits this behavior.

#!/usr/bin/env python
# -*- coding: utf-8 -*- 

import re
import json

en = '''
["different","doesn't","doing","don't","done","down","downwards","during","e","each","edu","eg","eight","either","else","elsewhere","enough","entirely","especially","et","etc","even","ever","every","everybody","everyone","everything","everywhere","ex","exactly","example","except","f","far","few","fifth","first","five","followed","following","follows","for","former","formerly","forth","four","from","further","furthermore","g","get","gets","getting","given","gives","go","goes","going","gone","got","gotten","greetings","h","had","hadn't","known","knows","l","last","lately","later","latter","latterly","least","less","lest","let","let's","like","liked","likely","little","look","looking","looks","ltd","m","mainly","many","may","maybe","me","mean","meanwhile","merely","might","more","moreover","most","mostly","much","must","my","myself","n","name","namely","nd","near"]
'''

fr = '''
["a","abord","absolument","afin","ah","ai","aie","ailleurs","ainsi","ait","allaient","allo","allons","allô","alors","anterieur","anterieure","anterieures","apres","après","as","assez","attendu","au","aucun","aucune","aujourd","aujourd'hui","aupres","auquel","aura","auraient","aurait","auront","aussi","autre","autrefois","autrement","autres","autrui","aux","auxquelles","auxquels","avaient","avais","avait","avant","avec","avoir","avons","ayant","b","bah","bas","basee","bat","beau","beaucoup","bien","bigre","boum","bravo","brrr","c","car","ce","ceci","cela","celle","celle-ci","celle-là","celles","celles-ci","celles-là","celui","celui-ci","celui-là","cent","cependant","certain","certaine","certaines","certains","certes","ces","cet","cette","ceux","ceux-ci","ceux-là","chacun","chacune","chaque","cher","chers","chez","chiche","chut","chère","chères","ci","cinq","cinquantaine","cinquante","cinquantième","cinquième","clac","clic","combien","comme","comment","comparable","comparables","compris","concernant","contre","couic","crac","d","da","dans","de","debout","dedans","dehors","deja","delà","depuis","dernier","derniere","derriere","derrière","des","desormais","desquelles","desquels","dessous","dessus","deux","deuxième","deuxièmement","devant","devers","devra","different","differentes","differents","différent","différente","différentes","différents","dire","directe","directement","dit","dite","dits","divers","diverse","diverses","dix","dix-huit","dix-neuf","dix-sept","dixième","doit","doivent","donc","dont","douze","douzième","dring","du","duquel","durant","dès","désormais","e","effet","egale","egalement","egales","eh","elle","elle-même","elles","elles-mêmes","en","encore","enfin","entre","envers","environ","es","est","et","etant","etc","etre","eu","euh","eux","eux-mêmes","exactement","excepté","extenso","exterieur","f","fais","faisaient","faisant","fait","façon","feront","fi","flac","floc","font","g","gens","h","ha","hein","hem","hep","hi","ho","holà","hop","hormis","hors",
"hou","houp","hue","hui","huit","huitième","hum","hurrah","hé","hélas","i","il","ils","importe","j","je","jusqu","jusque","juste","k","l","la","laisser","laquelle","le","lequel","les","lesquelles","lesquels","leur","leurs","longtemps","lors","lorsque","lui","lui-meme","lui-même","là","lès","m","ma","maint","maintenant","oust","ouste","outre","ouvert","ouverte","ouverts","o|","où","p","paf","pan","par","parce","parfois","parle","parlent","parler","parmi","parseme","partant","particulier","particulière","probante","procedant","proche","près","psitt","pu","puis","puisque","pur","pure","q","qu","quand","quant","quant-à-soi","quanta","quarante","quatorze","quatre","quatre-vingt","quatrième","quatrièmement","que","quel","quelconque","quelle","quelles","quelqu'un","quelque","quelques","quels","qui","quiconque","quinze","quoi","quoique","r","rare","rarement","rares","relative","relativement","remarquable","rend","rendre","restant","reste","restent","restrictif","retour","revoici","revoilà","rien","sa","sacrebleu","sait","sans","sapristi","sauf","se","sein","seize","selon","semblable","tres","trois","troisième","troisièmement","trop","vrai"]
'''

stopwords = set(json.loads(en) + json.loads(fr))

stopwordsStr = '|'.join(stopwords)
regex = re.compile(r'\b('+stopwordsStr+r')\b')

msg = "le vrai commentaire sur les vous tres time foobar"
print msg
print regex.sub('', msg)

The code above works as expected:

$ python debug.py 
le vrai commentaire sur les vous tres time foobar
  commentaire sur  vous  time foobar

The stopwords are removed correctly.

Now for the interesting part. If I change the lines defining the English words to this:

en = '''
["time","different","doesn't","doing","don't","done","down","downwards","during","e","each","edu","eg","eight","either","else","elsewhere","enough","entirely","especially","et","etc","even","ever","every","everybody","everyone","everything","everywhere","ex","exactly","example","except","f","far","few","fifth","first","five","followed","following","follows","for","former","formerly","forth","four","from","further","furthermore","g","get","gets","getting","given","gives","go","goes","going","gone","got","gotten","greetings","h","had","hadn't","known","knows","l","last","lately","later","latter","latterly","least","less","lest","let","let's","like","liked","likely","little","look","looking","looks","ltd","m","mainly","many","may","maybe","me","mean","meanwhile","merely","might","more","moreover","most","mostly","much","must","my","myself","n","name","namely","nd","near"]
'''

I just added the keyword "time" at the beginning. I could add it anywhere in the list and it would break the same way.

Now I get:

$ python ../converse/debug.py 
le vrai commentaire sur les vous tres time foobar
le vrai commentaire sur  vous tres time foobar

Now some stopwords are not removed anymore. I really don't get what's going on.

If I remove some words from the stopwords list, it works correctly again, e.g. if I remove "doesn't" from the English list.

MasterScrat
  • Have you tried escaping the words? http://stackoverflow.com/questions/280435/escaping-regex-string-in-python Your problem does sound weird... And for @MosesKoledoye, r specifies `raw string` - http://stackoverflow.com/questions/2081640/what-exactly-do-u-and-r-string-flags-do-in-python-and-what-are-raw-string-l – Yotam Salmon Oct 23 '16 at 23:24
  • @YotamSalmon I already know that, misread that. – Moses Koledoye Oct 23 '16 at 23:30
  • @YotamSalmon adding `stopwords = [ re.escape(stopword) for stopword in stopwords ]` does fix it, thanks... but at this point I'm really more interested in knowing why it behaves like this than about fixing it – MasterScrat Oct 23 '16 at 23:36
  • Your approach is bad. The problem is that you are trying to build a giant alternation with a lot of branches (too many branches). A better approach is to split your message on non-word characters and test whether each word is in the set (or in a dict). – Casimir et Hippolyte Oct 23 '16 at 23:38
  • @MasterScrat your `fr` list contains `"o|"`. Was it supposed to be just `"o"`? – Mikhail M. Oct 23 '16 at 23:43
  • @CasimiretHippolyte my approach is efficient enough for the problem at hand, that's not the problem... or are you saying the "too many branches" are the cause of this behavior? what makes you think that? – MasterScrat Oct 23 '16 at 23:43
  • @DJV indeed, but removing it doesn't change anything. Edited. – MasterScrat Oct 23 '16 at 23:44
  • @MasterScrat If I'd had to blind-guess the reason, it might be the single quote from the `doesn't`. But I can bet I'm wrong. – Yotam Salmon Oct 23 '16 at 23:46
  • @MasterScrat, I don't know. Maybe r'||' was a syntax error or something. In Python 3, changing that changed the behavior for me. – Mikhail M. Oct 23 '16 at 23:48
  • @DJV oh yes, you are correct, I was confused; this does solve the problem. It doesn't explain why this happens with some words and not others, but it does mean the resulting regexp was indeed weird, probably incorrect. – MasterScrat Oct 23 '16 at 23:56
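
The set-lookup approach suggested in the comments above could look roughly like this. It is only a sketch: `remove_stopwords` is a hypothetical helper name, it is written for Python 3, and because it splits on `\W+` it never sees entries containing an apostrophe or hyphen (such as "doesn't" or "celle-ci") as a single token.

```python
import re

def remove_stopwords(msg, stopwords):
    """Drop every word of msg that appears in the stopwords set."""
    # The capturing group makes re.split keep the separators,
    # so the original spacing and punctuation are preserved.
    tokens = re.split(r'(\W+)', msg)
    return ''.join(tok for tok in tokens if tok not in stopwords)

stopwords = {"le", "les", "vrai", "tres", "time", "o|"}
msg = "le vrai commentaire sur les vous tres time foobar"
print(remove_stopwords(msg, stopwords))
```

With this approach a malformed entry such as "o|" is harmless: it is just a set member that never matches any token.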

1 Answer

The fr list contains the entry "o|", which produces '||' in the final regexp. That is an empty alternative: it matches the empty string, so the pattern can succeed without consuming a stopword. Changing "o|" to "o" solves the problem.
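
To illustrate what '||' does, here is a minimal sketch with two tiny hand-written patterns (not the full stopword list). It uses `search` rather than `sub`, because how `sub` continues after an empty match has changed between Python versions. Alternatives are tried left to right, so whether "le" is matched depends on where the empty branch sits. Since the question builds the pattern from a set, whose iteration order can change when an element such as "time" is added, this is a likely explanation for the changing results, though it is not confirmed in this thread.

```python
import re

# "||" in a pattern creates an empty alternative, which matches "".
# Alternatives are tried in order, so position matters.
empty_first = re.compile(r'\b(|le)\b')   # empty branch before "le"
empty_last = re.compile(r'\b(le|)\b')    # empty branch after "le"

msg = "le vrai"
first_match = empty_first.search(msg).group(0)
last_match = empty_last.search(msg).group(0)

print(repr(first_match))  # '' : the empty branch wins, "le" is not consumed
print(repr(last_match))   # 'le'
```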

Alternatively, escape each word with re.escape. Then a stray metacharacter in one word can't break the whole regex.
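
A minimal sketch of the escaping approach, written for Python 3. `build_stopword_regex` is a hypothetical helper name, and the short stopword set is a stand-in for the full lists in the question. Sorting the words and skipping empty strings are my additions, intended to make the pattern independent of set iteration order and free of empty alternatives.

```python
import re

def build_stopword_regex(stopwords):
    """Compile one alternation that matches any stopword as a whole word."""
    # Sort for a deterministic pattern (longest first, then alphabetical)
    # and skip empty strings, which would create an empty alternative.
    words = sorted((w for w in stopwords if w), key=lambda w: (-len(w), w))
    # re.escape makes characters such as "|" match literally.
    escaped = [re.escape(w) for w in words]
    return re.compile(r'\b(?:' + '|'.join(escaped) + r')\b')

stopwords = {"le", "les", "vrai", "tres", "time", "o|"}
regex = build_stopword_regex(stopwords)
cleaned = regex.sub('', "le vrai commentaire sur les vous tres time foobar")
print(cleaned)
```

Here "o|" ends up in the pattern as a literal `o\|`, so it no longer introduces an empty branch.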

Mikhail M.