
I'm building a little 'trending' algorithm. The tokeniser works as originally intended, bar a couple of hiccups around URLs, which are causing some problems.

Obviously, as I'm pulling info from Twitter, there are a lot of t.co URL-shortener links. I'd like to remove these as they aren't 'words', preferably at the tokeniser stage, but I'm currently filtering them out after the fact. I don't think I can run the tokens against a whitelist of recognisable English words because, again, this is Twitter, with contractions and so on.

My code that wraps around the function that pulls the top 10 most common words in any given period is:

tweets = Tweet.objects.filter(lang='en', created_at__gte=start, created_at__lte=end)
number_of_tweets = tweets.count()
most_popular = trending.run_all(start, end, "word").keys()[:10]
print "BEFORE", most_popular
for i, thing in enumerate(most_popular):
    try:
        if "/" in thing:
            most_popular.remove(thing)
            print i, thing, "Removed it."
    except UnicodeEncodeError, e:
        print "Unicode error", e
        most_popular.remove(thing)
print "NOW", most_popular

That try/except block should, in theory, remove any token containing a URL from the list, except it doesn't: I'm always left with a couple.

Running trending.run_all on a time period gives, for example:

[u'//t.co/r6gkL104ai/nKate', u'EXPLAIN', u'\U0001f62b\U0001f62d/nRT', u'woods', u'hanging', u'ndtv/nRT', u'BenDohertyCorro', u'\u0928\u093f\u0930\u094d\u0926\u094b\u0937_\u092c\u093e\u092a\u0942\u2026/nPolice', u'LAST', u'health/nTime']

Running the rest of the code, imported into the Python command line, gives:

0 //t.co/r6gkL104ai/nKate Removed it
1 /nRT Removed it
2 hanging 
3 ndtv/nRT Removed it
4 निर्दोष_बापू…/nPolice Removed it
5 health/nTime Removed it
6 Western
7 //t.co/4dhGoBpzR0 Removed it
8 //t.co/TkHhI7n…/nRT Removed it
9 //t.co/WmWkcG1dOz/nRT Removed it
10 bringing 
 ...
32 kids

NOW [u'EXPLAIN', u'woods', u'hanging', u'BenDohertyCorro', u'LAST', u'scolo', u'Western', u'//t.co/jB0TWYAJSI/nMe', u'BREAKINGNEWS', u'//t.co/9gYG8y5OKK', u'bringing', u'Valls', u'advices', u'Signatures', u'//t.co/vmQfyenXp4/nJury', u'strengthandcondition\u2026', u'HAPPENED', u'\u2705', u'\U0001f60f', u'//t.co/5JR8RXsJ87/nIs', u'Hamilton', u'Logging', u'Happening', u'Foundation', u'//t.co/gC959Q43QD/nRT', u'ISIS=CIA', u'Footnotes', u'ARYNEWSOFFICIAL', u'LoveMyLife', u'-they', u'B\xf6rse', u'InfoTerrorism', u'kids']

So for some reason, that little hunk isn't consistently cutting them out. This causes a particular problem with reverse lookups in Django, as I intend to use the top X phrases in a period as clickable links. A URL-like token breaks the lookup completely, and there (rightly) doesn't seem to be a way to catch that exception in the template, so I'd rather take care of this in the view.

Withnail

1 Answer


It seems to me that the issue is that you are removing items from a list while iterating over it. The solution is simple: iterate over a copy of your list:

for i, thing in enumerate(most_popular[:]):

Notice the '[:]', which creates a copy of your list.

The reason for this behavior can be found in this post.
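
To make the skipping visible, here is a minimal sketch, written for Python 3 rather than the Python 2 of the question. The token list and the two helper names (`remove_in_place`, `remove_via_copy`) are made up for illustration. Removing an item shifts every later item one place to the left, so the loop's internal index steps over whatever slid into the freed slot. That would explain why some URL tokens survive, especially ones that sit right after another removed token.

```python
# Hypothetical sample data, loosely modelled on the tokens in the question.
tokens = ["//t.co/a", "//t.co/b", "woods", "ndtv/nRT", "health/nTime", "kids"]


def remove_in_place(words):
    """Buggy version: removes from the same list it is iterating over."""
    words = list(words)  # work on a private copy of the caller's data
    for word in words:
        if "/" in word:
            words.remove(word)  # shifts later items left; the next one is skipped
    return words


def remove_via_copy(words):
    """Fixed version: iterates over words[:] while removing from words."""
    words = list(words)
    for word in words[:]:  # the slice is a separate list, so no item is skipped
        if "/" in word:
            words.remove(word)
    return words


buggy = remove_in_place(tokens)
fixed = remove_via_copy(tokens)
print(buggy)  # ['//t.co/b', 'woods', 'health/nTime', 'kids']
print(fixed)  # ['woods', 'kids']
```

In the buggy run, `//t.co/b` and `health/nTime` survive because each one directly follows an item that was just removed.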

Tom
  • The first line there should be: `for i, thing in enumerate(most_popular[:]):` - this does now throw a different error, `list.remove(x): x not in list`, which suggests this isn't quite the right answer yet. – Withnail Jun 07 '15 at 20:43
  • Do you need the "enumerate" object? You can iterate on the list without it. – Tom Jun 08 '15 at 00:13
  • You're right, I just had that for debugging purposes, it's still throwing the x not in list error, though. – Withnail Jun 08 '15 at 07:12
  • Can you check/print what element in the list creates this error? – Tom Jun 08 '15 at 07:32
  • What I eventually got to work was: `for thing in most_popular[:]: try: if "/" in thing: most_popular.remove(thing) print thing, "Removed it." elif r'\'' in thing: most_popular.remove(thing) print thing, "Removed it." except UnicodeEncodeError, e: try: most_popular.remove(thing) except ValueError: pass except ValueError: pass` (ugh, formatting in comments) – Withnail Jun 08 '15 at 08:57
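
A possible reading of the `x not in list` error in the comments above, offered as an assumption rather than a confirmed diagnosis: in the original loop the `remove` succeeds, the `print` that follows raises `UnicodeEncodeError` on a non-ASCII token, and the `except` handler then calls `remove` a second time on an item that is already gone. One way to sidestep both problems is to build a new list instead of calling `remove` at all. The sketch below is Python 3, and `drop_url_like` is a hypothetical helper name; the sample tokens are a subset of those shown in the question.

```python
def drop_url_like(words):
    """Return a new list without the tokens that contain a slash."""
    return [word for word in words if "/" not in word]


# A few tokens copied from the question's example output.
most_popular = [
    u"//t.co/r6gkL104ai/nKate",
    u"EXPLAIN",
    u"woods",
    u"ndtv/nRT",
    u"B\xf6rse",
]

cleaned = drop_url_like(most_popular)

# Work out what was dropped separately, so that any debug printing happens
# after the filtering and cannot interfere with it.
removed = [word for word in most_popular if word not in cleaned]
```

Because nothing is removed from the list being looped over, no item is skipped, and because `remove` is never called, it cannot raise `ValueError`.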