Removing strings from list that start with certain expressions

Question

I have a list of strings associated with twitter hashtags. I want to remove entire strings that begin with certain prefixes.

For example:

testlist = ['Just caught up with #FlirtyDancing. Just so cute! Loved it. ', 'After work drinks with this one @MrLukeBenjamin no dancing tonight though @flirtydancing @AshleyBanjo #FlirtyDancing pic.twitter.com/GJpRUZxUe8', 'Only just catching up and @AshleyBanjo you are gorgeous #FlirtyDancing', 'Loved working on this. Always a pleasure getting to assist the wonderful @kendrahorsburgh on @ashleybanjogram wonderful new show !! #flirtydancing pic.twitter.com/URMjUcgmyi', 'Just watching #FlirtyDancing & \n@AshleyBanjo what an amazing way to meet someone.. It made my heart all warm & fuzzy for these people! both couples meet back up.. pic.twitter.com/iwCLRmAi5n',]

I would like to remove the picture URLs, the hashtags, and the @ mentions.

I have tried a few things so far, namely using the startswith() method and the replace() method.

For example:

prefixes = ['pic.twitter.com', '#', '@']
bestlist = []

for line in testlist:
    for word in prefixes:
        line = line.replace(word,"")
        bestlist.append(line)

This removes the 'pic.twitter.com' part, but not the series of letters and numbers at the end of the URL. That suffix is dynamic and will be different each time, which is why I want to remove the entire word whenever it begins with one of the prefixes.

I also tried tokenizing everything, but replace() still won't get rid of the entire word:

import nltk 

for line in testlist:
    tokens = nltk.tokenize.word_tokenize(line)
    for token in tokens:
        for word in prefixes:
            if token.startswith(word):
                token = token.replace(word,"")
                print(token)

I am starting to lose hope in the startswith() method and the replace() method, and feel I might be barking up the wrong tree with these two.

Is there a better way to go about this? How can I achieve the desired result of removing all strings beginning with #, @, and pic.twitter?

score 4 · Accepted Answer · answered Mar 21 '19 at 06:29

You can use a regular expression to describe the kinds of words you want to remove and apply it with re.sub:

import re

testlist = ['Just caught up with #FlirtyDancing. Just so cute! Loved it. ', 'After work drinks with this one @MrLukeBenjamin no dancing tonight though @flirtydancing @AshleyBanjo #FlirtyDancing pic.twitter.com/GJpRUZxUe8', 'Only just catching up and @AshleyBanjo you are gorgeous #FlirtyDancing', 'Loved working on this. Always a pleasure getting to assist the wonderful @kendrahorsburgh on @ashleybanjogram wonderful new show !! #flirtydancing pic.twitter.com/URMjUcgmyi', 'Just watching #FlirtyDancing & \n@AshleyBanjo what an amazing way to meet someone.. It made my heart all warm & fuzzy for these people! both couples meet back up.. pic.twitter.com/iwCLRmAi5n',]
regexp = r'pic\.twitter\.com\S+|@\S+|#\S+'

res = [re.sub(regexp, '', sent) for sent in testlist]
# Print one cleaned tweet per line rather than the list's repr
for sent in res:
    print(sent)

Output

Just caught up with  Just so cute! Loved it. 
After work drinks with this one  no dancing tonight though    
Only just catching up and  you are gorgeous 
Loved working on this. Always a pleasure getting to assist the wonderful  on  wonderful new show !!  
Just watching  & 
 what an amazing way to meet someone.. It made my heart all warm & fuzzy for these people! both couples meet back up..
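
The output above still contains doubled spaces where tokens were removed. A minimal sketch of one way to tidy that up, assuming it is acceptable to collapse every run of whitespace (including the embedded newline) into a single space. The names `TOKEN_RE` and `strip_tokens` are hypothetical and not part of the answer:

```python
import re

# Same alternatives as the pattern above, compiled once for reuse.
TOKEN_RE = re.compile(r'pic\.twitter\.com\S+|@\S+|#\S+')

def strip_tokens(text):
    """Remove picture URLs, mentions, and hashtags, then tidy whitespace."""
    without_tokens = TOKEN_RE.sub('', text)
    # split() with no arguments splits on any whitespace run, so joining
    # with single spaces collapses the gaps left by the removed tokens.
    return ' '.join(without_tokens.split())

sample = 'Only just catching up and @AshleyBanjo you are gorgeous #FlirtyDancing'
print(strip_tokens(sample))
```

For the sample string this should print the sentence with single spaces and no trailing whitespace.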

Leonid · Answer 2 · 2019-03-21T06:38:35.897

This solution does not use regex or any additional imports.

prefixes = ['pic.twitter.com', '#', '@']
testlist = ['Just caught up with #FlirtyDancing. Just so cute! Loved it. ', 'After work drinks with this one @MrLukeBenjamin no dancing tonight though @flirtydancing @AshleyBanjo #FlirtyDancing pic.twitter.com/GJpRUZxUe8', 'Only just catching up and @AshleyBanjo you are gorgeous #FlirtyDancing', 'Loved working on this. Always a pleasure getting to assist the wonderful @kendrahorsburgh on @ashleybanjogram wonderful new show !! #flirtydancing pic.twitter.com/URMjUcgmyi', 'Just watching #FlirtyDancing & \n@AshleyBanjo what an amazing way to meet someone.. It made my heart all warm & fuzzy for these people! both couples meet back up.. pic.twitter.com/iwCLRmAi5n',]


def iter_tokens(line):
    for word in line.split():
        if not any(word.startswith(prefix) for prefix in prefixes):
            yield word

for line in testlist:
    row = list(iter_tokens(line))
    print(' '.join(row))

This yields the following result:

python test.py 
Just caught up with Just so cute! Loved it.
After work drinks with this one no dancing tonight though
Only just catching up and you are gorgeous
Loved working on this. Always a pleasure getting to assist the wonderful on wonderful new show !!
Just watching & what an amazing way to meet someone.. It made my heart all warm & fuzzy for these people! both couples meet back up..
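
As a side note, `str.startswith` also accepts a tuple of prefixes, so the `any(...)` generator is not strictly needed. A sketch of the same approach under that simplification; `PREFIXES` and `drop_prefixed_words` are illustrative names, not from the answer:

```python
PREFIXES = ('pic.twitter.com', '#', '@')  # startswith needs a tuple, not a list

def drop_prefixed_words(line):
    """Keep only the whitespace-separated words that start with none of PREFIXES."""
    return ' '.join(word for word in line.split() if not word.startswith(PREFIXES))

line = 'Just watching #FlirtyDancing & \n@AshleyBanjo what an amazing way to meet someone..'
print(drop_prefixed_words(line))
```

Because `split()` is called without arguments, the embedded newline is treated like any other whitespace, just as in the answer's version.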

score 1 · Answer 3 · edited Oct 07 '21 at 11:07

You need to match using regular expressions rather than static strings. str.replace does not recognize regular expressions, so use re.sub instead. To remove URLs like the ones you describe from a single string s, you would need something like the following:

import re
# Raw string; the character class lists the characters allowed to follow the domain
re.sub(r'pic\.twitter\.com[a-zA-Z0-9,.\-!/()=?`*;:_{}\[\]\|~%-]*', '', s)

To match tags, replies, and URLs, you can perform successive sub operations or combine all the regular expressions into a single expression. The former is better if you have many patterns, and should be combined with re.compile.

Note that the pattern above only matches URLs with domain twitter.com and sub-domain pic. To match any URL, you'll have to augment the regex with the appropriate match pattern. Possibly see this post.
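
The successive sub operations mentioned above might look like the following sketch. The individual patterns are simplified assumptions (`\S*` instead of an RFC 3986 character class), and `PATTERNS` and `clean` are hypothetical names:

```python
import re

# One compiled pattern per kind of token; each is applied in turn.
PATTERNS = [
    re.compile(r'pic\.twitter\.com\S*'),  # picture URLs
    re.compile(r'#\S+'),                  # hashtags
    re.compile(r'@\S+'),                  # mentions and replies
]

def clean(s):
    """Apply every pattern in PATTERNS to s, removing each match."""
    for pattern in PATTERNS:
        s = pattern.sub('', s)
    return s

s = 'new show !! #flirtydancing pic.twitter.com/URMjUcgmyi'
print(repr(clean(s)))
```

The surrounding whitespace is left in place, so you may want to call `strip()` on the result or normalize spaces afterwards.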

edit: generalized the regular expression according to RFC 3986 as per I.Am.A.Guy's comment.

Nice catch. Updated with a more robust regex. – pkfm, Mar 21 '19 at 08:14

score 1 · Answer 4 · answered Mar 21 '19 at 07:02

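prefixes = {'pic.twitter.com', '#', '@'}  # use sets for faster lookups

def clean_tweet(tweet):
    # Drop a word if its first 15 characters are the URL prefix
    # ('pic.twitter.com' is 15 characters long) or its first character is '#' or '@'.
    return " ".join(word for word in tweet.split()
                    if word[:15] not in prefixes and word[0] not in prefixes)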

Or look at:

https://www.nltk.org/api/nltk.tokenize.html

TweetTokenizer can solve many of your problems.
