I have a list of strings (tweets) associated with Twitter hashtags. Within each string, I want to remove every whole word that begins with certain prefixes.
For example:
testlist = ['Just caught up with #FlirtyDancing. Just so cute! Loved it. ', 'After work drinks with this one @MrLukeBenjamin no dancing tonight though @flirtydancing @AshleyBanjo #FlirtyDancing pic.twitter.com/GJpRUZxUe8', 'Only just catching up and @AshleyBanjo you are gorgeous #FlirtyDancing', 'Loved working on this. Always a pleasure getting to assist the wonderful @kendrahorsburgh on @ashleybanjogram wonderful new show !! #flirtydancing pic.twitter.com/URMjUcgmyi', 'Just watching #FlirtyDancing & \n@AshleyBanjo what an amazing way to meet someone.. It made my heart all warm & fuzzy for these people! both couples meet back up.. pic.twitter.com/iwCLRmAi5n',]
I would like to remove the picture URLs, the hashtags, and the @mentions.
I have tried a few things so far, namely the startswith() and replace() methods.
For example:
    prefixes = ['pic.twitter.com', '#', '@']
    bestlist = []
    for line in testlist:
        for word in prefixes:
            # Removes only the prefix text itself, not the rest of the word
            line = line.replace(word, "")
        bestlist.append(line)
This removes the 'pic.twitter.com' part, but not the series of letters and numbers at the end of the URL. That trailing part is dynamic and differs for every URL, which is why I want to remove the entire word whenever it begins with one of the prefixes.
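To show the behaviour I am describing, here is a minimal reproduction on a single shortened sample (the variable names `sample` and `stripped` are just for illustration):

```python
sample = "new show !! #flirtydancing pic.twitter.com/URMjUcgmyi"

# str.replace removes only the exact substring it is given,
# so the rest of the URL after the prefix is left behind.
stripped = sample.replace("pic.twitter.com", "")
print(stripped)  # new show !! #flirtydancing /URMjUcgmyi
```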
I also tried tokenizing everything, but replace() still won't get rid of the entire word:
    import nltk

    for line in testlist:
        tokens = nltk.tokenize.word_tokenize(line)
        for token in tokens:
            for word in prefixes:
                if token.startswith(word):
                    # Still strips only the prefix, leaving the rest of the token
                    token = token.replace(word, "")
            print(token)
I am starting to lose hope in the startswith() and replace() methods, and feel I might be barking up the wrong tree with these two.
Is there a better way to go about this? How can I achieve the desired result of removing every word beginning with #, @, or pic.twitter.com?
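For reference, one direction I have been considering is to split each line on whitespace and keep only the words that do not start with a prefix. This is only a rough sketch: the helper name `drop_prefixed_words` is made up for illustration, and it assumes that collapsing runs of whitespace (including the newline in the last tweet) is acceptable, and that punctuation attached to a removed word, such as the full stop in `#FlirtyDancing.`, can be dropped along with it.

```python
prefixes = ("pic.twitter.com", "#", "@")  # str.startswith accepts a tuple

def drop_prefixed_words(line, prefixes=prefixes):
    """Return line without any whitespace-separated word that starts with a prefix."""
    kept = [word for word in line.split() if not word.startswith(prefixes)]
    return " ".join(kept)

testlist = [
    "Just caught up with #FlirtyDancing. Just so cute! Loved it. ",
    "Only just catching up and @AshleyBanjo you are gorgeous #FlirtyDancing",
    "wonderful new show !! #flirtydancing pic.twitter.com/URMjUcgmyi",
]
bestlist = [drop_prefixed_words(line) for line in testlist]
for line in bestlist:
    print(line)
```

With the three shortened samples above, this prints the tweets without the hashtags, mentions, or picture URLs, but I am not sure whether it is the idiomatic way to do it or whether a regular expression would be more robust.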