I am using tweepy to get tweets pertaining to a certain hashtag(s) and then I send them to a certain black box for some processing. However, tweets containing any URL should not be sent. What would be the most appropriate way of removing any such tweets?
The simplest solution would probably be to exclude any tweet containing `https://`, `http://`, or `www.`, but that is clearly far from perfect. – Antry Apr 26 '18 at 13:50
Remove the URLs with regex. – Colin Ricardo Apr 26 '18 at 13:51
3 Answers
To follow up on @Colin's suggestion, this question covers the issue of finding URLs with regex.
An example code snippet would be:
import re

# tweet_list is a list of strings (tweet texts) you wish to filter
pattern = r'https?://(?:[-\w.]|(?:%[\da-fA-F]{2}))+'
# keep only the tweets in which the pattern finds no URL
filtered_tweet_list = [tweet for tweet in tweet_list if not re.search(pattern, tweet)]
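For a quick check of the idea, here is a minimal self-contained sketch. It assumes the tweets are already plain strings, and it swaps in a looser pattern that also catches bare `www.` hosts, as suggested in the comments. The names `URL_PATTERN`, `drop_tweets_with_urls`, and `sample` are made up for illustration, and the pattern is a heuristic rather than a complete URL matcher.

```python
import re

# Heuristic: anything starting with http://, https://, or www. up to the next whitespace.
URL_PATTERN = re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)


def drop_tweets_with_urls(tweets):
    """Return only the tweet texts that contain no URL-like substring."""
    return [text for text in tweets if not URL_PATTERN.search(text)]


# Hypothetical sample data standing in for real tweet texts.
sample = [
    "Loving the #python community",
    "Read this https://t.co/abc123 #python",
    "Docs at www.example.com #python",
]

print(drop_tweets_with_urls(sample))
```

Compiling the pattern once avoids re-parsing it for every tweet, and `search` stops at the first match, which is all that is needed to decide whether to drop a tweet.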

Pax Vobiscum
You can also exclude tweets with URLs when querying:
if 'https://' not in tweet.text:
    pass  # do something, e.g. get the tweet or, in your case, send the tweet
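The same check can be wrapped in a small generator so it works while iterating over results. This is only a sketch: it assumes each tweet object exposes a `text` attribute as in the snippet above, uses `SimpleNamespace` objects as stand-ins for real tweets, and the names `URL_MARKERS`, `has_url`, and `tweets_without_urls` are hypothetical.

```python
from types import SimpleNamespace

# Substrings treated as evidence of a URL (a rough heuristic, not a full URL parser).
URL_MARKERS = ("http://", "https://", "www.")


def has_url(text):
    """Return True if the text contains any of the URL markers, ignoring case."""
    lowered = text.lower()
    return any(marker in lowered for marker in URL_MARKERS)


def tweets_without_urls(tweets):
    """Yield only the tweet objects whose .text contains no URL marker."""
    for tweet in tweets:
        if not has_url(tweet.text):
            yield tweet


# Stand-in objects; real tweets would come from the API instead.
fake_tweets = [
    SimpleNamespace(text="Plain #news tweet"),
    SimpleNamespace(text="Link http://t.co/xyz #news"),
]

kept = [tweet.text for tweet in tweets_without_urls(fake_tweets)]
print(kept)
```

Because it is a generator, tweets can be forwarded for processing one at a time without building an intermediate list.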

Lyrax
This does not answer the question. Hence, it should be removed... came here from review – finnmglas Sep 15 '20 at 18:47
I know, hence my use of 'also'. My answer is intended to help other programmers coming here to get ideas for editing their scripts to fit this need. Also, this same method can be used on already scraped tweets to exclude those with URLs! – Lyrax Sep 16 '20 at 20:08