0

I am trying to scrape tweets containing a hashtag, and I only want tweets in Arabic. However, I still get tweets in all languages. Can anyone help, please?

import snscrape.modules.twitter as sntwitter

query = ["#Covid19", 'lang: ar']
tweets = []
limit = 5000

for tweet in sntwitter.TwitterSearchScraper(query).get_items():
  if len(tweets) == limit:
    break
  else:
    tweets.append([tweet.date, tweet.username, tweet.content])
Huda Mg

2 Answers

0

One way to approach this is to check each tweet yourself: either test whether tweet.content contains text from another language or, better, test whether the text is Arabic. To check whether text is Arabic, you can use the langdetect module. Here is a simple example of how to use it.

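from langdetect import detect, DetectorFactory

# Fix the seed so that detection gives the same result on every run
DetectorFactory.seed = 0

# detect() returns a language code string for the given text
print(detect('今一はお前さん'))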

For more details, see: Language detection with python
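As a rough sketch of how that check could be applied to a whole list of tweet texts, the helper below keeps only the texts whose detected language matches a target code. The name keep_language and its detect_fn parameter are hypothetical and not part of langdetect. The detector is passed in as an argument, so the filtering logic does not depend on any particular library.

```python
from typing import Callable, Iterable, List


def keep_language(
    texts: Iterable[str],
    detect_fn: Callable[[str], str],
    lang: str = "ar",
) -> List[str]:
    """Return only the texts whose detected language code equals `lang`.

    `detect_fn` is any function mapping a string to a language code
    (hypothetical parameter; langdetect.detect has this shape).
    """
    kept = []
    for text in texts:
        try:
            code = detect_fn(text)
        except Exception:
            # Detectors can fail on text with nothing to analyse
            # (for example only emoji or URLs); skip such texts.
            continue
        if code == lang:
            kept.append(text)
    return kept
```

With langdetect installed, langdetect.detect could be passed as detect_fn, for example keep_language([row[2] for row in tweets], detect), where row[2] is the tweet.content column of the list built in the question.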

0

Twitter offers a myriad of advanced search operators.

The one you're looking for is likely lang:ar, for Arabic.

I see that you're already using that! But you cannot include a space between the colon and the language code. lang: ar will not work.

Twitter's search is weird, so you might get incomplete results or too many results. If you go this route, you may want to use langdetect or similar (as mentioned in another answer) to complement this solution.
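The fix described above can be sketched as follows. This assumes that TwitterSearchScraper expects the whole query as a single string, so the hashtag and the lang: operator are joined with a space rather than passed as a list. The helper names build_query and scrape are hypothetical, and the tweet attributes are the ones used in the question, which may differ between snscrape versions.

```python
from itertools import islice


def build_query(hashtag: str, lang: str) -> str:
    """Join a hashtag and a language operator into one search string.

    No space is allowed between "lang:" and the language code.
    """
    return f"{hashtag} lang:{lang.strip()}"


def scrape(query: str, limit: int) -> list:
    """Collect up to `limit` tweets for `query` (hypothetical helper)."""
    # Imported here so build_query works without snscrape installed.
    import snscrape.modules.twitter as sntwitter

    scraper = sntwitter.TwitterSearchScraper(query)
    # islice stops after `limit` items, replacing the manual counter.
    return [
        [tweet.date, tweet.username, tweet.content]
        for tweet in islice(scraper.get_items(), limit)
    ]


query = build_query("#Covid19", "ar")
print(query)  # prints: #Covid19 lang:ar
```

Calling scrape(query, 5000) would then mirror the loop in the question, with the limit handled by islice.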

TheTechRobo the Nerd