
I have a file with almost 2000 tweets in English. It looks like this:

{"data":[{"no.":"1241583652212862978","created":"2020-03-22T04:33:04.000Z","tweet":"@OHAOregon My friend says we should not reuse masks to combat coronavirus, is that correct?"},{"no.":"1241583655538941959","created":"2020-03-22T04:33:05.000Z","tweet":" I know it’s from a few days ago, but these books are in good shape}, .......]}

I want to extract only the tweet text from this file, ignoring the "no." and "created" fields. How can I do that? Any suggestions would be helpful. Thanks in advance.

  • Does this answer your question? [Reading JSON from a file?](https://stackoverflow.com/questions/20199126/reading-json-from-a-file) – Rakesh Sep 01 '20 at 04:06
  • Hi @Rakesh, thanks for the reply, but that doesn't answer my question. I'm trying to solve this using only the 're' package, so that doesn't help me much. – sigma.A Sep 01 '20 at 04:12
  • You do not need regex here; it's a JSON file. You can access the required info by key. – Rakesh Sep 01 '20 at 04:15
  • @Rakesh, the file is a '.txt' file, not a '.json' file. I have to use regex according to the question I'm solving. – sigma.A Sep 01 '20 at 04:23

2 Answers


Your file's content is in JSON format, whatever its extension. Python's json library can parse it so that you can extract the tweets: https://docs.python.org/3/library/json.html
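As a minimal sketch of that suggestion, assuming the file holds one JSON object with a top-level "data" list as in the question: the function name `load_tweets` and the file name `tweets.txt` are made up for illustration, and the demo writes a small stand-in file so the snippet runs on its own.

```python
import json
import os
import tempfile


def load_tweets(path):
    """Parse the file at `path` as JSON and return every "tweet" string."""
    with open(path, encoding="utf-8") as f:
        payload = json.load(f)  # the extension is irrelevant; only the content matters
    return [record["tweet"] for record in payload["data"]]


# Stand-in content with the same shape as the sample in the question.
sample = (
    '{"data":[{"no.":"1241583652212862978",'
    '"created":"2020-03-22T04:33:04.000Z",'
    '"tweet":"@OHAOregon My friend says we should not reuse masks '
    'to combat coronavirus, is that correct?"}]}'
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "tweets.txt")  # hypothetical file name
    with open(path, "w", encoding="utf-8") as f:
        f.write(sample)
    tweets = load_tweets(path)

print(tweets)
```

With a real file you would call `load_tweets` on its path directly. This only works if the whole file is valid JSON; a truncated or hand-edited file would raise `json.JSONDecodeError`.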

wildener
  • Hi @wildener, is there a possibility of solving this using regular expressions? – sigma.A Sep 01 '20 at 03:44
  • Well, JSON is by far the best solution, but yes, you can use this pattern: \"tweet\":\"(.*?)\"} Check it here: https://regex101.com/r/qfbjgY/1 – wildener Sep 07 '20 at 11:22
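For reference, here is a rough sketch of how the pattern from the comment above might be applied with only the `re` module. The sample string mirrors the question's data (with the second record's closing quote restored so the snippet is well formed), and the name `TWEET_PATTERN` is just for illustration.

```python
import re

# Same pattern as in the comment: lazily capture everything between
# "tweet":" and the first following "} sequence.
TWEET_PATTERN = re.compile(r'"tweet":"(.*?)"\}')

text = (
    '{"data":[{"no.":"1241583652212862978",'
    '"created":"2020-03-22T04:33:04.000Z",'
    '"tweet":"@OHAOregon My friend says we should not reuse masks '
    'to combat coronavirus, is that correct?"},'
    '{"no.":"1241583655538941959",'
    '"created":"2020-03-22T04:33:05.000Z",'
    '"tweet":" I know it’s from a few days ago, '
    'but these books are in good shape"}]}'
)

# findall returns only the captured group, i.e. the tweet text.
tweets = TWEET_PATTERN.findall(text)

for tweet in tweets:
    print(tweet)
```

This relies on "tweet" being the last key in each record and on no tweet containing the characters `"}` itself (for example an escaped quote followed by a brace), in which case the match would end early. JSON escapes such as `\"` or `\n` inside a tweet are also left undecoded, which is one reason the json module is the safer route.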

Assuming d holds the parsed object, getting the first tweet is as simple as:

tweet = d["data"][0]["tweet"]

If it helps, here is a working example I ran in the shell using your sample data:

>>> d = {'data': [{'no.': '1241583652212862978', 'created': '2020-03-22T04:33:04.000Z', 'tweet': '@OHAOregon My friend says we should not reuse masks to combat coronavirus, is that correct?'}, {'no.': '1241583655538941959', 'created': '2020-03-22T04:33:05.000Z', 'tweet': ' I know it’s from a few days ago, but these books are in good shape'}]}
>>> print(d["data"])
[{'no.': '1241583652212862978', 'created': '2020-03-22T04:33:04.000Z', 'tweet': '@OHAOregon My friend says we should not reuse masks to combat coronavirus, is that correct?'}, {'no.': '1241583655538941959', 'created': '2020-03-22T04:33:05.000Z', 'tweet': ' I know it’s from a few days ago, but these books are in good shape'}]
>>> print(d["data"][0]["tweet"])
@OHAOregon My friend says we should not reuse masks to combat coronavirus, is that correct?
>>> 
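Indexing with [0] only returns the first tweet. To collect all of them, one option is to loop over d["data"]; a small sketch using the same two records (the variable name `all_tweets` is arbitrary):

```python
d = {
    "data": [
        {
            "no.": "1241583652212862978",
            "created": "2020-03-22T04:33:04.000Z",
            "tweet": "@OHAOregon My friend says we should not reuse masks "
                     "to combat coronavirus, is that correct?",
        },
        {
            "no.": "1241583655538941959",
            "created": "2020-03-22T04:33:05.000Z",
            "tweet": " I know it’s from a few days ago, "
                     "but these books are in good shape",
        },
    ]
}

# One entry per record, in the same order as the file.
all_tweets = [item["tweet"] for item in d["data"]]

for tweet in all_tweets:
    print(tweet)
```

This assumes every record has a "tweet" key; if some might not, `item.get("tweet")` would avoid a KeyError at the cost of possible None entries.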
Dharman
Dan Alexander