Extracting data from text files

Question

I have a file with almost 2,000 tweets in English. It looks like this:

{"data":[{"no.":"1241583652212862978","created":"2020-03-22T04:33:04.000Z","tweet":"@OHAOregon My friend says we should not reuse masks to combat coronavirus, is that correct?"},{"no.":"1241583655538941959","created":"2020-03-22T04:33:05.000Z","tweet":" I know it’s from a few days ago, but these books are in good shape}, .......]}

I want to extract only the tweet text from this file. How can I do that? Any suggestions would be helpful. Thanks in advance.

Does this answer your question? [Reading JSON from a file?](https://stackoverflow.com/questions/20199126/reading-json-from-a-file) – Rakesh, Sep 01 '20 at 04:06
Hi @Rakesh, thanks for the reply, but that doesn't solve my problem. I'm trying to solve this using only the 're' package, so it doesn't help me much. – sigma.A, Sep 01 '20 at 04:12
You do not need regex here. It's a JSON file, so you can access the required info by key. – Rakesh, Sep 01 '20 at 04:15
@Rakesh, the file is a '.txt' file, not a '.json' file. I have to use regex according to the exercise I'm solving. – sigma.A, Sep 01 '20 at 04:23

score 0 · Answer 1 · answered Sep 01 '20 at 03:34

Your file is in JSON format. Python's json library can parse it, after which you can extract the tweets: https://docs.python.org/3/library/json.html
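A minimal sketch of that approach, assuming the whole file is one valid JSON object shaped like the sample in the question. The file name tweets.txt and the helper load_tweets are hypothetical, and the demo writes a small stand-in file so the snippet runs on its own. The '.txt' extension does not matter to the parser; only the content does.

```python
import json
import os
import tempfile


def load_tweets(path):
    """Parse the file as JSON and return every "tweet" value in order."""
    with open(path, encoding="utf-8") as f:
        payload = json.load(f)
    return [item["tweet"] for item in payload["data"]]


# Stand-in data in the same shape as the question's file (values shortened).
sample = {"data": [
    {"no.": "1", "created": "2020-03-22T04:33:04.000Z", "tweet": "first tweet"},
    {"no.": "2", "created": "2020-03-22T04:33:05.000Z", "tweet": "second tweet"},
]}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "tweets.txt")  # hypothetical file name
    with open(path, "w", encoding="utf-8") as f:
        json.dump(sample, f)
    tweets = load_tweets(path)

print(tweets)
```

For the real file, calling load_tweets with its path should be enough, provided the content parses as JSON.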

wildener

Hi @wildener, is there a possibility of solving this using regular expressions? – sigma.A Sep 01 '20 at 03:44
Well, JSON is by far the best solution, but yes, you can use this pattern: \"tweet\":\"(.*?)\"} Check it here: https://regex101.com/r/qfbjgY/1 – wildener Sep 07 '20 at 11:22
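For reference, a rough sketch of how the pattern from this comment could be used with Python's re module. The helper name extract_tweets and the shortened sample string are made up for illustration. The pattern assumes "tweet" is the last key of each object, as in the question's sample; a tweet that itself contains the sequence "} would be cut short, and JSON escape sequences in the captured text are left undecoded.

```python
import re

# Pattern from the comment: capture everything between "tweet":" and the closing "}
TWEET_RE = re.compile(r'"tweet":"(.*?)"\}')


def extract_tweets(text):
    """Return the captured tweet strings from the raw file content."""
    return TWEET_RE.findall(text)


# Shortened sample in the same shape as the question's file.
sample = (
    '{"data":[{"no.":"1","created":"2020-03-22T04:33:04.000Z",'
    '"tweet":"@OHAOregon is that correct?"},'
    '{"no.":"2","created":"2020-03-22T04:33:05.000Z",'
    '"tweet":" these books are in good shape"}]}'
)

tweets = extract_tweets(sample)
print(tweets)
```

With the real file, the text would come from something like open(path, encoding="utf-8").read(), where path is whatever the file is called.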

score 0 · Answer 2 · edited Sep 01 '20 at 22:33

Assuming d holds the parsed object, getting the first tweet is as simple as:

tweet = d["data"][0]["tweet"]

If it helps, here is a working example I ran in the shell using your sample data:

>>> d = {'data': [{'no.': '1241583652212862978', 'created': '2020-03-22T04:33:04.000Z', 'tweet': '@OHAOregon My friend says we should not reuse masks to combat coronavirus, is that correct?'}, {'no.': '1241583655538941959', 'created': '2020-03-22T04:33:05.000Z', 'tweet': ' I know it’s from a few days ago, but these books are in good shape'}]}
>>> print(d["data"])
[{'no.': '1241583652212862978', 'created': '2020-03-22T04:33:04.000Z', 'tweet': '@OHAOregon My friend says we should not reuse masks to combat coronavirus, is that correct?'}, {'no.': '1241583655538941959', 'created': '2020-03-22T04:33:05.000Z', 'tweet': ' I know it’s from a few days ago, but these books are in good shape'}]
>>> print(d["data"][0]["tweet"])
@OHAOregon My friend says we should not reuse masks to combat coronavirus, is that correct?
>>>
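The session above only prints the first entry. A small sketch of collecting every tweet from the same kind of object, assuming each entry has a "tweet" key; the tweet texts here are placeholders and the name all_tweets is illustrative.

```python
# Abbreviated stand-in for the d built in the shell session above.
d = {
    "data": [
        {"no.": "1241583652212862978", "tweet": "first tweet text"},
        {"no.": "1241583655538941959", "tweet": "second tweet text"},
    ]
}

# d["data"] is a list of dicts, so iterate over it instead of indexing with [0].
all_tweets = [entry["tweet"] for entry in d["data"]]

for number, tweet in enumerate(all_tweets, start=1):
    print(number, tweet)
```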
