Removing unnecessary details from Twitter's extended_tweet column in JSON/Python

Question

I used a Twitter scraper to download tweets about a sporting event that has already taken place. Unfortunately, due to the nature of the research, I cannot go back and modify my scraper, as the event will not occur again. Each tweet is divided into several categories, such as timestamp, date_created, etc.

These tweets are stored in a JSON file, and I am currently importing them into pandas.

What I am focusing on is the text and extended_tweet categories within the details of each tweet.

A while back, Twitter began allowing users to post longer tweets. When scraping Twitter data, if a tweet is under the original character limit (140, I believe), the entire tweet shows up in the text category with no issues, which is exactly how I need it for my future research.

However, any tweets above the character limit appear like this in the 'text' category:

@thedamon @getify I worry adding new terms add complexity and may make it harder for people to learn JavaScript. A… <url>

(Stack Overflow will not allow me to display the short URL that follows, but as I said, it is a short Twitter URL to the full post.)

As you can see, the text cuts off with '…' followed by a link. To view the full text, I need to look at the 'extended_tweet' category, which stores the information like this:

{'full_text': '@thedamon @getify I worry adding new terms add complexity and may make it harder for people to learn JavaScript. A sort function is a function you send to sort. Learning a new acronym to abstract that adds unnecessary complexity.', 'display_text_range': [18, 229], 'entities': {'hashtags': [], 'urls': [], 'user_mentions': [{'screen_name': 'thedamon', 'name': 'Damon Muma', 'id': 29938474, 'id_str': '29938474', 'indices': [0, 9]}, {'screen_name': 'getify', 'name': 'getify', 'id': 16686076, 'id_str': '16686076', 'indices': [10, 17]}], 'symbols': []}}

As you can see, this is a lot more detail than just the text.

I am currently working with Python and attempting to wrap my head around regex. I could easily slice the string from index i to index j, but because the tweets all have different lengths, I need to make sure the slice starts right after 'full_text': and ends just before 'display_text_range'.

I'm not asking for someone to do my homework for me, but I have been stuck on this problem for a while and what I initially thought would be easy has turned out to be a lot more difficult than I expected.

Has anybody got any pointers or suggestions I could look into that could help me solve the problem on my own?

Thanks

I dunno, but try not to add any new acronyms! – Damon, Dec 02 '20 at 04:03

score 0 · Answer 1 · answered Feb 25 '20 at 18:49

Why not parse the JSON and read the full_text property?

import json

data = '''
{"full_text": "@thedamon @getify I worry adding new terms add complexity and may make it harder for people to learn JavaScript. A sort function is a function you send to sort. Learning a new acronym to abstract that adds unnecessary complexity.", "display_text_range": [18, 229], "entities": {"hashtags": [], "urls": [], "user_mentions": [{"screen_name": "thedamon", "name": "Damon Muma", "id": 29938474, "id_str": "29938474", "indices": [0, 9]}, {"screen_name": "getify", "name": "getify", "id": 16686076, "id_str": "16686076", "indices": [10, 17]}], "symbols": []}}'''

# json.loads parses a JSON string into a Python dict
parsed_data = json.loads(data)
print(parsed_data['full_text'])  # prints the full tweet: '@thedamon @getify I worry .... unnecessary complexity.'
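One caveat, offered as a side note rather than part of the original approach: the extended_tweet value shown in the question uses single quotes, which is how Python prints a dict, not valid JSON. If the column really holds strings in that form (an assumption; it may already hold dicts), `json.loads` would reject them, and `ast.literal_eval` is one way to read them instead. A minimal sketch with a shortened, made-up value:

```python
import ast

# Shortened stand-in for one extended_tweet value stored as a string
# in Python dict notation (single quotes), as shown in the question.
raw = "{'full_text': 'A sort function is a function you send to sort.', 'entities': {'hashtags': []}}"

# literal_eval safely evaluates Python literals (dicts, lists, strings, numbers).
extended = ast.literal_eval(raw)
print(extended["full_text"])
```

If the values are already dicts, no parsing is needed and `value['full_text']` can be read directly.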

stackoverflowusrone


Is there a way I can implement this for every line in the JSON file? Some of the tweets contain 'NaN' in the extended_tweet field because they are short enough to fit in text, and I need my code to extract the full text for every tweet in a large file (40k+ tweets) that has an entry in the full_text field – Conor McNally Feb 25 '20 at 19:14
One final comment. Parsing has been exactly what I needed in order to do what I am aiming to achieve. Thank you so much for this, some minor issues with the code right now but nothing I cannot resolve on my own, thanks for pointing me in the right direction! – Conor McNally Feb 25 '20 at 19:21
You can read the file using the `open` function and then parse it accordingly. If the JSON file contains an array of tweets, just parse it, loop over the result, and access each tweet like a Python dict. – stackoverflowusrone Feb 25 '20 at 20:35
Use `json.load` for reading from a file: https://stackoverflow.com/questions/39719689/what-is-the-difference-between-json-load-and-json-loads-functions – stackoverflowusrone Feb 25 '20 at 20:47
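
To make the comments above concrete, here is a rough sketch of applying the idea to every tweet, falling back to text when there is no extended_tweet. It assumes one JSON object per line; the helper names `tweet_text` and `texts_from_lines` and the sample data are made up for illustration, and the real file layout may differ (for a single JSON array, `json.load` plus a loop over the list would play the same role).

```python
import json


def tweet_text(tweet):
    """Return the untruncated text of one tweet dict."""
    extended = tweet.get("extended_tweet")
    # Short tweets have no usable extended_tweet (missing, None, or NaN),
    # so only use it when it is an actual dict with a full_text entry.
    if isinstance(extended, dict) and "full_text" in extended:
        return extended["full_text"]
    return tweet.get("text", "")


def texts_from_lines(lines):
    """Parse an iterable of JSON strings, one tweet per line, skipping blanks."""
    return [tweet_text(json.loads(line)) for line in lines if line.strip()]


# Two made-up tweets standing in for lines of the real file.
sample_lines = [
    '{"text": "short tweet"}',
    '{"text": "long tweet cut off...", "extended_tweet": {"full_text": "long tweet cut off no more"}}',
]
full_texts = texts_from_lines(sample_lines)
print(full_texts)
```

With a real file, the open file object can be passed in place of `sample_lines`, since iterating over a file yields its lines. Checking `isinstance(extended, dict)` is what handles the NaN entries: NaN is a float, so those tweets fall through to the text field.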
