I used a Twitter scraper to download tweets about a sporting event that has already taken place. Unfortunately, due to the nature of the research, I cannot go back and modify my scraper, as the event will not occur again. Each tweet is divided into several categories, such as timestamp, date_created etc.
These tweets are stored in a JSON file, and I am currently loading them into pandas.
What I am focusing on is the 'text' and 'extended_tweet' categories within the details of each tweet.
A while back, Twitter enabled users to post longer tweets. When scraping Twitter data, if a tweet is under the original 140-character limit, the entire tweet shows up in the 'text' category with no issues, which is just how I need it for my future research.
However, any tweets above the character limit appear like this in the 'text' category:
@thedamon @getify I worry adding new terms add complexity and may make it harder for people to learn JavaScript. A… <url>
StackOverflow will not allow me to display the short URL that follows the text, but essentially it's a short Twitter URL pointing to the full post.
As you can see, the text cuts off with '…' followed by a link. To view the full text, I need to look at the 'extended_tweet' category, which holds the information like this:
{'full_text': '@thedamon @getify I worry adding new terms add complexity and may make it harder for people to learn JavaScript. A sort function is a function you send to sort. Learning a new acronym to abstract that adds unnecessary complexity.', 'display_text_range': [18, 229], 'entities': {'hashtags': [], 'urls': [], 'user_mentions': [{'screen_name': 'thedamon', 'name': 'Damon Muma', 'id': 29938474, 'id_str': '29938474', 'indices': [0, 9]}, {'screen_name': 'getify', 'name': 'getify', 'id': 16686076, 'id_str': '16686076', 'indices': [10, 17]}], 'symbols': []}}
As you can see, this is a lot more detail than just the text.
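For illustration, here is a minimal sketch of pulling the full text out of that structure. It assumes each tweet has already been parsed into a Python dict (for example with json.load), so that 'extended_tweet' is a nested dict rather than a string. The function name full_tweet_text and the strip_mentions flag are made up for this example, and the short sample tweet is invented.

```python
def full_tweet_text(tweet, strip_mentions=False):
    """Return the untruncated text of one tweet.

    Hypothetical helper: assumes `tweet` behaves like a dict with a 'text'
    key and, for long tweets only, an 'extended_tweet' dict.
    """
    extended = tweet.get("extended_tweet")
    # Short tweets have no usable 'extended_tweet' (missing, None or NaN).
    if isinstance(extended, dict) and "full_text" in extended:
        text = extended["full_text"]
        if strip_mentions and "display_text_range" in extended:
            # display_text_range marks the visible part of the tweet,
            # i.e. the text without the leading @mentions.
            start, end = extended["display_text_range"]
            text = text[start:end]
        return text
    return tweet["text"]


# Invented short tweet: everything fits in 'text'.
short_tweet = {"text": "Short tweet, nothing cut off."}

# The long tweet from above, reduced to the keys used here.
long_tweet = {
    "text": "@thedamon @getify I worry adding new terms add complexity and "
            "may make it harder for people to learn JavaScript. A… <url>",
    "extended_tweet": {
        "full_text": "@thedamon @getify I worry adding new terms add "
                     "complexity and may make it harder for people to learn "
                     "JavaScript. A sort function is a function you send to "
                     "sort. Learning a new acronym to abstract that adds "
                     "unnecessary complexity.",
        "display_text_range": [18, 229],
    },
}

print(full_tweet_text(short_tweet))
print(full_tweet_text(long_tweet))
print(full_tweet_text(long_tweet, strip_mentions=True))
```

If the tweets are already in a DataFrame with 'text' and 'extended_tweet' columns, something like df.apply(full_tweet_text, axis=1) should apply the same logic row by row, though I have not checked that against my actual data.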
I am currently working with Python and attempting to wrap my head around regex. I could easily slice the string from index i to index j, but because the tweets all have different lengths, I need to slice from the point just after 'full_text': up to the point just before 'display_text_range'.
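A rough sketch of that slicing, for the case where 'extended_tweet' has ended up as a string rather than a dict. It assumes the string is the Python repr of the dict, as in the example above. The names FULL_TEXT_RE, full_text_by_regex and full_text_by_parsing are made up, and the sample is a shortened, invented version of the tweet. The second function uses ast.literal_eval instead of regex, which might be the more robust route.

```python
import ast
import re

# Capture whatever sits between 'full_text': and 'display_text_range'.
# The value may be wrapped in single or double quotes, because Python
# switches to double quotes when the text itself contains an apostrophe.
FULL_TEXT_RE = re.compile(
    r"'full_text': (['\"])(.*?)\1, 'display_text_range'", re.DOTALL
)


def full_text_by_regex(raw):
    """Slice the full text out of the string form of 'extended_tweet'.

    Escape sequences such as \\n stay escaped in the result.
    """
    match = FULL_TEXT_RE.search(raw)
    return match.group(2) if match else None


def full_text_by_parsing(raw):
    """Turn the string back into a dict, then read the key directly."""
    return ast.literal_eval(raw)["full_text"]


# Shortened, invented sample in the same shape as the real data.
sample = {
    "full_text": "@thedamon @getify I worry adding new terms add complexity.",
    "display_text_range": [18, 58],
    "entities": {"hashtags": [], "urls": []},
}
raw = str(sample)  # the string form a DataFrame cell might hold

print(full_text_by_regex(raw))
print(full_text_by_parsing(raw))
```

The regex relies on 'display_text_range' always coming directly after 'full_text', which holds in my example but is an assumption about the rest of the data.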
I'm not asking for someone to do my homework for me, but I have been stuck on this problem for a while and what I initially thought would be easy has turned out to be a lot more difficult than I expected.
Has anybody got any pointers or suggestions I could look into that could help me solve the problem on my own?
Thanks