I'm trying to extract tweets from a huge JSON file, but my regex matches too much and I cannot for the life of me figure out how to limit it. It finds what it's meant to, but it also captures text I don't want.
The regex I'm using is as follows (probably more complicated than needed, but that's not what I'm interested in repairing here):
(?:"contributors": .*?, "truncated": .*?, "text": ")([^R][^T].*?)"
For example, here's a truncated line from the JSON file that produces too many matches:
{"contributors": null, "truncated": false, "text": "RT @BelloPromotions: Myke Towers Ft. Mariah - Desaparecemos\n@myketowers #myketowers #mariah @mariah #Desaparecemos #music #musica #musicanu\u2026", "is_quote_status": false, "in_reply_to_status_id": null, "id": 1099558111000506369, "favorite_count": 0, "entities": {"symbols": [], "user_mentions": [{"id": 943461023293542400, "indices": [3, 19], "id_str": "943461023293542400", "screen_name": "BelloPromotions", "name": "Bello Promotions \ud83d\udcc8\ud83d\udcb0"}, {"id": 729572008909000704, "indices": [60, 71], "id_str": "729572008909000704", "screen_name": "MykeTowers", "name": "Towers Myke"}, {"id": 775866464, "indices": [92, 99], "id_str": "775866464", "screen_name": "mariah", "name": "Kenzie peretti"}], "hashtags": [{"indices": [72, 83], "text": "myketowers"}, {"indices": [84, 91], "text": "mariah"}, {"indices": [100, 114], "text": "Desaparecemos"}, {"indices": [115, 121], "text": "music"}, {"indices": [122, 129], "text": "musica"}], "urls": []}, "retweeted": false, "coordinates": null, "source": "<a href=\"http://twitter-dummy-auth.herokuapp.com/\" rel=\"nofollow\">Music Twr Suggesting</a>", "in_reply_to_screen_name": null, "in_reply_to_user_id": null, "retweet_count": 18, "id_str": "1099558111000506369", "favorited": false, "retweeted_status": {"contributors": null, "truncated": true, "text": "Myke Towers Ft. Mariah - Desaparecemos\n@myketowers #myketowers #mariah @mariah #Desaparecemos #music #musica\u2026 [link]", .......
For the example above, my regex captures "myketowers" (the "text" value of the first hashtag) and then the second instance of the tweet (the original tweet, after "retweeted_status"). What I want is just the tweet.
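To narrow it down, here is a minimal reproduction of the over-match. The `line` below is a shortened, made-up stand-in for the real data (not copied from the file) that keeps only the fields the pattern touches; the pattern is the same one I'm using.

```python
import re

# Same pattern as in the question.
pattern = re.compile(
    r'(?:"contributors": .*?, "truncated": .*?, "text": ")([^R][^T].*?)"'
)

# Hypothetical, shortened line: a retweet whose "entities" contain one hashtag,
# followed by the original tweet under "retweeted_status".
line = (
    '{"contributors": null, "truncated": false, "text": "RT @someone: hello #tag", '
    '"entities": {"hashtags": [{"indices": [20, 24], "text": "tag"}]}, '
    '"retweeted_status": {"contributors": null, "truncated": true, '
    '"text": "hello #tag"}}'
)

matches = pattern.findall(line)
print(matches)
```

This prints `['tag', 'hello #tag']`: the hashtag's "text" value first, then the original tweet, which mirrors what happens on the real file.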
Here's the Python code I'm running (it's not throwing any errors and it does exactly what I want it to, just too much):
import re
import codecs

err_occur = []  # collected matches (tweet texts)
pattern = re.compile(r'(?:"contributors": .*?, "truncated": .*?, "text": ")([^R][^T].*?)"')
input_filename = 'music_fixed.json'

try:
    with codecs.open(input_filename, encoding='utf8') as in_file:
        for line in in_file:
            # findall returns the captured group for every match on the line
            err_occur.extend(pattern.findall(line))
except FileNotFoundError:
    print("Input file %r not found." % input_filename)

# Write one match per line; the with block closes the output file
with open("tweets_380k.txt", "w") as tweets:
    for tagged in err_occur:
        tweets.write(str(tagged) + "\n")
As explained above, the expected output for the JSON line posted above is:
Myke Towers Ft. Mariah - Desaparecemos\n@myketowers #myketowers #mariah @mariah #Desaparecemos #music #musica\u2026 [link]
What ends up getting written to my text file is:
myketowers
Myke Towers Ft. Mariah - Desaparecemos\n@myketowers #myketowers #mariah @mariah #Desaparecemos #music #musica\u2026 [link]
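One direction that might avoid the over-match altogether is to parse each line instead of pattern-matching it. The sketch below assumes that every line of the file is one complete JSON object, which the truncated sample above can't confirm. `original_tweet_texts` is a hypothetical helper name, the sample line is made up, and the `"RT "` prefix check is an assumed stand-in for `[^R][^T]`.

```python
import json


def original_tweet_texts(line):
    """Return the non-retweet "text" values from one JSON-encoded tweet."""
    tweet = json.loads(line)
    texts = []
    # Check the tweet itself and, if present, the tweet it retweets.
    # Nested "text" keys (such as hashtags under "entities") are never visited.
    for status in (tweet, tweet.get("retweeted_status")):
        if isinstance(status, dict):
            text = status.get("text", "")
            # Assumption: a retweet's text starts with "RT ".
            if text and not text.startswith("RT "):
                texts.append(text)
    return texts


# Hypothetical, shortened line (not taken from the real file).
sample = (
    '{"contributors": null, "truncated": false, "text": "RT @someone: hello #tag", '
    '"entities": {"hashtags": [{"indices": [20, 24], "text": "tag"}]}, '
    '"retweeted_status": {"contributors": null, "truncated": true, '
    '"text": "hello #tag"}}'
)

print(original_tweet_texts(sample))
```

Note that `json.loads` decodes escapes such as `\n` and `\u2026`, so the returned strings differ from the raw text a regex captures, and a tweet containing a newline would span more than one line when written out.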