I'm trying to extract tweets from a huge JSON file, but my regex returns too much data and I cannot for the life of me figure out how to limit it. It finds what it's meant to, but it also matches things I don't want.

The regex I'm using is as follows (probably more complicated than needed, but that's not what I'm interested in repairing here):

(?:"contributors": .*?, "truncated": .*?, "text": ")([^R][^T].*?)"

For example, here's a truncated line from the JSON file that produces too much output:

{"contributors": null, "truncated": false, "text": "RT @BelloPromotions: Myke Towers Ft. Mariah - Desaparecemos\n@myketowers #myketowers #mariah @mariah #Desaparecemos #music #musica #musicanu\u2026", "is_quote_status": false, "in_reply_to_status_id": null, "id": 1099558111000506369, "favorite_count": 0, "entities": {"symbols": [], "user_mentions": [{"id": 943461023293542400, "indices": [3, 19], "id_str": "943461023293542400", "screen_name": "BelloPromotions", "name": "Bello Promotions \ud83d\udcc8\ud83d\udcb0"}, {"id": 729572008909000704, "indices": [60, 71], "id_str": "729572008909000704", "screen_name": "MykeTowers", "name": "Towers Myke"}, {"id": 775866464, "indices": [92, 99], "id_str": "775866464", "screen_name": "mariah", "name": "Kenzie peretti"}], "hashtags": [{"indices": [72, 83], "text": "myketowers"}, {"indices": [84, 91], "text": "mariah"}, {"indices": [100, 114], "text": "Desaparecemos"}, {"indices": [115, 121], "text": "music"}, {"indices": [122, 129], "text": "musica"}], "urls": []}, "retweeted": false, "coordinates": null, "source": "<a href=\"http://twitter-dummy-auth.herokuapp.com/\" rel=\"nofollow\">Music Twr Suggesting</a>", "in_reply_to_screen_name": null, "in_reply_to_user_id": null, "retweet_count": 18, "id_str": "1099558111000506369", "favorited": false, "retweeted_status": {"contributors": null, "truncated": true, "text": "Myke Towers Ft. Mariah - Desaparecemos\n@myketowers #myketowers #mariah @mariah #Desaparecemos #music #musica\u2026 [link]", .......

From the example above, my regex prints out "myketowers" and then the second instance of the tweet (the original tweet -- after "retweeted_status"). What I want is just the tweet.

Here's the Python code I'm running (it's not throwing any errors and it does exactly what I want it to, just too much):

import re
import codecs

err_occur = []
pattern = re.compile(r'(?:"contributors": .*?, "truncated": .*?, "text": ")([^R][^T].*?)"')
input_filename = 'music_fixed.json'

try:
    with codecs.open(input_filename, encoding='utf8') as in_file:
        for line in in_file:
            # findall returns the captured group of every match on this line
            err_occur.extend(pattern.findall(line))
except FileNotFoundError:
    print("Input file %r not found." % input_filename)

# Write one captured tweet per line; the with-block closes the file
with open("tweets_380k.txt", "w") as tweets:
    for tagged in err_occur:
        tweets.write(str(tagged) + "\n")

As explained above, the expected output of the regex for the line of the JSON posted is:

Myke Towers Ft. Mariah - Desaparecemos\n@myketowers #myketowers #mariah @mariah #Desaparecemos #music #musica\u2026 [link]

What ends up getting written to my text file is:

myketowers
Myke Towers Ft. Mariah - Desaparecemos\n@myketowers #myketowers #mariah @mariah #Desaparecemos #music #musica\u2026 [link]
Arie G.
  • When I try it, the regex does not seem to find any matches in your truncated line... your output looks like you might have nested groups, but I cannot reproduce that with your regex. – tobias_k Aug 29 '19 at 12:10
  • I would not use regex for that. Consider your huge file infinite: a stream. You could then parse your file as a JSON stream. I have done this in C#, but there should be usable answers for Python as well. Start here: https://stackoverflow.com/questions/6886283/how-i-can-i-lazily-read-multiple-json-values-from-a-file-stream-in-python – ZorgoZ Aug 29 '19 at 12:10
  • Good point: Why use regex if you actually have well-structured JSON data? Just use `json` to load the file and get the contents of the `"text"` fields. – tobias_k Aug 29 '19 at 12:12
  • I'm just trying to extract the tweets to a text file in order to generate some basic statistics such as average word count, etc. – Arie G. Aug 29 '19 at 12:13
  • @ArieG. Regex is expensive compared to parsing JSON, even if you are loading the whole file. – ZorgoZ Aug 29 '19 at 12:29
  • @tobias_k - When I try it, I get exactly the shown results in the output file. – Armali Aug 30 '19 at 11:00

2 Answers

How to limit regex results?

Before answering the question, I should clarify why the present expression yields an unwanted result. In the sub-expression (?:"contributors": .*?, "truncated": .*?, "text": "), the last .*?, despite being non-greedy, ends up matching all of this input:

false, "text": "RT @BelloPromotions: Myke Towers Ft. Mariah - Desaparecemos\n@myketowers #myketowers #mariah @mariah #Desaparecemos #music #musica #musicanu\u2026", "is_quote_status": false, "in_reply_to_status_id": null, "id": 1099558111000506369, "favorite_count": 0, "entities": {"symbols": [], "user_mentions": [{"id": 943461023293542400, "indices": [3, 19], "id_str": "943461023293542400", "screen_name": "BelloPromotions", "name": "Bello Promotions \ud83d\udcc8\ud83d\udcb0"}, {"id": 729572008909000704, "indices": [60, 71], "id_str": "729572008909000704", "screen_name": "MykeTowers", "name": "Towers Myke"}, {"id": 775866464, "indices": [92, 99], "id_str": "775866464", "screen_name": "mariah", "name": "Kenzie peretti"}], "hashtags": [{"indices": [72, 83]

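i.e. everything after the first "truncated": up to the next , "text": " that is not followed by "RT…", which is the one just before the unwanted "myketowers". A non-greedy quantifier only prefers the shortest match; when the rest of the pattern fails (here because [^R] rejects the R of "RT"), the engine backtracks and lets .*? grow until the whole pattern succeeds.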

So, to stop the expression from matching all that input, we can simply disallow arbitrary characters (.) between "truncated": and , "text": and permit only the characters that form the possible values false and true there, or, for the sake of simplicity, only word characters (\w). Hence it suffices to change the above sub-expression to (?:"contributors": .*?, "truncated": \w*, "text": ").
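A quick way to check this is to run both expressions on a shortened, made-up stand-in for the line in the question. The sample text and the variable names below are mine, not taken from the real file, so treat this as a sketch rather than a test against your data:

```python
import re

# Shortened, hypothetical stand-in for the JSON line in the question.
line = (
    '{"contributors": null, "truncated": false, "text": "RT @someone: hello", '
    '"entities": {"hashtags": [{"indices": [72, 83], "text": "myketowers"}]}, '
    '"retweeted_status": {"contributors": null, "truncated": true, '
    '"text": "Myke Towers Ft. Mariah"}}'
)

# Expression from the question: the second .*? may grow past the first tweet.
original = re.compile(
    r'(?:"contributors": .*?, "truncated": .*?, "text": ")([^R][^T].*?)"'
)
# Same expression with \w* between "truncated": and , "text":
fixed = re.compile(
    r'(?:"contributors": .*?, "truncated": \w*, "text": ")([^R][^T].*?)"'
)

original_matches = original.findall(line)
fixed_matches = fixed.findall(line)

print(original_matches)  # ['myketowers', 'Myke Towers Ft. Mariah']
print(fixed_matches)     # ['Myke Towers Ft. Mariah']
```

With \w*, the hashtag's "text" field can no longer be reached from the first "truncated":, so only the tweet itself is captured.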

Armali

As others have remarked in comments, you should probably be using a JSON parser and taking it from there.
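A rough sketch of the parser route, under two assumptions of mine: that the file holds one JSON object per line (your loop reads it line by line, so I am guessing it does), and that for a retweet you want the original tweet stored under "retweeted_status", as in your expected output. The function name extract_tweets and the sample lines are made up for illustration:

```python
import json


def extract_tweets(lines):
    """Yield one tweet text per JSON line, preferring the original of a retweet."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        tweet = json.loads(line)
        # Assumption: retweets carry the original tweet under "retweeted_status".
        if "retweeted_status" in tweet:
            yield tweet["retweeted_status"]["text"]
        else:
            yield tweet["text"]


# Hypothetical sample input, standing in for lines read from the real file.
sample = [
    '{"contributors": null, "truncated": false, "text": "RT @a: hi", '
    '"retweeted_status": {"contributors": null, "truncated": true, "text": "hi there"}}',
    '{"contributors": null, "truncated": false, "text": "plain tweet"}',
]
tweets = list(extract_tweets(sample))
print(tweets)  # ['hi there', 'plain tweet']
```

Because the function takes any iterable of lines, you can pass the open file object directly and never hold more than one line in memory. Note that json.loads turns escapes such as \n into real newlines, so you may want to replace those before writing one tweet per line.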

However, if your input is not JSON (or pulling it all into memory at once is not feasible), there are a couple of tweaks you should make to your regex.

Firstly (and again, as others have already remarked), .*? is only "non-greedy" in the sense that it will find the shortest possible match; it will still find a match if there is one. I'm guessing you could trim this to

(?:[^"\\]*\\.)*[^"\\]*

to only grab strings which do not contain unescaped double quotes.
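To illustrate the difference, here is a small sketch comparing .*? with one balanced way of writing such a "no unescaped double quote" pattern. The sample string is made up, loosely modelled on the "source" field in your data, and the name string_body is just for this example:

```python
import re

# Body of a quoted string: ordinary characters, with every backslash
# consumed together with the character it escapes.
string_body = r'(?:[^"\\]*\\.)*[^"\\]*'

# Hypothetical sample containing escaped double quotes inside the value.
sample = r'"source": "<a href=\"http://example.com/\">Example</a>", "id": 1'

# .*? stops at the first double quote, even though it is escaped.
lazy = re.findall(r'"source": "(.*?)"', sample)
# The stricter pattern skips over \" and stops at the real closing quote.
strict = re.findall(r'"source": "(' + string_body + r')"', sample)

print(lazy)    # ['<a href=\\']
print(strict)  # ['<a href=\\"http://example.com/\\">Example</a>']
```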

Secondly, I'm guessing you were hoping [^R][^T] would skip matches which contain RT at the start; but that's not what it means. It requires one character which is not R followed by one character which is not T. So it will also reject text starting with AT or Re!

In Python (and generally PCRE-compatible) regex the way to say "must not match" is a negative lookahead (?!RT).
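A small sketch of the difference, using a few made-up strings (the sample list and variable names are only for illustration):

```python
import re

# Two character classes: first char must not be R AND second must not be T.
char_classes = re.compile(r'^[^R][^T].*')
# Negative lookahead: reject only text that starts with the two letters RT.
lookahead = re.compile(r'^(?!RT).*')

samples = ["RT @user: hello", "Reading now", "AT the show", "Myke Towers"]

by_classes = [s for s in samples if char_classes.match(s)]
by_lookahead = [s for s in samples if lookahead.match(s)]

print(by_classes)    # ['Myke Towers']
print(by_lookahead)  # ['Reading now', 'AT the show', 'Myke Towers']
```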

Pulling this all together, and assuming "contributors" and "truncated" always hold a bare word such as null, true or false (as in your sample line), try

pattern = re.compile(r'"contributors": \w+, "truncated": \w+,'
    r' "text": "((?!RT)(?:[^"\\]*\\.)*[^"\\]*)"')

Please understand that I had to guess or read between the lines in a couple of places here. If you can update your question to explain exactly what your data looks like and how you hope the logic should work, this could probably be improved or at least tweaked to do what you really want.

tripleee