Parsing tweets in JSON format to find Twitter users

Question

I am reading a Twitter feed in JSON format to count the number of users. Some lines in the input file might not be tweets, but messages that the Twitter server sent to the developer (such as limit notices). I need to ignore these messages.

These messages would not contain the created_at field and can be filtered out accordingly.

I have written the following code to extract the valid tweets and then extract the user.id and the text:

import json

def safe_parse(raw_json):
    # Parse one input line; keep it only if it looks like a tweet.
    try:
        json_object = json.loads(raw_json)
        if 'created_at' in json_object:
            return json_object
        else:
            return
    except ValueError:
        return

def get_usr_txt(line):
    tmp = safe_parse(line)
    if tmp is not None:
        return (tmp.get('user').get('id_str'), tmp.get('text'))
    else:
        return

My challenge is that I get one extra user called "None".

Here is a sample of the output (it is a large file):

('49838600', 'This is the temperament you look for in a guy who would have access to our nuclear arsenal. ), None, ('2678507624', 'RT @GrlSmile: @Ricky_Vaughn99 Yep, which is why in 1992 I switched from Democrat to Republican to vote Pat Buchanan, who warned of all of t…'),

I am struggling to find out what I am doing wrong. There is no None in the Twitter file, so I assume I am reading the {"limit":{"track":1,"timestamp_ms":"1456249416070"}} message, but the code above should not include it, unless I am missing something.

Any pointers? Thanks for your help and your time.

Alper t. Turker · Accepted Answer · 2018-05-19T10:53:18.160

"Some lines in the input file might not be tweets, but messages that the Twitter server sent to the developer (such as limit notices). I need to ignore these messages."

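That's not exactly what happens. If one of the following is true:

raw_json is not a valid JSON document
created_at is not in the parsed object

safe_parse returns the default value, which is None, and map emits one output per input line, so that None ends up in the result. If you want to ignore these, you can add a filter step between the two operations (get_usr_txt then receives the already parsed object, so it should not call safe_parse again):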

rdd.map(safe_parse).filter(lambda x: x).map(get_usr_txt)
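
To see the effect without a Spark cluster, here is a minimal sketch that uses Python's built-in map and filter in place of the RDD methods. The sample lines are made up for illustration, and get_usr_txt is assumed to take an already parsed tweet, since safe_parse now runs as its own step.

```python
import json

def safe_parse(raw_json):
    """Return the parsed tweet, or None for invalid JSON and non-tweet messages."""
    try:
        json_object = json.loads(raw_json)
    except ValueError:
        return None
    return json_object if 'created_at' in json_object else None

def get_usr_txt(tweet):
    """Extract (user id, text) from an already parsed tweet (assumed variant)."""
    return (tweet['user']['id_str'], tweet['text'])

# Hypothetical sample input: one tweet, one limit notice, one broken line.
lines = [
    '{"created_at": "Tue Feb 23", "user": {"id_str": "1"}, "text": "hello"}',
    '{"limit": {"track": 1, "timestamp_ms": "1456249416070"}}',
    'not json',
]

parsed = map(safe_parse, lines)        # one output per input, None included
tweets = filter(lambda x: x, parsed)   # drops the falsy None entries
result = list(map(get_usr_txt, tweets))
print(result)
```

Only the real tweet survives the filter step, so no None reaches get_usr_txt.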

You can also use the flatMap trick to avoid the filter and simplify your code (borrowed from this answer by zero323):

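import json

def safe_parse(raw_json):
    # Generator: yields the tweet, or nothing at all for lines to skip.
    try:
        json_object = json.loads(raw_json)
    except ValueError:
        return  # invalid JSON: end the generator without yielding
    else:
        if 'created_at' in json_object:
            yield json_object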
rdd.flatMap(safe_parse).map(get_usr_txt)
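
The reason this works is that flatMap flattens whatever each call returns, so a generator that yields nothing contributes no element. A rough sketch of the idea in plain Python, where flat_map is a hypothetical local helper built on itertools (not a Spark API) and the sample lines are invented:

```python
import json
from itertools import chain

def safe_parse(raw_json):
    """Yield the parsed tweet, or nothing for invalid JSON and non-tweets."""
    try:
        json_object = json.loads(raw_json)
    except ValueError:
        return  # ends the generator without yielding anything
    if 'created_at' in json_object:
        yield json_object

def flat_map(func, items):
    """Hypothetical stand-in for flatMap: apply func, then flatten the results."""
    return chain.from_iterable(map(func, items))

# Hypothetical sample input: one tweet, one limit notice, one broken line.
lines = [
    '{"created_at": "Tue Feb 23", "user": {"id_str": "1"}, "text": "hello"}',
    '{"limit": {"track": 1, "timestamp_ms": "1456249416070"}}',
    'not json',
]

# How many elements each input line contributes after flattening.
counts = [len(list(safe_parse(line))) for line in lines]
tweets = list(flat_map(safe_parse, lines))
print(counts)
print(tweets)
```

Each line contributes either one element or none, so the skipped lines never appear downstream.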

edited May 19 '18 at 10:53

answered May 19 '18 at 10:47


Thanks user9613318, the default None return made sense. I was not aware of it. Now I can avoid it, but just to get my understanding right, how do filter and lambda avoid the None value? Can you explain a bit, if possible? – Rvsvgs May 19 '18 at 17:57

The trick is that `bool(None)` is `False`, so `None` will be ignored. You can use the more verbose `filter(lambda x: x is not None)` if you prefer. – Alper t. Turker May 19 '18 at 20:04
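
The two filters are not quite equivalent, which may matter for other data: the truthiness test also drops empty dicts, empty strings and zero, while the explicit test drops only None. A small illustration with made-up values, using the built-in filter rather than an RDD:

```python
# Made-up values mixing None with other falsy objects.
values = [{'id': 1}, None, {}, 0, '', 'text']

# Truthiness test: removes None and every other falsy value.
truthy = list(filter(lambda x: x, values))

# Explicit test: removes None only.
not_none = list(filter(lambda x: x is not None, values))

print(truthy)    # [{'id': 1}, 'text']
print(not_none)  # [{'id': 1}, {}, 0, '', 'text']
```

For parsed tweets, which are non-empty dicts, both forms give the same result.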
