
I've created a very simple piece of code that reads tweets in JSON format from text files, determines whether they contain an id and coordinates, and if so writes those attributes to a CSV file. This is the code:

import csv
import glob
import simplejson

f = csv.writer(open('GeotaggedTweets/ListOfTweets.csv', 'wb+'))
all_files = glob.glob('SampleTweets/*.txt')
for filename in all_files:
    with open(filename, 'r') as file:
        data = simplejson.load(file)
        if 'text' in data and 'coordinates' in data:
            f.writerow([data['id'], data['geo']['coordinates']])

I've been having some difficulties, but with the help of the excellent JSON Lint website I've realised my mistake: I have multiple JSON objects, and from what I've read these need to be separated by commas, with square brackets added at the start and end of the file.

How can I achieve this? I've seen some examples online where each line is read individually and the brackets are added to the first and last lines, but as I load the whole file at once I'm not entirely sure how to do this.

Andrew Martin
    If your JSON strings are saved to the file as one entry *per line*, you don't need to do that. See [Loading & Parsing JSON file in python](http://stackoverflow.com/a/12451465) – Martijn Pieters Aug 13 '13 at 13:46
  • Following that link gives me a "No JSON object could be decoded" error – Andrew Martin Aug 13 '13 at 13:49
  • I said 'if'. How did you *create* the file in the first place? – Martijn Pieters Aug 13 '13 at 13:50
  • Sorry! I downloaded the tweets using Flume and piped them into Hadoop. I then used Hadoop's "copyToLocal" method to output them into a text file. If I take each individual tweet and put it into JSON Lint, they're all perfect. It's when they're together they're problematic. – Andrew Martin Aug 13 '13 at 13:52

1 Answer


You have a file that either contains too many newlines (in the JSON values themselves) or too few (no newlines between the tweets at all).

You can still repair this by using some creative re-stitching. The following generator function should do it:

import json

def read_objects(filename):
    """Yield each JSON object in *filename*, re-stitching objects that are
    split across lines and splitting lines that hold several objects."""
    decoder = json.JSONDecoder()

    with open(filename, 'r') as inputfile:
        line = next(inputfile, '').strip()
        while line:
            try:
                # raw_decode() returns the object plus the offset where it
                # ended, so a line holding several objects is consumed one
                # object at a time.
                obj, index = decoder.raw_decode(line)
                yield obj
                line = line[index:].lstrip()
            except ValueError:
                # Assume we didn't have a complete object yet; pull in the
                # next line and try again.
                nextline = next(inputfile, None)
                if nextline is None:
                    return  # EOF with an incomplete trailing object
                line += nextline.strip()
            if not line:
                line = next(inputfile, '').strip()

This should be able to read all your JSON objects in sequence:

for filename in all_files:
    for data in read_objects(filename):
        if 'text' in data and 'coordinates' in data:
            f.writerow([data['id'], data['geo']['coordinates']])

It is otherwise fine to write multiple JSON strings to one file, but you need to make sure the entries are clearly separated somehow. For example, if you write each JSON entry without internal newlines and put a newline between entries, you can later read them back one by one and process them sequentially without this much hassle.
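A minimal sketch of that newline-delimited approach (the file name `tweets.jsonl` and the sample records are made up for illustration):

```python
import json

# Illustrative records; real tweets would have many more fields.
tweets = [
    {'id': 1, 'geo': {'coordinates': [51.5, -0.1]}},
    {'id': 2, 'geo': None},
]

# Write one compact JSON object per line; json.dumps() never emits a raw
# newline inside a compact entry, so the newline is an unambiguous separator.
with open('tweets.jsonl', 'w') as out:
    for tweet in tweets:
        out.write(json.dumps(tweet) + '\n')

# Reading back is then simply one json.loads() per line.
with open('tweets.jsonl') as infile:
    loaded = [json.loads(line) for line in infile]
```

With that layout every line is a complete JSON document, so the round-trip needs no re-stitching at all.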

Martijn Pieters
  • Thanks for the effort you put into helping me. I do have a different error now though. When using this I get: TypeError: 'NoneType' object has no attribute '__getitem__' – Andrew Martin Aug 13 '13 at 14:04
  • I don't know if it makes a difference or not, but i know for a fact that not all the tweets I am parsing have the geo and coordinates fields I am looking for. Only some of them do. – Andrew Martin Aug 13 '13 at 14:05
  • @AndrewMartin: Presumably the exception you get is for the `f.writerow([])` line. If it is on another line, do let me know. – Martijn Pieters Aug 13 '13 at 14:07
  • I tried `if data['geo'] is not None:` but that didn't work right (it ignored some tweets that definitely had coordinates). The `data.data('geo')` version caused: AttributeError: 'dict' object has no attribute 'data' – Andrew Martin Aug 13 '13 at 14:15
  • @AndrewMartin: That is entirely correct, because I made a typo there, sorry! I wanted to advise you to use `if data.get('geo') is not None:` instead. – Martijn Pieters Aug 13 '13 at 14:17
  • @AndrewMartin: That test guards against tweets that do not *have* a `geo` key, as well as tweets that do have that key but left it empty. – Martijn Pieters Aug 13 '13 at 14:18
  • But reading between the lines here, I surmise that my method of *reading* your tweets is working, and you have a *new* problem, namely how to process the tweets that you are now reading. Perhaps you want to ask a *new* question about that? – Martijn Pieters Aug 13 '13 at 14:18
  • The thing is, I have three tweet files currently with sample tweets. Two of them have a single line, whilst the third is packed with about fifty lines. The two single line ones work perfectly with your code, but the fifty line one is blank. Should I ask a new question and if so, what title? – Andrew Martin Aug 13 '13 at 14:20
  • Which means that the 50 tweet file is not being read correctly at all; any way you can *share* that file? I have so far been stabbing at this in the dark, really. – Martijn Pieters Aug 13 '13 at 14:23
  • I'm trying to share it now, will comment again with results – Andrew Martin Aug 13 '13 at 14:28
  • Apologies - there's actually a couple of hundred lines in that tweet file, not 50. This is a link to the text files: https://github.com/amartin903/Twitter (note, the big one must be viewed as raw) – Andrew Martin Aug 13 '13 at 14:32
  • @AndrewMartin: I adjusted the code a little; but do note you appear to have *one* tweet per line, still! The code now reads 599 objects from that file. – Martijn Pieters Aug 13 '13 at 14:40
  • I've added the strip() but unfortunately still no luck. If I run the second file, I get one set of coordinates. If I run the other file, I get nothing, just a blank file. – Andrew Martin Aug 13 '13 at 14:45
  • As this comment is getting very long, I've asked a separate question here: http://stackoverflow.com/questions/18212543/trouble-parsing-multiple-json-files-containing-tweets – Andrew Martin Aug 13 '13 at 14:58