6

I finally have the output I need from a file containing many json objects, but I need some help converting the output below into a single dataframe as it loops through the data. Here is the code that produces the output, including a sample of what the output looks like:

original data:

{
"zipcode":"08989",
"current":{"canwc":null,"cig":4900,"class":"observation","clds":"OVC","day_ind":"D","dewpt":19,"expireTimeGMT":1385486700,"feels_like":34,"gust":null,"hi":37,"humidex":null,"icon_code":26,"icon_extd":2600,"max_temp":37,"wxMan":"wx1111"},
"triggers":[53,31,9,21,48,7,40,178,55,179,176,26,103,175,33,51,20,57,112,30,50,113]
}
{
"zipcode":"08990",
"current":{"canwc":null,"cig":4900,"class":"observation","clds":"OVC","day_ind":"D","dewpt":19,"expireTimeGMT":1385486700,"feels_like":34,"gust":null,"hi":37,"humidex":null,"icon_code":26,"icon_extd":2600,"max_temp":37, "wxMan":"wx1111"},
"triggers":[53,31,9,21,48,7,40,178,55,179,176,26,103,175,33,51,20,57,112,30,50,113]
}

import glob
import itertools
import json
from itertools import chain

import pandas as pd

def lines_per_n(f, n):
    for line in f:
        yield ''.join(chain([line], itertools.islice(f, n - 1)))

for fin in glob.glob('*.txt'):
    with open(fin) as f:
        for chunk in lines_per_n(f, 5):
            try:
                jfile = json.loads(chunk)
                zipcode = jfile['zipcode']
                datetime = jfile['current']['proc_time']
                triggers = jfile['triggers']
                print pd.Series(jfile['zipcode']),\
                      pd.Series(jfile['current']['proc_time']),\
                      jfile['triggers']
            except ValueError, e:
                pass
            else:
                pass

Sample output I get when I run the above, which I would like to store in a pandas dataframe as 3 columns:

08988 20131126102946 []
08989 20131126102946 [53, 31, 9, 21, 48, 7, 40, 178, 55, 179]
08988 20131126102946 []
08989 20131126102946 [53, 31, 9, 21, 48, 7, 40, 178, 55, 179]
00544 20131126102946 [178, 30, 176, 103, 179, 112, 21, 20, 48]

So the code below seems a lot closer, in that it gives me a funky df if I pass in the list and transpose the df. Any idea how I can get this reshaped properly?

def series_chunk(chunk):
    jfile = json.loads(chunk)
    zipcode = jfile['zipcode']
    datetime = jfile['current']['proc_time']
    triggers = jfile['triggers']
    return jfile['zipcode'],\
            jfile['current']['proc_time'],\
            jfile['triggers']

for fin in glob.glob('*.txt'):
    with open(fin) as f:
        for chunk in lines_per_n(f, 7):
            df1 = pd.DataFrame(list(series_chunk(chunk)))
            print df1.T

[u'08988', u'20131126102946', []]
[u'08989', u'20131126102946', [53, 31, 9, 21, 48, 7, 40, 178, 55, 179]]
[u'08988', u'20131126102946', []]
[u'08989', u'20131126102946', [53, 31, 9, 21, 48, 7, 40, 178, 55, 179]]

Dataframe:

   0               1   2
0  08988  20131126102946  []
       0               1                                                  2
0  08989  20131126102946  [53, 31, 9, 21, 48, 7, 40, 178, 55, 179, 176, ...
       0               1   2
0  08988  20131126102946  []
       0               1                                                  2
0  08989  20131126102946  [53, 31, 9, 21, 48, 7, 40, 178, 55, 179, 176, ...

Here is my final code and output. How do I capture each dataframe it creates through the loop and concatenate them on the fly as one dataframe object?

for fin in glob.glob('*.txt'):
    with open(fin) as f:
        print pd.concat([series_chunk(chunk) for chunk in lines_per_n(f, 7)], axis=1).T

       0               1                                                  2
0  08988  20131126102946                                                 []
1  08989  20131126102946  [53, 31, 9, 21, 48, 7, 40, 178, 55, 179, 176, ...
       0               1                                                  2
0  08988  20131126102946                                                 []
1  08989  20131126102946  [53, 31, 9, 21, 48, 7, 40, 178, 55, 179, 176, ...
horatio1701d
  • What is the output you *want*? – Andy Hayden Dec 17 '13 at 22:00
  • exactly as you have it in your read_csv dataframe method with the regex. Just 3 columns with the keys and values from the specified 3 keys. – horatio1701d Dec 17 '13 at 22:09
  • I think you're going to be able to wrap this with a concat, something like: pd.concat([series_chunk(chunk) for chunk in lines_per_n(f, 5)]), where series_chunk is the function returning each row as a Series (the bit in the try/except block). – Andy Hayden Dec 17 '13 at 22:13
  • @AndyHayden Thank you for the help. I'm a little stuck with the final step of concatenating into a df object. I've updated my code and output. Would appreciate any guidance. – horatio1701d Dec 18 '13 at 11:03
  • Appended to my answer, should get you on the right track. – Andy Hayden Dec 18 '13 at 18:57
  • Thank you so much, extremely helpful. I'm getting the following error though: "Index(...) must be called with a collection of some kind, u'20131126102946' was passed" – horatio1701d Dec 18 '13 at 23:08
  • there was a typo when I created the Series, missing []. – Andy Hayden Dec 18 '13 at 23:09
  • Thank you again. Just trying to wrangle the final output I'm getting now (appended). I was expecting df.T to work but it didn't do transpose columns and rows. – horatio1701d Dec 18 '13 at 23:19
  • More bugs from me, you need to use concat with axis=1, updated my answer, sorry! – Andy Hayden Dec 18 '13 at 23:22
  • You are a total rock-star! Thank you! Accepted and upvoted answer. – horatio1701d Dec 18 '13 at 23:24
  • @AndyHayden can I trouble you for one final question before I put this to rest. This is probably incredibly simple but my iteration skills are a little weak. How would concat each dataframe it creates in the loop as one object? – horatio1701d Dec 19 '13 at 00:50
  • One way is to concat them together, tweaked... – Andy Hayden Dec 19 '13 at 00:55
  • I'm not sure how to correctly implement your suggestion. Are you recommending to use either append or concat? I get how the first concat we have will concat the chunks but it seems like now I have to wrap the whole thing in another concat. – horatio1701d Dec 19 '13 at 13:45
  • I think you can't do a list comprehension with a with inside it so I appended result to a list before concat-ing... I guess you could pull out the file_to_series as another function. – Andy Hayden Dec 19 '13 at 19:16

2 Answers

30

### Note: For those of you arriving at this question looking to parse json into pandas: if you do have valid json (this question doesn't), then you should use pandas' read_json function:

# can either pass string of the json, or a filepath to a file with valid json
In [99]: pd.read_json('[{"A": 1, "B": 2}, {"A": 3, "B": 4}]')
Out[99]:
   A  B
0  1  2
1  3  4

Check out the IO part of the docs for several examples, arguments you can pass to this function, as well as ways to normalize less structured json.
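For the "less structured json" case, a minimal sketch of `pd.json_normalize` flattening nested dicts, using made-up records shaped like this question's data (the values are assumptions for illustration):

```python
import pandas as pd

# Made-up records shaped like this question's data: a nested
# "current" dict alongside scalar and list fields.
records = [
    {"zipcode": "08989", "current": {"dewpt": 19, "feels_like": 34},
     "triggers": [53, 31]},
    {"zipcode": "08990", "current": {"dewpt": 19, "feels_like": 34},
     "triggers": []},
]

# json_normalize flattens nested dicts into dotted column names,
# e.g. current.dewpt and current.feels_like.
df = pd.json_normalize(records)
print(sorted(df.columns))
# ['current.dewpt', 'current.feels_like', 'triggers', 'zipcode']
```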

If you don't have valid json, it's often efficient to munge the string before reading in as json, for example see this answer.
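For data like this question's (back-to-back json objects that aren't valid as one document), one alternative to string munging is the standard library's `json.JSONDecoder.raw_decode`, which parses one object at a time and reports where it stopped. A sketch, with an inline sample string standing in for the file contents:

```python
import json

# Stand-in for the file contents: two concatenated objects, which is
# not valid as a single json document.
raw = '{"zipcode":"08989","triggers":[53,31]}\n{"zipcode":"08990","triggers":[]}'

decoder = json.JSONDecoder()
records, pos = [], 0
while pos < len(raw):
    # raw_decode parses one object starting at pos and returns the
    # index just past it.
    obj, pos = decoder.raw_decode(raw, pos)
    records.append(obj)
    # skip any whitespace between objects
    while pos < len(raw) and raw[pos].isspace():
        pos += 1

print([r['zipcode'] for r in records])  # ['08989', '08990']
```

Unlike chunking a fixed number of lines, this keeps working if an object happens to span a different number of lines.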

If you have several json files, you should concat the DataFrames together (similar to this answer):

pd.concat([pd.read_json(file) for file in ...], ignore_index=True)

### Original answer for this example:

Use a lookbehind in the regex for the separator passed to read_csv:

In [11]: df = pd.read_csv('foo.csv', sep=r'(?<!,)\s', header=None)

In [12]: df
Out[12]: 
       0               1                                                  2
0   8988  20131126102946                                                 []
1   8989  20131126102946  [53, 31, 9, 21, 48, 7, 40, 178, 55, 179, 176, ...
2   8988  20131126102946                                                 []
3   8989  20131126102946  [53, 31, 9, 21, 48, 7, 40, 178, 55, 179, 176, ...
4    544  20131126102946  [178, 30, 176, 103, 179, 112, 21, 20, 48, 7, 5...
5    601  20131126094911                                                 []
6    602  20131126101056                                                 []
7    603  20131126101056                                                 []
8    604  20131126101056                                                 []
9    544  20131126102946  [178, 30, 176, 103, 179, 112, 21, 20, 48, 7, 5...
10   601  20131126094911                                                 []
11   602  20131126101056                                                 []
12   603  20131126101056                                                 []
13   604  20131126101056                                                 []

[14 rows x 3 columns]
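Two caveats with this read_csv approach, both visible in the output above: the zipcodes lose their leading zeros unless column 0 is read as strings, and column 2 comes back as the *string* `'[53, 31, ...]'` rather than a list. A sketch of working around both (the inline sample stands in for `foo.csv`):

```python
import ast
import io
import pandas as pd

# Stand-in for foo.csv, in the same shape as the output above.
data = io.StringIO(
    '08988 20131126102946 []\n'
    '08989 20131126102946 [53, 31, 9]'
)

# dtype={0: str} keeps the leading zeros on the zipcodes; the regex
# separator requires the python engine.
df = pd.read_csv(data, sep=r'(?<!,)\s', header=None,
                 engine='python', dtype={0: str})

# column 2 is still a string like '[53, 31, 9]'; literal_eval turns
# it into a real list
df[2] = df[2].apply(ast.literal_eval)
print(df.loc[0, 0], df.loc[1, 2])  # 08988 [53, 31, 9]
```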

As mentioned in the comments, you may be able to do this more directly by concat-ing several Series together. It's also going to be a little easier to follow:

def series_chunk(chunk):
    jfile = json.loads(chunk)
    return pd.Series([jfile['zipcode'],
                      jfile['current']['proc_time'],
                      jfile['triggers']])

dfs = []
for fin in glob.glob('*.txt'):
    with open(fin) as f:
        df = pd.concat([series_chunk(chunk) for chunk in lines_per_n(f, 5)], axis=1)
        dfs.append(df.T)

df = pd.concat(dfs, ignore_index=True)

Note: You can also move the try/except into series_chunk.
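A sketch of what moving the try/except into series_chunk might look like: malformed chunks yield None and get filtered out before the concat (the inline chunks are made up for illustration):

```python
import json
import pandas as pd

# Malformed chunks return None instead of raising, so the caller can
# filter them out before concatenating.
def series_chunk(chunk):
    try:
        jfile = json.loads(chunk)
        return pd.Series([jfile['zipcode'],
                          jfile['current']['proc_time'],
                          jfile['triggers']])
    except (ValueError, KeyError):
        return None

# made-up chunks for illustration: one valid, one broken
chunks = [
    '{"zipcode": "08989", "current": {"proc_time": "20131126102946"},'
    ' "triggers": [53, 31]}',
    'not json at all',
]
series = [s for s in (series_chunk(c) for c in chunks) if s is not None]
df = pd.concat(series, axis=1).T
print(df.shape)  # (1, 3)
```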

Andy Hayden
  • But what if I'm not working from a csv? I need to process a large file of json objects and then produce values from the three keys in my question. I'm trying to bring the data into a dataframe for further processing. – horatio1701d Dec 17 '13 at 21:39
  • @prometheus2305 this is how to create a DataFrame from the output you gave (although not strictly a "csv"!). I suspect it's possible for you to concat some objects together more directly, but difficult without a [short example](http://sscce.org/) – Andy Hayden Dec 17 '13 at 21:43
  • Thank you. I've added all of the code and sample data. Would love to be able to concat the object outputs into a dataframe somehow. Any help would be appreciated. – horatio1701d Dec 17 '13 at 21:54
1
import json
import pandas as pd

Approach - 1 (Simple json to dataframe)

df = pd.read_json('data/simple.json')

Approach - 2 (Nested json to dataframe)

# read newline-delimited json records into a list of dicts
data = []
with open('data.json', 'r') as f:
    for line in f:
        data.append(json.loads(line))

df = pd.json_normalize(data, record_path=['annotation'], meta=['content', 'extras'])
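For reference, a self-contained sketch of that json_normalize call, using a hypothetical record with the `annotation`/`content`/`extras` keys this approach assumes (they are not part of the question's data):

```python
import pandas as pd

# Hypothetical record in the shape Approach 2 assumes: a list of
# dicts under "annotation", plus scalar "content" and "extras".
data = [{
    "content": "doc1",
    "extras": "none",
    "annotation": [{"label": "A", "start": 0},
                   {"label": "B", "start": 5}],
}]

# record_path explodes each annotation into its own row; meta copies
# the parent's content/extras onto every row.
df = pd.json_normalize(data, record_path=['annotation'],
                       meta=['content', 'extras'])
print(df.shape)  # (2, 4)
```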