
Say I need to have data stored as follows:

[[[{}][{}]]]

or a list of lists of two lists of dictionaries

where:

{}: dictionaries containing data from individual frames observing an event. (There are two observers/stations, hence two dictionaries.)

[{}][{}]: two lists of all the individual frames related to a single event, one from each observer/station.

[[{}][{}]]: list of all events on a single night of observation.

[[[{}][{}]]]: list of all nights.
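For concreteness, a minimal made-up instance of this structure (the field names here are invented; my real frame dictionaries have more keys) might look like:

```python
# One night, holding the two per-station lists of frame dictionaries.
# Field names ("###", "mag") are invented for illustration.
data = [
    [                                                      # night 0
        [{"###": 1, "mag": 4.2}, {"###": 2, "mag": 4.1}],  # frames from station_1
        [{"###": 1, "mag": 4.3}, {"###": 2, "mag": 4.0}],  # frames from station_2
    ],
]
```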

Hopefully that's clear. What I want to do is create two pandas DataFrames: one holding all the dictionaries from station_1, the other holding all the dictionaries from station_2.

My current method is as follows (where data is the above data structure):

import pandas as pd

all_station_1 = []
all_station_2 = []

for night in range(len(data)):

    station_1 = pd.DataFrame(data[night][0])
    station_2 = pd.DataFrame(data[night][1])

    all_station_1.append(station_1)
    all_station_2.append(station_2)

all_station_1 = pd.concat(all_station_1)
all_station_2 = pd.concat(all_station_2)
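(For reference, here is that loop made self-contained with a tiny hypothetical dataset, so it can be run as-is; the field names are invented:)

```python
import pandas as pd

# Tiny hypothetical dataset: one night, two stations, two frame dicts each.
data = [
    [
        [{"###": 1, "mag": 4.2}, {"###": 2, "mag": 4.1}],  # station_1 frames
        [{"###": 1, "mag": 4.3}, {"###": 2, "mag": 4.0}],  # station_2 frames
    ],
]

all_station_1 = []
all_station_2 = []
for night in range(len(data)):
    all_station_1.append(pd.DataFrame(data[night][0]))
    all_station_2.append(pd.DataFrame(data[night][1]))

# One concat call per station, after the loop.
all_station_1 = pd.concat(all_station_1)
all_station_2 = pd.concat(all_station_2)
```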

My understanding, though, is that the for loop must be horribly inefficient. Since I will be scaling this script way up from my sample dataset, that cost could easily become unmanageable.

So, any advice on a smarter way of proceeding would be appreciated! I feel like pandas is so user-friendly that there's got to be an efficient way of dealing with any kind of data structure, but I haven't been able to find it on my own yet. Thanks!

Sam L.

1 Answer


I don't think you can really avoid using a loop here, unless you want to invoke jq via a shell call.

Anyway, using your full sample, I managed to parse it into a multi-indexed dataframe, which I assume is what you want.

import datetime
import re
import json
import pandas as pd

with open('datasample.txt', 'r') as f:
    data = f.readlines()
# There's only one line in the file
data = data[0]

# Replace single quotes with double quotes: I did that in the .txt file itself, but you could do it using re

# Fix the datetime problem
cleaned_data = re.sub(r'(datetime.datetime\(.*?\))', lambda x: '"'+ str(eval(x.group(0)).isoformat())+'"', data)
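For example, on a hypothetical one-line sample (the field name is invented, but the datetime value is taken from your file), the substitution turns the `datetime.datetime(...)` call into a quoted ISO string, and a naive `str.replace` then handles the single quotes:

```python
import datetime
import re

# Invented one-entry sample; the real file holds the whole nested structure.
raw = "{'DateTime': datetime.datetime(2011, 12, 13, 22, 15, 37, 880000)}"

cleaned = re.sub(
    r'(datetime.datetime\(.*?\))',
    lambda x: '"' + str(eval(x.group(0)).isoformat()) + '"',
    raw,
)
# cleaned is now: {'DateTime': "2011-12-13T22:15:37.880000"}

# Swap the remaining single quotes for double quotes. This naive replacement
# is only safe if the data itself contains no apostrophes.
cleaned = cleaned.replace("'", '"')
```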

Now that the string from the file is valid json, we can load it:

json_data = json.loads(cleaned_data)

And we can process it into a dataframe:

# List to store the dfs before concat
all_ = []
for n, night in enumerate(json_data):
    for s, station in enumerate(night):
        events = pd.DataFrame(station)
        # Set index to the event number
        events = events.set_index('###')
        # Prepend night number and station number to index
        events.index = pd.MultiIndex.from_tuples([(n, s, x) for x in events.index])
        all_.append(events)

df_all = pd.concat(all_)
# Rename the index levels
df_all.index.names = ['Night','Station','Event']
# Convert to datetime
df_all.DateTime = pd.to_datetime(df_all.DateTime)
df_all

(Truncated) Result:

[screenshot of the resulting multi-indexed dataframe]
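From there, pulling out one dataframe per station (your original goal) is a cross-section on the `Station` level. A sketch, rebuilding a tiny `df_all`-shaped frame with made-up values:

```python
import pandas as pd

# Rebuild a tiny multi-indexed frame shaped like df_all above (values invented).
idx = pd.MultiIndex.from_tuples(
    [(0, 0, 1), (0, 0, 2), (0, 1, 1)],
    names=['Night', 'Station', 'Event'],
)
df_all = pd.DataFrame({'mag': [4.2, 4.1, 4.3]}, index=idx)

# One dataframe per station: xs() selects on the 'Station' level and drops it.
station_1 = df_all.xs(0, level='Station')
station_2 = df_all.xs(1, level='Station')
```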

Julien Marrec
  • Thank you very much for your time! I follow the majority of what's going on, but I'd like to ask two further questions: 1. What is the problem that you are fixing with datetime? I can maybe fix it at the source, since I have access to the script that is used to prepare the data files. 2. How can I replace the single quotes using re? I'm not sure how to avoid my quotes being interpreted as strings. Thanks again! – Sam L. Nov 17 '16 at 13:42
  • It would help if you could format datetime as an ISO string initially. Here I have to eval the datetime.datetime(...) into an ISO string for JSON. In your file there's `datetime.datetime(2011, 12, 13, 22, 15, 37, 880000)`; I convert that to `"2011-12-13T22:15:37.880000"`. Also, JSON in strict form requires quotes to be double quotes " not single quotes ', so you could change that in your script that prepares the data. Otherwise you can check [this question](http://stackoverflow.com/questions/4033633/handling-lazy-json-in-python-expecting-property-name) to do it in Python after the fact. – Julien Marrec Nov 17 '16 at 13:58
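A sketch of fixing this at the source, as suggested in the comment above: serialize with `json.dumps` and a `default` hook that converts datetime objects to ISO strings, which also produces strictly valid double-quoted JSON. (The helper name is invented.)

```python
import datetime
import json

def encode_default(obj):
    # Hypothetical helper: turn datetime objects into ISO-8601 strings.
    if isinstance(obj, datetime.datetime):
        return obj.isoformat()
    raise TypeError(repr(obj) + " is not JSON serializable")

record = {"DateTime": datetime.datetime(2011, 12, 13, 22, 15, 37, 880000)}
text = json.dumps(record, default=encode_default)
# text == '{"DateTime": "2011-12-13T22:15:37.880000"}'
```

Written this way, the file can later be read back with plain `json.loads`, with no regex or eval cleanup needed.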