Organizing column and header data with pandas, python

Question

I'm having a go at using Numpy instead of Matlab, but I'm relatively new to Python.

My current challenge is importing data from multiple files in a sensible way so that I can use and plot it. The data is organized in columns (Temperature, Pressure, Time, etc.), with each file covering one measurement period, and I decided pandas was probably the best way to import it. I was thinking of using a top-level descriptor for each file and sub-descriptors for each column, along the lines of the question "Reading Multiple CSV Files into Python Pandas Dataframe".

The problem is that I'd like to retain and use some of the data in the header (for plotting, for instance). There are no column titles in it, just general information about the measurement, something like this:

 Flight ID: XXXXXX
 Date: 01-27-10  Time: 5:25:19
 OWNER
 Release Point: xx.304N  xx.060E  11 m
 Serial Number xxxxxx
 Surface Data:  985.1 mb   1.0 C 100%   1.0 m/s @ 308 deg.

I really don't know how to extract and store the data in a way that makes sense when combined with the data frame. Thought of perhaps a dictionary, but I'm not sure how to split the data efficiently since there's no consistent divider. Any ideas?

Are you asking to parse your header and use as column names? It's a little unclear what you desire, can you explain a little more — EdChum, Sep 10 '14 at 13:17
I'm not surprised, it's a very open question because I had no specific thoughts as to a good solution. I don't want it as headers, I already have those. I don't necessarily need it as part of the dataframe, I really don't see how I would organize that in a sensible way. I could go for something like a dictionary, as long as I can access the information when I need to use it in a plot or get more detail on the data I have in the dataframe, but I'm not sure how to extract the information. — user2207834, Sep 24 '14 at 15:38
You could add this to a df as an attribute : `df.flight_ID = 'XXXXXX'` etc.. however, if the df is copied, the attributes are **NOT** copied so you need to be careful — EdChum, Sep 24 '14 at 15:41

score 1 · Accepted Answer · edited May 23 '17 at 12:24

Looks like somebody is working with radiosondes...

When I pull in my radiosonde data I usually put it in a multi-level indexed dataframe. The levels could be of various forms and orders, but something like FLIGHT_NUM, DATE, ALTITUDE, etc. would make sense. Also, when working with sonde data I too want some additional information that does not necessarily need to be stored within the dataframe, so I store that as additional attributes. If I were to parse your file and then store it I would do something along the lines of this (yes, there are modifications that can be made to "improve" this):

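import pandas as pd

N_HEADER_ROWS = 6  # change to match the number of header rows in your files

# read only the header lines; the file is reopened by read_csv below
with open("filename.csv", 'r') as f:
    header = [next(f).rstrip('\n') for _ in range(N_HEADER_ROWS)]

# skip the header rows and read the remaining lines as the data table
data = pd.read_csv("filename.csv", skiprows=N_HEADER_ROWS, skipinitialspace=True, na_values=[-999,'Infinity','-Infinity'])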

# now you can parse your header to get out the necessary information
# continue until you have all the header info you want/need; e.g.
flight = header[0].split(': ')[1]
date = header[1].split(': ')[1].split()[0]  # split() on whitespace; split('') raises ValueError
time = header[1].split(': ')[2]

# a lot of the header information will get stored as metadata for me.  
# most likely you want more than flight number and date in your metadata, but you get the point.
data.metadata = {'flight':flight,
                 'date':date}
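
Since the header has no consistent divider, chained split calls get fragile quickly. A minimal sketch of an alternative, assuming the header looks like the sample in the question: one regular expression per field, collected into a dictionary. The pattern table, the field names and the parse_header helper are hypothetical and only for illustration; adjust them to your real files.

```python
import re

# Sample header copied from the question.
SAMPLE_HEADER = """\
 Flight ID: XXXXXX
 Date: 01-27-10  Time: 5:25:19
 OWNER
 Release Point: xx.304N  xx.060E  11 m
 Serial Number xxxxxx
 Surface Data:  985.1 mb   1.0 C 100%   1.0 m/s @ 308 deg.
"""

# One hypothetical pattern per field of interest.
HEADER_PATTERNS = {
    "flight": re.compile(r"Flight ID:\s*(\S+)"),
    "date": re.compile(r"Date:\s*(\S+)"),
    "time": re.compile(r"Time:\s*(\S+)"),
    "serial": re.compile(r"Serial Number\s+(\S+)"),
    "surface_pressure_mb": re.compile(r"Surface Data:\s*([\d.]+)\s*mb"),
    "surface_temp_c": re.compile(r"mb\s+(-?[\d.]+)\s*C"),
}

# Fields that should be converted from text to float.
NUMERIC_FIELDS = {"surface_pressure_mb", "surface_temp_c"}


def parse_header(text):
    """Return a dict with every field whose pattern matches the header text."""
    metadata = {}
    for name, pattern in HEADER_PATTERNS.items():
        match = pattern.search(text)
        if match is None:
            continue  # field missing in this file: skip it rather than fail
        value = match.group(1)
        metadata[name] = float(value) if name in NUMERIC_FIELDS else value
    return metadata


metadata = parse_header(SAMPLE_HEADER)
```

Fields that are absent from a given file are simply left out of the dictionary, so a lookup with metadata.get('serial') stays safe across files with slightly different headers.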

I presume you have a date/time column (call it "dates" here) within your file, so you can use that to re-index your dataframe. If you choose to use different variables within your multi-level index then the same method applies.

new_index  = [(data.metadata['flight'],r) for r in data.dates]
data.index = pd.MultiIndex.from_tuples(new_index)

You now have a multi-level indexed dataframe.
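
If you end up with one dataframe per file, another way to get the same kind of index is pd.concat with a dictionary of frames, which uses the dictionary keys as the outer level. This is a sketch with made-up values, not the method used above; the flight names, the make_flight helper and the column names are assumptions for illustration.

```python
import pandas as pd


def make_flight(times, pressures):
    """Build a tiny stand-in for one parsed file, indexed by its time column."""
    frame = pd.DataFrame({"dates": pd.to_datetime(times), "Pressure": pressures})
    return frame.set_index("dates")


# Hypothetical flights keyed by flight number.
frames = {
    "FLIGHT_A": make_flight(["2010-01-27 05:25:19", "2010-01-27 05:25:20"], [985.1, 984.9]),
    "FLIGHT_B": make_flight(["2010-01-28 06:00:00", "2010-01-28 06:00:01"], [990.3, 990.1]),
}

# The dict keys become the outer index level, the original index the inner one.
combined = pd.concat(frames, names=["flight", "dates"])

# Select a single flight, or aggregate per flight.
one_flight = combined.loc["FLIGHT_A"]
mean_pressure = combined.groupby(level="flight")["Pressure"].mean()
```

Selecting with combined.loc on the outer level returns that flight's rows indexed by time only, which is convenient for plotting one measurement period at a time.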

Now, regarding your "metadata". EdChum makes an excellent point that if you copy "data" you will NOT copy over the metadata dictionary. Also, if you save "data" to disk via data.to_pickle you will lose your metadata (more on this later). If you want to keep your metadata you have a couple of options.

1. Save the data on a flight-by-flight basis. This will allow you to store metadata for each individual flight's file.
2. Assuming you want to have multiple flights within one saved file: you can add additional columns to your dataframe that hold that information (i.e. another column for flight number, another column for surface temperature, etc.), though this will increase the size of your saved file.
3. Assuming you want to have multiple flights within one saved file (alternative to the previous option): you can make your metadata dictionary "keyed" by flight number, e.g.

data.metadata = {FLIGHT1:{'date':date}, FLIGHT2:{'date':date}}
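
A small sketch of how such a keyed dictionary might be used when plotting, assuming string flight numbers and a couple of made-up values. The plot_title helper and the surface_pressure_mb key are hypothetical.

```python
# Hypothetical flight numbers and values, for illustration only.
metadata = {
    "FLIGHT1": {"date": "01-27-10", "surface_pressure_mb": 985.1},
    "FLIGHT2": {"date": "01-28-10", "surface_pressure_mb": 990.3},
}


def plot_title(flight):
    """Build a plot title from the metadata stored for one flight."""
    info = metadata[flight]
    return f"{flight} on {info['date']} (surface {info['surface_pressure_mb']} mb)"


title = plot_title("FLIGHT1")
```

Because the outer key is the flight number, the same value that selects rows in the multi-level index can be used to look up the matching metadata.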

Now to store the metadata. Check out my IO class on storing additional attributes within an h5 file posted here.
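
As a side note on the copying caveat above: pandas 1.0 and later provide DataFrame.attrs, a dictionary documented as experimental that copy() carries over. The following sketch, which assumes such a pandas version, contrasts it with an ad hoc attribute like the one suggested in the comments. Not every pandas operation is guaranteed to propagate attrs, so treat it as a convenience rather than a storage format.

```python
import pandas as pd

df = pd.DataFrame({"Pressure": [985.1, 984.9]})

# Ad hoc attribute, as suggested in the comments: copy() does not carry it over.
df.flight_ID = "XXXXXX"
copied = df.copy()
ad_hoc_survives = hasattr(copied, "flight_ID")

# DataFrame.attrs (assumes pandas >= 1.0): copy() propagates this dictionary.
df.attrs["flight"] = "XXXXXX"
attrs_survive = df.copy().attrs.get("flight") == "XXXXXX"
```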

Your question was quite broad, so you got a broad answer. I hope this was helpful.
