
I'm working with data files that begin with comment lines containing parameter information before the actual data. The last comment always contains the column headers, but the number of comment lines varies.

Some example data files:

# Model type:               AE9
# Particle species:         e-
# Flux type:                1PtDiff
# Flux direction:           Omnidirectional
# Energy levels (MeV):      0.04 0.07 0.1 0.25 0.5 0.75 1 1.5
#
# Accumulation mode:        Interval, length = 86400 seconds (1 days)   
#
# Time format:              Year, day_of_year, GMT_seconds_of_day
# Coordinate system:        GDZ (WGS84 Geodetic) in km
# Data Delimiter:           comma
#
# datetime(year,ddd,gmtsec),alt(km),lat(deg),lon(deg),fluence
2015,1,43200.00,0.00000,0.000000,0.000000,2.73859496e+012

Another:

# Model type:               AP9
# Particle species:         H+
# Flux type:                1PtDiff
# Flux direction:           Omnidirectional
# Energy levels (MeV):       0.1 0.2 0.4 0.6
#
# Time format:              Year, Month, Day, Hour, Minute, Seconds
# Coordinate system:        GDZ (WGS84 Geodetic) in km
# Data Delimiter:           comma
#
# datetime(year,mon,day,hr,min,sec),alt(km),lat(deg),lon(deg),flux
2015,1,1,0,0,0.01,1464.49524,0.000000,259.862461,1.77883217e+004

An example with only one column corresponding to datetime information:

# Model type:               AP9
# Particle species:         H+
# Flux type:                1PtDiff
# Flux direction:           Omnidirectional
# Energy levels (MeV):       0.1 0.2 0.4 0.6
#
# Time format:              Modified Julian Date
# Coordinate system:        GDZ (WGS84 Geodetic) in km
# Data Delimiter:           comma
#
# datetime(mjd),alt(km),lat(deg),lon(deg),flux
57023.00000000,1464.49524,0.000000,259.862461,1.77883219e+004

When I create a dataframe like this:

data_frame = pd.read_csv(
                 file_name,
                 skiprows    = params_dict['Header Line'],
                 sep         = ',',
                 escapechar  = '#',
                 memory_map  = True,
                 index_col   = 0
)

it returns the dataframe as desired only if there are no commas (the delimiter) inside the datetime parentheses, such as in this header line:

# datetime(mjd),alt(km),lat(deg),lon(deg),flux

In the dataframe, the output for a datetime without commas looks like this:

                               alt(km)   lat(deg)    lon(deg)          flux
datetime(mjd)
57023.000000                1464.49524   0.000000  259.862461  17788.321900

It goes haywire with headers such as # datetime(year,ddd,gmtsec), producing this:

                ddd  gmtsec)   alt(km)  lat(deg)  lon(deg)       fluence  
 datetime(year
 2015              1  43200.0      0.0       0.0       0.0  1.937139e+09

Obviously, it's reading each part of the datetime as another column name because of the delimiter.

The longest datetime type has six parts, like so:

# datetime(year,mon,day,hr,min,sec)

Desired output:

                            alt(km)  lat(deg)  lon(deg)       fluence  
 datetime(year,ddd,gmtsec) 
 2015,1,43200.0                 0.0       0.0       0.0  1.937139e+09

How can I ignore delimiters inside parentheses in the column header line while calling pd.read_csv()? I cannot alter the data files themselves. My goal is to parse the column names so that datetime(year,ddd,gmtsec) or datetime(year,mon,day,hr,min,sec) is read as a single column header. I would always like my datetime columns to be the index of my dataframe.

My guess is the answer may be found in a regular expression but I do not know how to implement it. Thank you in advance.

vaponteblizzard
  • Are you trying to parse the header? I'm confused about how the comments come into this. Are there more rows below the header that also have this delimiter-inside-parens behavior? If so, could we see a sample? In general, make sure your [example](https://stackoverflow.com/a/20159305/3888719) has just enough of a sample for us to fully understand the technical problem, and if it's relevant, some representation of what you would like to see is also helpful. Thanks so much! – Michael Delgado Jun 12 '20 at 16:33
  • Thank you for the feedback, I added more specific context! – vaponteblizzard Jun 12 '20 at 16:45
  • I still don't understand what the values look like. Can you post rows with actual data which demonstrate the problem? – Michael Delgado Jun 12 '20 at 17:18
  • Meant to change my placeholder data, it's fixed now. – vaponteblizzard Jun 12 '20 at 17:36
  • what's wrong with the last snippet (the one in which ddd is one of the columns?) It would be great if you could include: (1) sample input; (2) current output; (3) expected output. – Roy2012 Jun 12 '20 at 18:33
  • Why not concat the first three fields AFTER you read them from the input? – Roy2012 Jun 12 '20 at 18:53
  • @Roy2012 The datetime formats are variable, the file could have up to six datetime fields depending. I've added this note to the question. – vaponteblizzard Jun 12 '20 at 19:50
  • I think I have an answer but it will take me a little bit – vaponteblizzard Jun 12 '20 at 20:16
  • If you can't predict the number of fields in the datetime "column" and there's no other systematic demarcation of datetime vs. non-datetime columns then there's nothing we can do within pandas. You could read the file line by line with the csv module and assign `row[:-4]` to datetime and `row[-4:]` to the other fields. – Michael Delgado Jun 12 '20 at 20:16
  • @vaponteblizzard - could you please add sample data that demonstrates the issue with the different date formats? – Roy2012 Jun 13 '20 at 03:24
  • Agreed. Do you have multiple files, each with different formats, but within each file the format is internally consistent with the header? Please post multiple valid examples of the types of situations you're trying to handle. It's an interesting question and I want to help out but really don't know exactly what the problem is - thanks! :) – Michael Delgado Jun 14 '20 at 00:59
  • Hi everyone, I have added data file examples showing the header lines in context. I hope these clarify the situation! @MichaelDelgado, correct. The format will always be consistent with the header but the header will vary depending on the file. – vaponteblizzard Jun 16 '20 at 17:19
  • ahhhhh got it. thanks so much for the additional examples! follow-up question - do you really need the index to be a string with comma-separated values? or do you want this converted to a [`pd.DatetimeIndex`](https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#indexing) or a [`pd.MultiIndex`](https://pandas.pydata.org/pandas-docs/stable/user_guide/advanced.html#multiindex-advanced-indexing)? Those would have significantly better performance and features compared with the string index in your example. – Michael Delgado Jun 16 '20 at 19:09
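The csv-module approach suggested in the comments could be sketched like this. It is only a sketch, and it assumes the last four fields of every data row are always alt, lat, lon, and the flux/fluence value, with everything before them belonging to the datetime:

```python
import csv
import io

# Inline stand-in for a data file; in practice this would be open(file_name)
# with the comment lines skipped first.
sample = io.StringIO("2015,1,43200.00,0.00000,0.000000,0.000000,2.73859496e+012\n")

rows = []
for row in csv.reader(sample):
    date_part = ",".join(row[:-4])            # everything before the last four fields
    alt, lat, lon, value = (float(x) for x in row[-4:])
    rows.append((date_part, alt, lat, lon, value))

print(rows[0][0])   # "2015,1,43200.00"
```

This sidesteps pandas entirely for the parsing step, at the cost of rebuilding the dataframe yourself afterwards.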

1 Answer

I already had a function grabbing all the parameters from a file, so I added a regex to find the first pair of parentheses in my header line. I counted how many commas were inside those parentheses to determine how many datetime columns the file has. I then used that count to build a list of ordinal indices to select from data_frame.columns later. Finally, I set my dataframe's index to all the datetime columns, creating a MultiIndex.
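To illustrate, using the three header lines from the question: the regex captures the contents of the first parenthesized group, so counting its commas gives the number of datetime fields.

```python
import re

# Header lines from the example files in the question
headers = [
    "# datetime(year,ddd,gmtsec),alt(km),lat(deg),lon(deg),fluence",
    "# datetime(year,mon,day,hr,min,sec),alt(km),lat(deg),lon(deg),flux",
    "# datetime(mjd),alt(km),lat(deg),lon(deg),flux",
]

for line in headers:
    # findall returns every parenthesized group; [0] is the datetime one,
    # since it always appears first in the header
    inner = re.findall(r'\(((?:\w+,*)+)\)', line)[0]
    print(inner, '->', inner.count(',') + 1, 'datetime column(s)')
```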

Input:

import re

import pandas as pd

def get_params():
    params_dict = {}

    with open('data.csv') as current_file:
        for i, line in enumerate(current_file):
            if ':' in line:
                dict_key   = line.split(':')[0].strip(' #')
                dict_value = line.split(':')[1].strip()

                params_dict[dict_key] = dict_value

            if 'datetime' in line:
                # Contents of the first parenthesized group, e.g. 'year,ddd,gmtsec'
                datetime_cols = re.findall(r'\(((?:\w+,*)+)\)', line)[0]
                dt_cols_count = datetime_cols.count(',')

                # Ordinal indices of the datetime columns in the dataframe
                ind_list = list(range(dt_cols_count + 1))
                params_dict['Index List'] = ind_list

                # Header line location: number of rows to skip when creating the dataframe
                params_dict['Header Line'] = i

                return params_dict

params_dict = get_params()

data_frame = pd.read_csv(
                 'data.csv',
                 skiprows    = params_dict['Header Line'],
                 sep         = ',',
                 escapechar  = '#',
                 memory_map  = True,
)

data_frame.set_index(list(data_frame.columns[params_dict['Index List']]), inplace = True)

print(data_frame)

Output:

                                       alt(km)   lat(deg)    lon(deg)          flux
datetime(year  mon  day  hr min sec)
2015           1   1   0  0   0.01  1464.49524   0.000000  259.862461  17788.321700

This works for all the datetime formats I am using but I am sure it is far from the best solution.
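If a single string index like the desired output in the question is preferred over the MultiIndex, the datetime levels could be joined back together afterwards. A sketch, using a small hand-built frame in place of the one read from the file:

```python
import pandas as pd

# Hand-built stand-in for the MultiIndexed frame produced above
data_frame = pd.DataFrame(
    {"alt(km)": [0.0], "lat(deg)": [0.0], "lon(deg)": [0.0],
     "fluence": [2.73859496e12]},
    index=pd.MultiIndex.from_tuples([(2015, 1, 43200.0)],
                                    names=["year", "ddd", "gmtsec"]),
)

# Join the index levels into one comma-separated string per row,
# matching the datetime(year,ddd,gmtsec) style from the question
flat = data_frame.copy()
flat.index = [",".join(str(v) for v in tup) for tup in data_frame.index]
flat.index.name = "datetime(" + ",".join(data_frame.index.names) + ")"
```

A pd.DatetimeIndex, as suggested in the comments, would likely perform better than either option, but it requires interpreting each time format rather than treating the fields as opaque strings.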

vaponteblizzard