I'm working with data files that have comments before the actual data with parameter information. The last comment always contains the column headers but the number of comments is variable.
Some example data files:
# Model type: AE9
# Particle species: e-
# Flux type: 1PtDiff
# Flux direction: Omnidirectional
# Energy levels (MeV): 0.04 0.07 0.1 0.25 0.5 0.75 1 1.5
#
# Accumulation mode: Interval, length = 86400 seconds (1 days)
#
# Time format: Year, day_of_year, GMT_seconds_of_day
# Coordinate system: GDZ (WGS84 Geodetic) in km
# Data Delimiter: comma
#
# datetime(year,ddd,gmtsec),alt(km),lat(deg),lon(deg),fluence
2015,1,43200.00,0.00000,0.000000,0.000000,2.73859496e+012
Another:
# Model type: AP9
# Particle species: H+
# Flux type: 1PtDiff
# Flux direction: Omnidirectional
# Energy levels (MeV): 0.1 0.2 0.4 0.6
#
# Time format: Year, Month, Day, Hour, Minute, Seconds
# Coordinate system: GDZ (WGS84 Geodetic) in km
# Data Delimiter: comma
#
# datetime(year,mon,day,hr,min,sec),alt(km),lat(deg),lon(deg),flux
2015,1,1,0,0,0.01,1464.49524,0.000000,259.862461,1.77883217e+004
An example with only one column corresponding to datetime information:
# Model type: AP9
# Particle species: H+
# Flux type: 1PtDiff
# Flux direction: Omnidirectional
# Energy levels (MeV): 0.1 0.2 0.4 0.6
#
# Time format: Modified Julian Date
# Coordinate system: GDZ (WGS84 Geodetic) in km
# Data Delimiter: comma
#
# datetime(mjd),alt(km),lat(deg),lon(deg),flux
57023.00000000,1464.49524,0.000000,259.862461,1.77883219e+004
When I create a dataframe like this:
data_frame = pd.read_csv(
file_name,
skiprows = params_dict['Header Line'],
sep = ',',
escapechar = '#',
memory_map = True,
index_col = 0
)
it returns the dataframe as desired only if there are no commas (the delimiter) inside the datetime parentheses such as this header line:
# datetime(mjd),alt(km),lat(deg),lon(deg),flux
In the dataframe, the output for a datetime without commas looks like this:
alt(km) lat(deg) lon(deg) flux
datetime(mjd)
57023.000000 1464.49524 0.000000 259.862461 17788.321900
It goes haywire with ones such as # datetime(year,ddd,gmtsec)
such as this:
ddd gmtsec) alt(km) lat(deg) lon(deg) fluence
datetime(year
2015 1 43200.0 0.0 0.0 0.0 1.937139e+09
Obviously, it's reading each part of the datetime as another column name because of the delimiter.
The longest datetime type has six columns parts like so:
# datetime(year,mon,day,hr,min,sec)
Desired output:
alt(km) lat(deg) lon(deg) fluence
datetime(year,ddd,gmtsec)
2015,1,43200.0 0.0 0.0 0.0 1.937139e+09
How can I ignore delimiters inside parentheses in the column headers line while calling pd.read_csv()
? I cannot alter the data files themselves. My goal is to parse the column names so that datetime(year,ddd,gmtsec)
or datetime(year,mon,day,hr,min,sec)
is read as a single column header. I would always like to make my datetime columns to be the index of my dataframe.
My guess is the answer may be found in a regular expression but I do not know how to implement it. Thank you in advance.