1

I am trying to import a space separated .dat file using pandas and strip values to make a date. The data looks like this (three rows of data taken from the entire dataset for reference):

2.0140000e+003  1.0000000e+000  1.0000000e+000  0.0000000e+000  0.0000000e+000 0.0000000e+000  2.7454583e+000  1.8333542e+002 -3.3580352e+001
2.0140000e+003  1.0000000e+000  2.0000000e+000  0.0000000e+000  0.0000000e+000  0.0000000e+000 -6.1330625e+000  2.5187292e+002 -1.3752231e+001
2.0140000e+003  1.0000000e+000  3.0000000e+000  0.0000000e+000  0.0000000e+000  0.0000000e+000 -3.0905729e+001  2.1295208e+002 -2.4507273e+001

The first six numbers make up the date (year, month, day, hour, minute, second).

I can import the data using:

df = pd.read_csv('daily.dat', sep='\s+', header=None)

and it is separated fine.

However, I would like to strip the first six entries of the row into a date. For example, from row one the first six numbers (or first six columns after importing to df) should make:

2014-01-01 00:00:00

Help?

MaxU - stand with Ukraine
  • 205,989
  • 36
  • 386
  • 419
spencerchad
  • 155
  • 1
  • 2
  • 8
  • you're probably looking for a solution with [`to_datetime()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.to_datetime.html) – MattR Feb 07 '18 at 21:03

2 Answers2

3

Demo:

when you read a CSV/dat file without column names (header), you'll get a DF with numerical column names like as follows:

In [139]: df
Out[139]:
        0    1    2    3    4    5          6          7          8
0  2014.0  1.0  1.0  0.0  0.0  0.0   2.745458  183.33542 -33.580352
1  2014.0  1.0  2.0  0.0  0.0  0.0  -6.133063  251.87292 -13.752231
2  2014.0  1.0  3.0  0.0  0.0  0.0 -30.905729  212.95208 -24.507273

columns:

In [140]: df.columns
Out[140]: Int64Index([0, 1, 2, 3, 4, 5, 6, 7, 8], dtype='int64')

pd.to_datetime can assemble datetime from multiple columns:

Assembling a datetime from multiple columns of a DataFrame. The keys can be common abbreviations like [‘year’, ‘month’, ‘day’, ‘minute’, ‘second’, ‘ms’, ‘us’, ‘ns’]) or plurals of the same

In [141]: cols = ['year','month','day','hour','minute','second']

In [142]: df['date'] = pd.to_datetime(df.iloc[:, :6].rename(columns=lambda c: cols[c]))

dropping first 6 columns:

In [143]: df = df.iloc[:, 6:]

In [144]: df
Out[144]:
           6          7          8       date
0   2.745458  183.33542 -33.580352 2014-01-01
1  -6.133063  251.87292 -13.752231 2014-01-02
2 -30.905729  212.95208 -24.507273 2014-01-03

alternatively (thanks @Idlehands for the idea) we can drop it this way:

df = df.drop(columns=df.columns[:6])
MaxU - stand with Ukraine
  • 205,989
  • 36
  • 386
  • 419
  • 1
    Great answer. I was manually converting the `datetime` but this is much better. One comment is the last drop might be a bit hard to read for some, I would have used `df.drop(columns=list(range(6)), inplace=True)` which is a more verbose. – r.ook Feb 07 '18 at 21:38
  • 1
    @Idlehands, thank you for your comment - i've added it to the answer! – MaxU - stand with Ukraine Feb 07 '18 at 21:43
  • That works! But.. can you explain why you set 'cols'? Are these keys that are built into to_datetime? How come I cannot just send the columns I want to convert to datetime into to_datetime directly? Thanks! – spencerchad Feb 07 '18 at 21:53
  • @spencerchad, it's already in the answer - Pandas `pd.to_datetime()` expect specific column names - `The keys can be common abbreviations like [‘year’, ‘month’, ‘day’, ‘minute’, ‘second’, ‘ms’, ‘us’, ‘ns’]) or plurals of the same` – MaxU - stand with Ukraine Feb 07 '18 at 21:54
  • Right. That is what I wanted to clarify as I did not fully understand, that pd.to_datetime() requires an actual name of a column and you cannot just use a reference to the position. Thank you. – spencerchad Feb 07 '18 at 22:03
  • @MaxU @Idlehands Adding `df.drop(columns=df.columns[:6], inplace=True)` results in this error for me.. `--------------------------------------------------------------------------- TypeError Traceback (most recent call last) in () 3 df['date']=pd.to_datetime(df.iloc[:, :6].rename(columns=lambda c: cols[c])) 4 ----> 5 df.drop(columns=df.columns[0], inplace=True) TypeError: drop() got an unexpected keyword argument 'columns'' ` – spencerchad Feb 07 '18 at 22:08
  • 1
    @spencerchad, most probably you have an older Pandas version - i've tested it under Pandas 0.22.0 – MaxU - stand with Ukraine Feb 07 '18 at 22:09
0

you can try this:

import pandas as pd
from datetime import datetime

df = pd.read_csv('daily.dat', sep='\s+', header=None)

def to_datetime(year,month,day,hour,minute,second):
    return datetime(int(year),int(month),int(day),int(hour),int(minute),int(second))

df['datetime'] = df.apply(lambda x: to_datetime(x[0], x[1], x[2], x[3], x[4], x[5]), axis=1).apply(str)

df.drop(list(range(6)),1,inplace=True)

print(df)

# output:
#           6          7          8             datetime
#0   2.745458  183.33542 -33.580352  2014-01-01 00:00:00
#1  -6.133063  251.87292 -13.752231  2014-01-02 00:00:00
#2 -30.905729  212.95208 -24.507273  2014-01-03 00:00:00
romulomadu
  • 627
  • 6
  • 9