
I am reading in a large csv file (10GB+). The raw data loaded from the csv looks like:

  SYMBOL      DATE     TIME  PRICE  CORR COND
0     BA  20090501  9:29:46  40.24     0    F
1     BA  20090501  9:29:59  40.38     0    F
2     BA  20090501  9:30:01  40.31     0    O
3     BA  20090501  9:30:01  40.31     0    Q
4     BA  20090501  9:30:08  40.38     0    F

My goal is to combine the DATE and TIME columns into a single DATE_TIME column while reading in the data via the read_csv function.

Loading the data first and doing it manually is not an option due to memory constraints.

Currently, I am using

# current approach: let read_csv merge DATE and TIME and parse them into one DATE_TIME column
data = pd.read_csv('200905.csv',
                   parse_dates=[['DATE', 'TIME']],
                   infer_datetime_format=True,
                   )

However, using the default dateutil parser as above makes loading roughly 4x slower than reading the raw csv without date parsing.

A promising approach could be the lookup method described in Pandas: slow date conversion, since my dataset has a lot of repeated dates.

My question is: how do I best exploit the repeated structure of the DATE column while combining it with TIME into a DATE_TIME column (which is likely to have very few repeated entries)?
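
For concreteness, here is a rough sketch of the kind of thing I am imagining (untested; the chunk size and dtypes are placeholders): read the file in chunks without date parsing, convert each distinct DATE string only once via a lookup table, and add TIME as a timedelta. I am not sure this is the fastest way to exploit the repetition, hence the question.

import pandas as pd

# Rough sketch (untested): read in chunks to respect memory limits,
# parse each distinct DATE string once, then add TIME as a timedelta.
chunks = []
for chunk in pd.read_csv('200905.csv', chunksize=1_000_000,
                         dtype={'DATE': str, 'TIME': str}):
    # Build a lookup table so each unique date is parsed only once.
    unique_dates = chunk['DATE'].unique()
    date_lookup = dict(zip(unique_dates,
                           pd.to_datetime(unique_dates, format='%Y%m%d')))
    chunk['DATE_TIME'] = (pd.to_datetime(chunk['DATE'].map(date_lookup))
                          + pd.to_timedelta(chunk['TIME']))
    chunks.append(chunk.drop(columns=['DATE', 'TIME']))

data = pd.concat(chunks, ignore_index=True)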

hs605
  • I have no idea if this will actually work. But you could **not** parse dates and do it afterwards. With the `pd.to_datetime` function, you can pass a `format` argument that should speed up parsing **and** you can use the `cache=True` argument to take advantage of the repeated dates. Then your final column could be added to the dataframe. `pd.to_datetime(data.DATE, format='%Y%m%d', cache=True) + pd.to_timedelta(data.TIME)` – piRSquared Jun 11 '19 at 14:49
  • You can use the lookup method for both DATE and TIME separately and then add the two columns. – alec_djinn Jun 11 '19 at 14:53
  • This isn't an answer - nor one you want to hear. But whatever solution you go with will take some time. Better hardware will help, but test your code on sample data. Time it. See which is faster. Then run on your full data set and come back later - it won't be lightning speed – MattR Jun 11 '19 at 14:59
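
For reference, a minimal sketch of what piRSquared's comment above suggests, assuming DATE and TIME are read as plain strings (untested, and it still loads the whole file in one go, so it may need to be combined with chunked reading):

import pandas as pd

# Skip date parsing in read_csv, then combine afterwards.
# cache=True makes to_datetime reuse results for repeated dates.
data = pd.read_csv('200905.csv', dtype={'DATE': str, 'TIME': str})
data['DATE_TIME'] = (pd.to_datetime(data['DATE'], format='%Y%m%d', cache=True)
                     + pd.to_timedelta(data['TIME']))
data = data.drop(columns=['DATE', 'TIME'])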

0 Answers