I am reading in a large csv file (10GB+). The raw data loaded from the csv looks like:
SYMBOL DATE TIME PRICE CORR COND
0 BA 20090501 9:29:46 40.24 0 F
1 BA 20090501 9:29:59 40.38 0 F
2 BA 20090501 9:30:01 40.31 0 O
3 BA 20090501 9:30:01 40.31 0 Q
4 BA 20090501 9:30:08 40.38 0 F
My goal is to combine the DATE and TIME columns into a single DATE_TIME column when reading in the data via the read_csv function.
Loading the data first and doing it manually is not an option due to memory constraints.
Currently, I am using
data = pd.read_csv('200905.csv',
                   parse_dates=[['DATE', 'TIME']],
                   infer_datetime_format=True,
                   )
However, parsing with the default dateutil.parser.parse
as above roughly quadruples the loading time compared to just loading the raw csv.
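One common way to avoid dateutil's per-row format inference is to supply an explicit format string. A minimal sketch on a small in-memory sample (the sample data and post-load conversion are illustrative; for the real 10GB file the same conversion could be applied per chunk via read_csv's chunksize):

```python
import pandas as pd
from io import StringIO

# Small in-memory stand-in for the 10GB file (same column layout).
sample = StringIO(
    "SYMBOL,DATE,TIME,PRICE,CORR,COND\n"
    "BA,20090501,9:29:46,40.24,0,F\n"
    "BA,20090501,9:29:59,40.38,0,F\n"
)

# Keep DATE/TIME as plain strings, then convert with an explicit format,
# which skips dateutil's per-row inference entirely.
data = pd.read_csv(sample, dtype={'DATE': str, 'TIME': str})
data['DATE_TIME'] = pd.to_datetime(
    data['DATE'] + ' ' + data['TIME'], format='%Y%m%d %H:%M:%S'
)
```

This still parses every row individually, so it does not yet exploit the repeated dates.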
A promising approach is the lookup trick described in Pandas: slow date conversion, since my dataset contains many repeated dates.
However, my issue is: how do I best exploit the repeated structure of the DATE column while combining it with TIME into a DATE_TIME column (which is likely to have very few repeated entries)?
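One way to split the work: parse each distinct DATE exactly once through a lookup table, parse the TIME column vectorized as a timedelta (an offset from midnight), and add the two. A minimal sketch on an in-memory sample (the sample data is illustrative; for the full file this could again run per chunk via chunksize):

```python
import pandas as pd
from io import StringIO

# Small in-memory stand-in for the 10GB file (same column layout).
sample = StringIO(
    "SYMBOL,DATE,TIME,PRICE,CORR,COND\n"
    "BA,20090501,9:29:46,40.24,0,F\n"
    "BA,20090501,9:29:59,40.38,0,F\n"
    "BA,20090501,9:30:01,40.31,0,O\n"
)
data = pd.read_csv(sample, dtype={'DATE': str, 'TIME': str})

# Parse each distinct date only once, then map the results back
# onto the full (highly repetitive) DATE column.
unique_dates = data['DATE'].unique()
date_lookup = dict(zip(unique_dates,
                       pd.to_datetime(unique_dates, format='%Y%m%d')))
dates = data['DATE'].map(date_lookup)

# Times rarely repeat, so parse them vectorized as offsets from midnight.
times = pd.to_timedelta(data['TIME'])

data['DATE_TIME'] = dates + times
```

The cost of date parsing now scales with the number of distinct dates rather than the number of rows, while the mostly-unique TIME column takes the cheaper timedelta path.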