I have a Pandas dataframe that looks similar to this:
datetime data1 data2
2021-01-23 00:00:31.140 a1 a2
2021-01-23 00:00:31.140 b1 b2
2021-01-23 00:00:31.140 c1 c2
2021-01-23 00:01:29.021 d1 d2
2021-01-23 00:02:10.540 e1 e2
2021-01-23 00:02:10.540 f1 f2
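For reference, the sample above can be reproduced with a snippet along these lines. This is only a sketch: it assumes `datetime` is an ordinary column holding `datetime64` values (not the index) and that `data1` and `data2` are plain strings.

```python
import pandas as pd

# Sample frame mirroring the table above.
# Assumption: "datetime" is a regular column, not the index.
df = pd.DataFrame(
    {
        "datetime": pd.to_datetime(
            [
                "2021-01-23 00:00:31.140",
                "2021-01-23 00:00:31.140",
                "2021-01-23 00:00:31.140",
                "2021-01-23 00:01:29.021",
                "2021-01-23 00:02:10.540",
                "2021-01-23 00:02:10.540",
            ]
        ),
        "data1": ["a1", "b1", "c1", "d1", "e1", "f1"],
        "data2": ["a2", "b2", "c2", "d2", "e2", "f2"],
    }
)

print(df)
```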
The real dataframe is very large, with a few thousand rows for each unique timestamp.
I want to save this dataframe to a Parquet file so that I can quickly read all the rows that have a specific datetime index, without loading the whole file or looping through it. How do I save it correctly in Python and how do I quickly read only the rows for one specific datetime?
After reading, I would like to have a new dataframe that contains all the rows for that specific datetime. For example, I want to read only the rows for datetime "2021-01-23 00:00:31.140" from the Parquet file and receive this dataframe:
datetime data1 data2
2021-01-23 00:00:31.140 a1 a2
2021-01-23 00:00:31.140 b1 b2
2021-01-23 00:00:31.140 c1 c2
I am wondering whether it is first necessary to convert the data for each timestamp into its own column, as shown below, so that it can be accessed by reading a single column instead of filtering rows.
2021-01-23 00:00:31.140 2021-01-23 00:01:29.021 2021-01-23 00:02:10.540
['a1', 'a2'] ['d1', 'd2'] ['e1', 'e2']
['b1', 'b2'] NaN ['f1', 'f2']
['c1', 'c2'] NaN NaN
I appreciate any help, thank you very much in advance!