
I have a Pandas dataframe that looks similar to this:

datetime                 data1  data2
2021-01-23 00:00:31.140     a1     a2
2021-01-23 00:00:31.140     b1     b2       
2021-01-23 00:00:31.140     c1     c2
2021-01-23 00:01:29.021     d1     d2
2021-01-23 00:02:10.540     e1     e2
2021-01-23 00:02:10.540     f1     f2

The real dataframe is very large, and for each unique timestamp there are a few thousand rows.

I want to save this dataframe to a Parquet file so that I can quickly read all the rows that have a specific datetime index, without loading the whole file or looping through it. How do I save it correctly in Python and how do I quickly read only the rows for one specific datetime?

After reading, I would like to have a new dataframe that contains all the rows for that specific datetime. For example, I want to read only the rows for datetime "2021-01-23 00:00:31.140" from the Parquet file and receive this dataframe:

datetime                 data1  data2
2021-01-23 00:00:31.140     a1     a2
2021-01-23 00:00:31.140     b1     b2        
2021-01-23 00:00:31.140     c1     c2

I am wondering if it might first be necessary to convert the data for each timestamp into a column, like this, so that it can be accessed by reading a column instead of rows:

2021-01-23 00:00:31.140  2021-01-23 00:01:29.021  2021-01-23 00:02:10.540
           ['a1', 'a2']             ['d1', 'd2']             ['e1', 'e2']
           ['b1', 'b2']                      NaN             ['f1', 'f2']
           ['c1', 'c2']                      NaN                      NaN

       

I appreciate any help, thank you very much in advance!

Mike

1 Answer


One solution is to index your data by time and use dask; here's an example:

import dask
import dask.dataframe as dd

# generate a synthetic timeseries: one row per second for one day,
# split into hourly partitions
df = dask.datasets.timeseries(
    start='2000-01-01',
    end='2000-01-02',
    freq='1s',
    partition_freq='1h')

df  # displays the lazy dask dataframe (structure only, nothing is loaded yet)

print(len(df))
# 86400 rows across 24 files/partitions

%%time
df.loc['2000-01-01 03:40'].compute()
# result returned in about 8 ms (%%time is a Jupyter cell magic)

Working with a transposed dataframe like you suggest is not optimal, since you would end up with thousands of columns (if not more), each unique to a single file/partition.
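
To see why, here is a small sketch (plain pandas, using the sample data from the question and only the data1 column) of what that pivot would look like; on the real data the number of columns would equal the number of unique timestamps:

import pandas as pd

df = pd.DataFrame({
    'datetime': pd.to_datetime([
        '2021-01-23 00:00:31.140', '2021-01-23 00:00:31.140',
        '2021-01-23 00:00:31.140', '2021-01-23 00:01:29.021',
        '2021-01-23 00:02:10.540', '2021-01-23 00:02:10.540']),
    'data1': ['a1', 'b1', 'c1', 'd1', 'e1', 'f1'],
})

# number each row within its timestamp group, then pivot:
# one column per unique timestamp (3 here, millions on the real data)
wide = (df.assign(row=df.groupby('datetime').cumcount())
          .pivot(index='row', columns='datetime', values='data1'))
print(wide.shape)  # (3, 3)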

So on your data the workflow would look roughly like this:

import io
data = io.StringIO("""
datetime|data1|data2
2021-01-23 00:00:31.140|a1|a2
2021-01-23 00:00:31.140|b1|b2       
2021-01-23 00:00:31.140|c1|c2
2021-01-23 00:01:29.021|d1|d2
2021-01-23 00:02:10.540|e1|e2
2021-01-23 00:02:10.540|f1|f2""")

import pandas as pd
df = pd.read_csv(data, sep='|', parse_dates=['datetime'])

# make sure the datetime column was parsed correctly before
# setting it as the index
df = df.set_index('datetime')
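# quick sanity check: the index should now be datetime64[ns], not object/strings
print(df.index.dtype)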


import dask.dataframe as dd
ddf = dd.from_pandas(df, npartitions=3)

ddf.to_parquet('test_parquet')
# note this will create a folder with one file per partition
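# for example, listing the output folder (exact file names depend on the
# dask version, but there is roughly one part file per partition):
import os
print(sorted(os.listdir('test_parquet')))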

ddf2 = dd.read_parquet('test_parquet')


ddf2.loc['2021-01-23 00:00:31'].compute()

# to match an exact timestamp (down to the millisecond), convert it
# to a pandas Timestamp first
ts_exact = pd.to_datetime('2021-01-23 00:00:31.140')
df_new = ddf2.loc[ts_exact].compute()  # plain pandas DataFrame with just those rows
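
If you want the timestamp filter applied while the file is being read, so that only matching row groups/partitions are loaded from disk, `dd.read_parquet` also accepts a `filters` argument. A minimal sketch, assuming the index was written to the file under its column name `datetime` (the default for a named index):

# predicate pushdown: only the pieces of the dataset containing this
# exact timestamp are read from disk
ddf3 = dd.read_parquet('test_parquet',
                       filters=[('datetime', '==', ts_exact)])
df_filtered = ddf3.compute()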
SultanOrazbayev
  • Thank you very much for your suggestion, I have tested it. If I understand you correctly, I get one Parquet file for each unique timestamp? That would create over 2 million files in my case, which I think is too much for the file system... If I did it like this, I would not need Parquet; I could just create a simple CSV file with the data for each timestamp. But that would be too many files... My use case is that I only need to access a small subset of the timestamps. Maybe a database would be a better solution then? – Mike Mar 07 '21 at 09:18
  • No, the number of unique stamps per partition is determined by `npartitions=3`, but if you want to explicitly control how the partitioning is done you can use the `partition_on` option in `.to_parquet()` (see the sketch after these comments). – SultanOrazbayev Mar 07 '21 at 09:31
  • Excellent, thank you! May I ask you to show me how I can access the partitions to read the rows for one timestamp into a new Pandas dataframe? – Mike Mar 07 '21 at 12:44
  • Your df.loc example above requires the whole dataframe to be in memory. But that's not what I need. I want to read only the rows that have a certain timestamp from disk. – Mike Mar 07 '21 at 12:50
  • In the first example, what is loaded into memory is metadata about each partition, not the whole dataframe, so if you want to bring something into memory you will need to use `.compute()`. So something like: `df_new = df.loc['2000-01-01 03:40'].compute()`, where you can specify the range you want. – SultanOrazbayev Mar 07 '21 at 12:52
  • I have tried this: `ddf2 = dd.read_parquet('test_parquet')` `df_new = df.loc['2021-01-23 00:00:31.140'].compute()` But it does not work... – Mike Mar 07 '21 at 13:31
  • Thank you. Unfortunately, my data requires millisecond precision: there can be more than one distinct timestamp (differing only in milliseconds) within the same second. If that is not supported by dask, I may need to convert the timestamp index to a string first? – Mike Mar 07 '21 at 13:42
  • see the updated answer (second snippet) for further details – SultanOrazbayev Mar 07 '21 at 13:43
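
For reference, here is a minimal sketch of the `partition_on` option mentioned in the comments. The `date` column is a made-up, coarser key introduced only for illustration, since partitioning directly on millions of unique millisecond timestamps would again create one directory per value:

import dask.dataframe as dd

# reuse the pandas frame built above, with 'datetime' as a regular column
ddf = dd.from_pandas(df.reset_index(), npartitions=3)

# derive a coarser, hypothetical partition key
ddf['date'] = ddf['datetime'].dt.strftime('%Y-%m-%d')

# writes one sub-directory per distinct 'date' value,
# e.g. test_parquet_by_date/date=2021-01-23/
ddf.to_parquet('test_parquet_by_date', partition_on=['date'])

On read, `dd.read_parquet('test_parquet_by_date')` exposes `date` as an ordinary (categorical) column that can be filtered on.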