
I have a large csv with the following format:

timestamp,name,age
2020-03-01 00:00:01,nick
2020-03-01 00:00:01,john
2020-03-01 00:00:02,nick
2020-03-01 00:00:02,john
2020-03-01 00:00:04,peter
2020-03-01 00:00:05,john
2020-03-01 00:00:10,nick
2020-03-01 00:00:12,john
2020-03-01 00:00:54,hank
2020-03-01 00:01:03,peter

I load csv into a dataframe with:

df = pd.read_csv("/home/test.csv")

and then I want to create multiple dataframes every 2 seconds. For example:

df1 contains:

2020-03-01 00:00:01,nick
2020-03-01 00:00:01,john
2020-03-01 00:00:02,nick
2020-03-01 00:00:02,john

df2 contains :

2020-03-01 00:00:04,peter
2020-03-01 00:00:05,john

and so on.

I managed to split the timestamps with the command below:

full_idx = pd.date_range(start=df['timestamp'].min(), end = df['timestamp'].max(), freq ='0.2T')

but how can I store these split dataframes? How can I split a dataset based on timestamps into multiple dataframes?

e7lT2P

2 Answers


Probably that question can help us: Pandas: Timestamp index rounding to the nearest 5th minute

import numpy as np
import pandas as pd

df = pd.read_csv("test.csv")
df['timestamp'] = pd.to_datetime(df['timestamp'])

ns2sec = 2 * 1000000000   # 2 seconds in nanoseconds
# next we floor each timestamp down to its 2-second bin
timestamp_rounded = df['timestamp'].astype(np.int64) // ns2sec
df['full_idx'] = pd.to_datetime(timestamp_rounded * ns2sec)

# collect one dataframe per unique bin value
store_array = []
for value in df['full_idx'].unique():
    store_array.append(df[df['full_idx'] == value][['timestamp', 'name', 'age']])
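As an aside, the same split can be written more compactly with `Series.dt.floor` plus `groupby`, which avoids the manual nanosecond arithmetic. A sketch on a small hand-built frame standing in for the CSV (the question's rows carry no age values, so that column is omitted here):

```python
import pandas as pd

# Sample rows mirroring the question's CSV
df = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2020-03-01 00:00:01", "2020-03-01 00:00:01",
        "2020-03-01 00:00:02", "2020-03-01 00:00:02",
        "2020-03-01 00:00:04", "2020-03-01 00:00:05",
    ]),
    "name": ["nick", "john", "nick", "john", "peter", "john"],
})

# Floor each timestamp to its 2-second bin, then split into one frame per bin
df["full_idx"] = df["timestamp"].dt.floor("2s")
store_array = [group for _, group in df.groupby("full_idx")]
```

Note this produces the same even-second binning as the rounding code above (00:00:01 and 00:00:02 land in different bins), which differs slightly from the df1/df2 grouping shown in the question.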
Artyom Akselrod
  • Thank you for your answer. Why do you use .unique()? – e7lT2P Jun 02 '20 at 07:17
  • @e7lT2P we get a new column with values 2020-03-01 00:00:00, 2020-03-01 00:00:02, 2020-03-01 00:00:04 and so on. We want to split your df only once, so that the collection has exactly one df1, one df2, and so on. We do that by taking the unique values and filtering with `df['full_idx']==value` – Artyom Akselrod Jun 02 '20 at 08:29

How about .resample()?

#first loading your data
>>> import pandas as pd
>>>
>>> df = pd.read_csv('dates.csv', index_col='timestamp', parse_dates=True)
>>> df.head()
                      name  age
timestamp
2020-03-01 00:00:01   nick  NaN
2020-03-01 00:00:01   john  NaN
2020-03-01 00:00:02   nick  NaN
2020-03-01 00:00:02   john  NaN
2020-03-01 00:00:04  peter  NaN

#resampling it at a frequency of 2 seconds
>>> resampled = df.resample('2s')
>>> type(resampled)
<class 'pandas.core.resample.DatetimeIndexResampler'>

#iterating over the resampler object and storing the sliced dfs in a dictionary
>>> df_dict = {}
>>> for i, (timestamp, df) in enumerate(resampled):
...     df_dict[i] = df
>>> df_dict[0]
                     name  age
timestamp
2020-03-01 00:00:01  nick  NaN
2020-03-01 00:00:01  john  NaN

Now for some explanation...

resample() is great for rebinning DataFrames based on time (I use it often for downsampling time series data), but it can be used simply to cut up the DataFrame, as you want to do. Iterating over the resampler object produced by df.resample() returns a tuple of (name of the bin,df corresponding to that bin): e.g. the first tuple is (timestamp of the first second,data corresponding to the first 2 seconds). So to get the DataFrames out, we can loop over this object and store them somewhere, like a dict.

Note that this will produce every 2-second interval from the start to the end of the data, so many will be empty given your data. But you can add a step to filter those out if needed.
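That filtering step can be a one-liner over the resampler: keep only the bins whose group is non-empty. A sketch, assuming a DatetimeIndex as in the answer above (the frame here is hand-built for illustration):

```python
import pandas as pd

df = pd.DataFrame(
    {"name": ["nick", "john", "peter"]},
    index=pd.to_datetime(["2020-03-01 00:00:01",
                          "2020-03-01 00:00:02",
                          "2020-03-01 00:00:10"]),
)
df.index.name = "timestamp"

# Iterating the resampler yields every 2-second bin, including empty ones;
# drop the empties before storing.
non_empty = [(ts, g) for ts, g in df.resample("2s") if not g.empty]
df_dict = {i: g for i, (ts, g) in enumerate(non_empty)}
```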

Additionally, you could manually assign each sliced DataFrame to a variable, but this would be cumbersome (you would probably need to write a line for each 2 second bin, rather than a single small loop). Rather with a dictionary, you can still associate each DataFrame with a callable name. You could also use an OrderedDict or list or whatever collection.


A couple points on your script:

  • setting freq to "0.2T" means 0.2 minutes, i.e. 12 seconds (0.2 * 60), not 2 seconds; use freq="2s" instead
  • The example df1 and df2 are "out of phase": by that I mean one is binned in 2 seconds starting on odd numbers (1-2 seconds), while one is starting on evens (4-5 seconds). So the date_range you mentioned wouldn't create those bins; it would create dfs from either 0-1s, 2-3s, 4-5s... OR 1-2s, 3-4s, 5-6s..., depending on which timestamp it started on.
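A quick check of the first point, using `pd.date_range` (this assumes a pandas version that accepts fractional offset strings like "0.2T"):

```python
import pandas as pd

# "0.2T" is 0.2 minutes = 12 seconds between points:
idx_wrong = pd.date_range("2020-03-01 00:00:01", periods=3, freq="0.2T")

# 2-second spacing needs freq="2s":
idx_right = pd.date_range("2020-03-01 00:00:01", periods=3, freq="2s")
```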

For the latter point, you can use the base argument of .resample() to set the "phase" of the resampling. So in the case above, base=0 would start bins on even numbers, and base=1 would start bins on odds.
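Note that `base` was deprecated in pandas 1.1 in favor of the `offset` (and `origin`) arguments of `.resample()`. A sketch of shifting the bin phase by one second (the index is hand-built here to mirror the question's data):

```python
import pandas as pd

df = pd.DataFrame(
    {"name": ["nick", "john", "peter", "john"]},
    index=pd.to_datetime(["2020-03-01 00:00:01", "2020-03-01 00:00:02",
                          "2020-03-01 00:00:04", "2020-03-01 00:00:05"]),
)

# Default: bins start on even seconds: [00:00:00, 00:00:02), [00:00:02, 00:00:04), ...
even = [ts for ts, g in df.resample("2s") if not g.empty]

# offset="1s" shifts the bin edges so bins start on odd seconds
# (older pandas spelled this base=1):
odd = [ts for ts, g in df.resample("2s", offset="1s") if not g.empty]
```

With the shifted edges, the rows at 00:00:01 and 00:00:02 fall into the same bin, matching the question's df1.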

This is assuming you are okay with that type of binning - if you really want 1-2 seconds and 4-5 seconds to be in different bins, you would have to do something more complicated I believe.

Tom
  • df = pd.read_csv('dates.csv', index_col='timestamp', parse_dates=True) seems to remove timestamp column. how to prevent it? – e7lT2P Jun 02 '20 at 07:31
  • You can remove the `index_col='timestamp'` argument - this will leave that column in the `df`. But the data is not gone; using `index_col`, the `timestamp` becomes the index (row labels) of the `DataFrame`, you can still access the values by using `df.index`. It's common to have time series data in the index, and `resample()` automatically works on the `index` (but you can change this by using the `on` argument) – Tom Jun 02 '20 at 14:41