I hope someone can provide some assistance. I still have a lot to learn about Pandas, and this is the first time I've come across a problem like this: working with amounts of data so large that RAM becomes a limitation. What's slightly different from other problems I found online is that those usually involve reading large csv files, whereas I read a relatively small csv and generate a lot of data from it.
This is my code (an HTML export of my Jupyter notebook).
The basics of what I'm doing:
- Read a csv into a dataframe
- Group the dataframe by the TagName column
- For each tag, resample the data to 1-second resolution
- Interpolate gaps
- Fill in any remaining gaps (this is needed, but not relevant to the problem in question)
- Save data into a new df
- Pivot_table the new df so that TagNames each become a column.
From the resampling step onwards, things get hectic, since I am creating 31 557 600 data points per tag (and there are about 23 tags).
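To put a number on it, here is a rough back-of-the-envelope estimate I did of just the float64 values in the full-size output (my own quick arithmetic, nothing measured):

seconds_per_year = 365.25 * 24 * 3600   # = 31 557 600 rows per tag at 1-second resolution
n_tags = 23                             # roughly my tag count
bytes_per_float64 = 8
print(seconds_per_year * n_tags * bytes_per_float64 / 1024**3)  # ~5.4 GB for the values alone, before the index and intermediates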
Please, could you help me process this?
Edit
I have created a minimal example. The Jupyter notebook file is here; the same code is below.
My question is: how does one handle this amount of data, since it quickly gets out of hand as you increase the number of tags or the data resolution? As it stands, my PC with 16 GB of RAM runs out of memory (well, not with this exact example - it is deliberately smaller so there is a working example showing what I want to do). How should data like this be processed, given how much of it there is? I don't know if iterrows would work well, since I'm interpolating to fill the values.
import pandas as pd
import numpy as np
# random timestamp from https://stackoverflow.com/questions/553303/generate-a-random-date-between-two-other-dates
def random_date(start, end, position=None):
    start, end = pd.Timestamp(start), pd.Timestamp(end)
    delta = (end - start).total_seconds()
    if position is None:
        offset = np.random.uniform(0., delta)
    else:
        offset = position * delta
    # Second() needs a whole number of seconds
    offset = pd.offsets.Second(int(offset))
    t = start + offset
    return t
# build a dummy df with two tags
df = pd.DataFrame(np.random.rand(200), columns=['Value'])
df['TagName'] = ''
df.loc[df.index <= 100, 'TagName'] = 'Tag A'
df.loc[df.index > 100, 'TagName'] = 'Tag B'
df['DateTime'] = [random_date(start="2016-01-01 00:00:00", end="2016-03-31 23:59:59") for i in range(df.shape[0])]
df.loc[df['TagName'] == 'Tag A', 'DateTime'] = sorted(df[df['TagName'] == 'Tag A']['DateTime'].values)
df.loc[df['TagName'] == 'Tag B', 'DateTime'] = sorted(df[df['TagName'] == 'Tag B']['DateTime'].values)
# actual processing
# collect the results in plain lists and build one big df at the end,
# since it's better to create the df from lists than to append inside the loop
tagname_list = []
value_list = []
datetime_list = []
groups = df.groupby('TagName')
for tag, d in groups:
    # resample each continuous tag to 1-second resolution and interpolate the gaps
    resampled = d.set_index('DateTime')[['Value']].asfreq('S')
    filled = resampled.interpolate('pchip')
    tagname_list.extend(len(filled) * [tag])
    value_list.extend(filled['Value'].values)
    datetime_list.extend(filled.index)
del df
this_is_what_im_after = pd.DataFrame(
    {'TagName': tagname_list, 'Value': value_list, 'DateTime': datetime_list}
).pivot_table(values='Value', index='DateTime', columns='TagName')
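For what it's worth, this is how I check the footprint of the result in the minimal example (memory_usage(deep=True) is the standard pandas way to measure a frame's size); at full scale this is the number that blows past my 16 GB:

# measure how much memory the pivoted result actually takes, in MB
print(this_is_what_im_after.memory_usage(deep=True).sum() / 1024**2, "MB")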