
I hope someone can provide some assistance. I still have a lot to learn about Pandas, and this is the first time I've come across a problem like this: working with amounts of data so large that RAM becomes a problem. What is slightly different from other problems I found online is that others often read large csv files, whereas I read a relatively small csv and create lots of data from it.

This is my code (an HTML version of my Jupyter notebook).

The basics of what I'm doing:

  1. Read a csv into a dataframe
  2. Group the dataframe by the TagName column
  3. For each tag, resample the data to 1-second resolution
  4. Interpolate gaps
  5. Fill back any gaps (this is needed, but not relevant to the problem in question)
  6. Save the data into a new df
  7. pivot_table the new df so that each TagName becomes a column.

From step 5 onwards, things get hectic, since I am creating 31,557,600 data points per tag (one value per second for a year), and there are about 23 tags.
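To put that in perspective, a rough back-of-the-envelope estimate of the memory needed just to hold the final values (assuming float64 values and a datetime64 index; the intermediate lists and the pivot step need considerably more):

seconds_per_year = 31557600          # one value per second for a year (365.25 days)
n_tags = 23
bytes_per_float64 = 8
bytes_per_timestamp = 8              # datetime64[ns]

values_bytes = seconds_per_year * n_tags * bytes_per_float64
index_bytes = seconds_per_year * bytes_per_timestamp    # one shared index after the pivot
print(values_bytes / 2**30)                  # ~5.4 GiB for the values alone
print((values_bytes + index_bytes) / 2**30)  # ~5.6 GiB including the index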

Please, could you help me process this?


Edit

I have created a minimal example. The Jupyter notebook file is here; the same code is below.

My question is: how would one handle this large amount of data, since it quickly gets out of hand as you increase the number of tags or the data resolution? As it is, my PC with 16 GB of RAM runs out of memory (well, not with this exact example - this one is smaller so that there is a working example showing what I want to do). How should one process data like this, given that there is so much of it? I don't know whether iterrows will work well, since I'm interpolating to fill the values.

import pandas as pd
import numpy as np

# random timestamp from https://stackoverflow.com/questions/553303/generate-a-random-date-between-two-other-dates
def random_date(start, end, position=None):
    start, end = pd.Timestamp(start), pd.Timestamp(end)
    delta = (end - start).total_seconds()
    if position is None:
        offset = np.random.uniform(0., delta)
    else:
        offset = position * delta
    offset = pd.offsets.Second(int(offset))  # Second() needs an integer; whole seconds also keep the timestamps aligned with asfreq('S')
    t = start + offset
    return t

# build a dummy df: 200 random values split over two tags, with random (but sorted per tag) timestamps
df = pd.DataFrame(np.random.rand(200), columns=['Value'])
df['TagName'] = ''
df.loc[df.index <= 100, 'TagName'] = 'Tag A'
df.loc[df.index > 100, 'TagName'] = 'Tag B'
df['DateTime'] = [random_date(start="2016-01-01 00:00:00", end="2016-03-31 23:59:59") for i in range(df.shape[0])]
df.loc[df['TagName'] == 'Tag A', 'DateTime'] = sorted(df[df['TagName'] == 'Tag A']['DateTime'].values)
df.loc[df['TagName'] == 'Tag B', 'DateTime'] = sorted(df[df['TagName'] == 'Tag B']['DateTime'].values)

#actual processing
# final lists for appending to one big df
# because it's better to create from lists
# than appending in a loop
tagname_list = []
value_list = []
datetime_list = []

groups = df.groupby('TagName')
for tag, d in groups:
    # resample each (continuous data) tag to 1-second resolution and interpolate the gaps
    resampled = d.set_index('DateTime').asfreq('S')
    filled = resampled.interpolate('pchip')

    tagname_list.extend(len(filled) * [tag])
    value_list.extend(filled['Value'].values)
    datetime_list.extend(filled.index)

del df


long_df = pd.DataFrame({'TagName': tagname_list, 'Value': value_list, 'DateTime': datetime_list})
this_is_what_im_after = long_df.pivot_table(values='Value', index='DateTime', columns='TagName')
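
For completeness, below is a minimal sketch of the chunked, one-tag-at-a-time route suggested in the comments: resample and interpolate a single tag, write it to disk, and free the memory before moving on. It assumes the dummy df built above (run it before the del df line); the file name pattern, the float32 downcast and the use of CSV output are illustrative assumptions, not part of my actual notebook.

for tag, d in df.groupby('TagName'):
    resampled = d.set_index('DateTime').asfreq('S')
    filled = resampled.interpolate('pchip')
    # keep only the numeric column and halve its size (assumption: float32 is precise enough)
    out = filled['Value'].astype('float32')
    # hypothetical file name pattern; one file per tag, so only one tag is ever in memory
    out.to_csv('resampled_{}.csv'.format(tag), header=True)
    del resampled, filled, out

The per-tag files could then be joined column by column in a second pass, or handed to something like dask, which is designed to work on data that does not fit in RAM.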
  • What *specifically* is the problem? – John Zwinck Sep 19 '17 at 12:25
  • The problem is that this needs an insane amount of RAM and gives a MemoryError after a while. I will try to make a minimal example to post here. – Mierzen Sep 19 '17 at 12:28
  • Unless "buying more RAM" is an option, you'll have to process your data in manageable pieces. For example, tackle one tag at a time and write the result to disk (and delete the associated dataframe object) – GPhilo Sep 19 '17 at 12:32
  • Post your code here, not on an external website. – blacksite Sep 19 '17 at 12:37
  • Buying more RAM would be crazy, because it already gets stuck on the second tag with 16 GB. Doing it in chunks was what I attempted in the second part of my notebook, but that didn't work, hence my asking here. Yes, I will post my new example here. – Mierzen Sep 19 '17 at 12:38
  • Another question is: do you really need to _create_ 31 557 600 data points per tag? – Andras Deak -- Слава Україні Sep 19 '17 at 12:46
  • I don't *really need to*, but I would like to. In any case, this turns out being an exercise to learn ways of dealing with large datasets – Mierzen Sep 19 '17 at 12:48
  • 1
    Have you tried using dask dataframes instead of pandas ? – David Leon Sep 19 '17 at 13:34
  • **I have updated my post**. I have never used dask before, @DavidLeon – Mierzen Sep 19 '17 at 13:48
  • As @Andras said, do you really need to create so many data points? Maybe you could find some other approach to solve your problem or, if it is necessary for you to work with one point per second, could you do the job with smaller periods (each week/day instead of several months, for instance)? – David Leon Sep 20 '17 at 10:00
