
I read data from a CSV file. It takes roughly 5 GB of RAM (judging by the Jupyter notebook memory-usage figure and by htop on Linux).

import pandas as pd

df = pd.read_csv(r'~/data/a.txt', usecols=[0, 1, 5, 15, 16])

Then I group it, modify the resulting dataframes, and delete df:

df.set_index('Date')  # note: set_index returns a new frame; without assignment this is a no-op
y = df.groupby('Date')

days = [(key, value) for key, value in y]

del df

for day in days:
    day[1].set_index('Time')  # likewise a no-op: the result is not assigned
    del day[1]['Date']        # this, however, does drop the column in place

At this point I would expect groupby to roughly double the memory use, and del df to then release half of it. But in fact the process is using 9 GB.

How can I split the dataframe by date without duplicating memory use?

EDIT: since it appeared that Python does not release memory back to the OS, I had to use Python's memory_profiler to find the actual memory use:

print(memory_profiler.memory_usage()[0])

407  << mem use (MiB)

df = pd.read_csv(...)

4362  <<

groupby and create the days list

6351  <<

df = None
gc.collect()

6351  <<
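
For reference, here is a minimal sketch of the measurement pattern behind the numbers above, assuming memory_profiler is installed and the same file as before (the mem() helper is just for illustration):

import gc
import memory_profiler
import pandas as pd

def mem():
    # current RSS of this process in MiB, as reported by memory_profiler
    return memory_profiler.memory_usage()[0]

print(mem())  # baseline (~407 above)

df = pd.read_csv(r'~/data/a.txt', usecols=[0, 1, 5, 15, 16])
print(mem())  # after the read (~4362 above)

days = [(key, value) for key, value in df.groupby('Date')]
print(mem())  # after grouping (~6351 above)

df = None
gc.collect()
print(mem())  # unchanged (~6351 above)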
Boppity Bop
  • Not related to the question, but... why are you reading the whole CSV? Why don't you split it and process it in chunks (see the sketch after these comments)? Is it really necessary to use it as a whole? – Alexander Santos Sep 21 '22 at 15:27
  • Also, this may help as an answer to the question: https://stackoverflow.com/a/39377643/10473393 – Alexander Santos Sep 21 '22 at 15:29
  • Ask yourself, please: why would I keep data in memory if I don't need it? – Boppity Bop Sep 21 '22 at 15:36
  • Each `value` in `days` is still a view of `df`, so `del df` really doesn't do anything. Before and after `del df` you would still use 5 GB of RAM. – Quang Hoang Sep 21 '22 at 15:56
  • @QuangHoang I ran the memory profiler and it shows that memory use grew by 2 GB after the groupby. I would not mind if it were as you explain, but that doesn't match the facts. – Boppity Bop Sep 21 '22 at 16:26
  • And is that the same/different after `del df`? A side note though: why do you turn it into a list `[(day, day_data), ...]`? – Quang Hoang Sep 21 '22 at 16:33
  • See the edit. Re: why - I don't know any better! :) Offer a better way. – Boppity Bop Sep 21 '22 at 16:36
  • Just like I commented before, `del df` doesn't really do anything. I bet your data is fragmented by day, so the overhead for `groupby` is large. Try putting the `groupby` object in the list comprehension to see if it improves. Then again, what are you going to use the list for that you can't do with `df`? – Quang Hoang Sep 21 '22 at 16:39
  • It is in a list comprehension. I don't know what "fragmented" means, but the data is sequentially stamped by date and time, with no gaps. I wanted clarity with the day data; this is why I am splitting it, for sanity purposes. So far I would like to hear why "the overhead is large". If you know why, please put it in an answer; no point arguing in comments. – Boppity Bop Sep 21 '22 at 16:49
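
Following up on the chunking suggestion in the comments, here is a minimal sketch of reading the CSV in chunks and accumulating per-date frames as you go, so the full 5 GB frame never has to exist at once. The chunksize value is arbitrary; the path and column names are taken from the question:

import pandas as pd

# accumulate per-date pieces chunk by chunk
pieces = {}  # date -> list of partial DataFrames
for chunk in pd.read_csv(r'~/data/a.txt', usecols=[0, 1, 5, 15, 16],
                         chunksize=1_000_000):
    for date, grp in chunk.groupby('Date'):
        pieces.setdefault(date, []).append(grp)

# stitch each date together, popping as we go so the partial pieces can be freed
days = {}
for date in list(pieces):
    parts = pieces.pop(date)
    days[date] = pd.concat(parts).set_index('Time').drop(columns='Date')

Peak memory is then roughly one copy of the data plus one date's worth of duplication during the concat, rather than two full copies.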

1 Answer


Try this: instead of grouping by date, you can create a DataFrame for every date:

unique_dates = df["Date"].unique()
days = []
for date in unique_dates:
    # boolean-mask selection returns one copy per date
    days.append(df[df["Date"] == date].set_index("Time"))
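
A possible follow-up, as a sketch: df[df["Date"] == date] returns a copy rather than a view, so once the list is built you can delete the original frame and its memory can actually be reclaimed (same df and column names as above assumed):

import gc

unique_dates = df["Date"].unique()
days = [df[df["Date"] == date].set_index("Time") for date in unique_dates]

del df        # the per-date frames are independent copies, so df can be freed
gc.collect()  # prompt CPython to collect promptly

Note that this scans df once per unique date, so with many dates a single groupby pass is faster; the trade-off here is simplicity and being able to drop df afterwards.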
Mouad Slimane