How to reduce the memory usage and speed up the code

Question

I am using huge dataset with 5 columns and more that 90 million rows. The code works fine with part of the data, but when it comes to the whole I get Memory Error. I read about generators, but it appears very complex for me. Can I get explanation based on this code?

df = pd.read_csv('D:.../test.csv', names=["id_easy","ordinal", "timestamp", "latitude", "longitude"])

df = df[:-1]
df.loc[:,'timestamp'] = pd.to_datetime(df.loc[:,'timestamp'])
pd.set_option('float_format', '{:f}'.format)
df['epoch'] = df.loc[:, 'timestamp'].astype('int64')//1e9
df['day_of_week'] = pd.to_datetime(df['epoch'], unit="s").dt.weekday_name
del df['timestamp']

for day in ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']:
    day_df = df.loc[df['day_of_week'] == day]
    day_df.to_csv(f'{day}.csv', index=False,)

Error appears on the last for loop operation

Sample data:

d4ace40905729245a5a0bc3fb748d2b3    1   2016-06-01T08:18:46.000Z    22.9484 56.7728
d4ace40905729245a5a0bc3fb748d2b3    2   2016-06-01T08:28:05.000Z    22.9503 56.7748

UPDATED

I did this:

chunk_list = []  

for chunk in df_chunk:  
    chunk_list.append(chunk)
df_concat = pd.concat(chunk_list)

I have no idea how to proceed now? How to apply the rest of the code?

Why aren't you parsing each row (and writing it to its corresponding day) separately? — Jason, Sep 29 '19 at 18:45
@Mamed, what gets in the way is skipping the last line `df[:-1]`. Are you able to find out the actual number of rows within a file? What operating system are you on? — RomanPerekhrest, Sep 29 '19 at 20:06
@RomanPerekhrest deleting the last raw, due to NaN. Can not to manually, beavuse csv file is not opening. — Mamed, Sep 29 '19 at 20:11
@Mamed, you mean the last line is an empty line, right? or does it contain any markers? — RomanPerekhrest, Sep 29 '19 at 20:16
can you post the output of `tail -1 test.csv`? Even going through chunks you are still not able to skip the last line from processing — RomanPerekhrest, Sep 29 '19 at 20:21

Massifox · Answer 1 · 2019-09-29T22:40:40.970

My advice is to switch to Dask or Spark.

If you want to continue using pandas, try the following tips to read a CSV file, with pandas.read_csv:

chunksize parameter: that allows you to read a piece of files at a time. For example in your case you could use chunksize equal to a million, you would get 90 chunks and you could operate on each chunk individually.
dtype parameter: with this parameter you can specify the data type of each column simply by passing a dictionary like this: {‘a’: np.float32, ‘b’: np.int32, ‘c’: ‘Int32’}
Pandas could use 64-bit data types, while 32bit might be enough for you. With this trick you could save 50% of the space.

Yuor case study

Try this code:

df_chunks = pd.read_csv('test.csv', chunksize=1000000, iterator=True, 
                         parse_dates=['timestamp'], error_bad_lines=False,
                         dtype={"ordinal":'int32', "latitude": 'float32', "longitude":'float32'})
for chunk in df_chunks:
    # chunk = chunk.apply(...) # process the single chunk 
    for day in ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']:
        day_df = chunk.loc[chunk['day_of_week'] == day]
        day_df.to_csv(f'{day}.csv', mode='a', index=0, header=False)

This way you work on one chunk of data at a time and never work with all the data together. The mode='a' tells pandas to append.

Note1: You do not need pandas.concat here. The only thing iterator and chunksize=1000000 does is to give you a reader object that iterates 1000000-row DataFrames instead of reading the whole thing. Using concat you lose all the advantages of using iterators and loading the whole file into memory exactly like using csv laws without specifying chunksize.

Note2: If the 'MemoryError' error persists, try smaller chunksize.

Can you check my question again? I have applied chunksize and got the results, but do not know what to do now — Mamed, Sep 29 '19 at 20:21

score 1 · Accepted Answer · answered Sep 29 '19 at 21:11

Complex improvements:

to iterate through a (potentially very large) file lazily rather than reading the entire file into memory - specify a chunksize to read_csv call (specifying a number of rows to read at one iteration)
the statement df = df[:-1] is not applicable in iterator approach and assuming that the last line is in bad format 99695386 [space] NaN NaN NaN NaN - we can handle it and skip by specifying option error_bad_lines=False
the statement df.loc[:,'timestamp'] = pd.to_datetime(df.loc[:,'timestamp']) can also be eliminated by using parse_dates=['timestamp'] as an option for pd.read_csv call
we'll append to an existing target csv file applying mode='a' (append to a file)

On practice:

n_rows = 10 * 6  # adjust empirically
reader = pd.read_csv('test.csv', names=["id_easy","ordinal", "timestamp", "latitude", "longitude"], 
                     parse_dates=['timestamp'], chunksize=n_rows, error_bad_lines=False)                               
day_names = ('Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday')

for df in reader: 
    if not df.empty: 
        df['epoch'] = df.loc[:, 'timestamp'].astype('int64') // 1e9 
        df['day_of_week'] = pd.to_datetime(df['epoch'], unit="s").dt.weekday_name 
        del df['timestamp']
        for day in day_names: 
            day_df = df.loc[df['day_of_week'] == day] 
            if not day_df.empty:
                day_df.to_csv(f'{day}.csv', index=False, header=False, mode='a')

https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#io-chunking

1 question. Due to the fact that my last raw has NaN, before using chunk I was adding [:-1] (to delete last row). But now I get the error. Any idea? — Mamed, Sep 29 '19 at 23:40
@Mamed, actually you can not slice `reader` object with `[:-1] ` - reader is an iterator. Your last line should be handled with `error_bad_lines=False` as was explained in my 2nd thesis — RomanPerekhrest, Sep 30 '19 at 07:27
ok. Then here are my errors: `OverflowError: int too big to convert` and `OutOfBoundsDatetime: cannot convert input -9223372037.0 with the unit 's'` — Mamed, Sep 30 '19 at 07:36
for `OutOfBoundsDatetime` - `df['day_of_week'] = pd.to_datetime(df['epoch'], unit="s").dt.weekday_name` And for `OverflowError` - `pandas/_libs/tslib.pyx in pandas._libs.tslib.array_with_unit_to_datetime()` + `pandas/_libs/tslibs/timedeltas.pyx in pandas._libs.tslibs.timedeltas.cast_from_unit()` — Mamed, Sep 30 '19 at 07:41
@Mamed, check this topic https://stackoverflow.com/questions/32888124/pandas-out-of-bounds-nanosecond-timestamp-after-offset-rollforward-plus-adding-a — RomanPerekhrest, Sep 30 '19 at 09:04

score -2 · Answer 3 · edited Oct 08 '19 at 00:02

-2

You can use something like pypy (which is an alternative implementation of Python that is not compatible with many packages but faster and has better memory management). It did not use to support pandas (so you would need to iterate through each row, but pypy is very fast at doing this), but I believe if you use a version since this release that now it can be used with pandas.

edited Oct 08 '19 at 00:02

Massifox

4,369
11
31

answered Sep 29 '19 at 18:28

Maddo

175
1
6

How to reduce the memory usage and speed up the code

Sample data:

UPDATED

3 Answers3

Yuor case study

Linked