
I can't figure out how to reduce this program's memory usage any further. Basically, I'm reading JSON log files into a pandas DataFrame, but:

  1. the list append is what is causing the issue: it creates two different objects in memory, causing huge memory usage.
  2. pandas' .to_pickle method is also a huge memory hog; the biggest memory spike happens while writing the pickle.

Here is my most efficient implementation to date:

import glob
import json
import pandas as pd

columns = ['eventName', 'sessionId', "eventTime", "items", "currentPage", "browserType"]
df = pd.DataFrame(columns=columns)
l = []

for i, file in enumerate(glob.glob("*.log")):
    print("Going through log file #%s named %s..." % (i+1, file))
    with open(file) as myfile:
        l += [json.loads(line) for line in myfile]
        tempdata = pd.DataFrame(l)
        for column in tempdata.columns:
            if column not in columns:
                try:
                    tempdata.drop(column, axis=1, inplace=True)
                except ValueError:
                    print ("oh no! We've got a problem with %s column! It don't exist!" % (badcolumn))
        l = []
        df = df.append(tempdata, ignore_index = True)
        # very slow version, but is most memory efficient
        # length = len(df)
        # length_temp = len(tempdata)
        # for i in range(1, length_temp):
        #     update_progress((i*100.0)/length_temp)
        #     for column in columns:
        #         df.at[length+i, column] = tempdata.at[i, column]
        tempdata = 0

print ("Data Frame initialized and filled! Now Sorting...")
df.sort(columns=["sessionId", "eventTime"], inplace = True)
print ("Done Sorting... Changing indices...")
df.index = range(1, len(df)+1)
print ("Storing in Pickles...")
df.to_pickle('data.pkl')

Is there an easy way to reduce memory usage? The commented-out code does the job but takes 100-1000x longer. Memory usage currently peaks at 45% during the .to_pickle step and 30% while reading the logs, and the more logs there are, the higher those numbers climb.

furby559
  • Searching "inplace panda append", I found this: http://stackoverflow.com/questions/18196616/append-rows-to-a-pandas-dataframe-without-making-a-new-copy – satoru Aug 07 '15 at 22:51
  • I saw that, but my question is multi-pronged: why does the .to_pickle part take up so much memory? And is there a way to do this without .append, like my second, slower but more memory-efficient attempt? – furby559 Aug 07 '15 at 22:58

2 Answers


This answer covers general pandas DataFrame memory-usage optimization; a short sketch illustrating all three points follows the list:

  1. Pandas loads string columns as the object dtype by default. For columns that have the object dtype, try assigning them the category dtype instead, e.g. by passing a dictionary to the dtype parameter of read_csv. Memory usage decreases dramatically for columns where 50% or fewer of the values are unique.

  2. Pandas reads numeric columns in as float64 by default. Use pd.to_numeric with its downcast argument to shrink float64 columns to float32 (or integers to smaller integer types) where possible. This again saves memory.

  3. Load CSV data chunk by chunk: process one chunk, then move on to the next. This can be done by passing a value to the chunksize parameter of read_csv, which then returns an iterator over DataFrame chunks.
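
A minimal sketch combining the three points, assuming a hypothetical events.csv with a numeric itemCount column and a low-cardinality browserType column (the file and column names are placeholders, not from the question):

import pandas as pd

# events.csv, itemCount and browserType are placeholders for illustration.
# 3. Read the CSV in chunks; read_csv returns an iterator of DataFrames
#    when chunksize is given.
chunks = []
for chunk in pd.read_csv("events.csv", chunksize=100000):
    # 2. Downcast numeric columns to the smallest integer type that fits.
    chunk["itemCount"] = pd.to_numeric(chunk["itemCount"], downcast="integer")
    chunks.append(chunk)

df = pd.concat(chunks, ignore_index=True)

# 1. Convert low-cardinality string columns from object to category
#    (equivalently, pass dtype={"browserType": "category"} to read_csv).
df["browserType"] = df["browserType"].astype("category")

print(df.memory_usage(deep=True))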

Victor Tang

If you need to build a DataFrame up from pieces, it is generally much more efficient to construct a list of the component frames and combine them all in one step using concat. See the first approach below.

# df = 10 rows of dummy data

In [10]: %%time
    ...: dfs = []
    ...: for _ in xrange(1000):
    ...:     dfs.append(df)
    ...: df_concat = pd.concat(dfs, ignore_index=True)
    ...: 
Wall time: 42 ms

In [11]: %%time
    ...: df_append = pd.DataFrame(columns=df.columns)
    ...: for _ in xrange(1000):
    ...:     df_append = df_append.append(df, ignore_index=True)
    ...: 
Wall time: 915 ms
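
Applied to the log-reading loop from the question, the concat approach might look roughly like this sketch (it reuses the question's columns list and file pattern, and swaps the since-removed df.sort(columns=...) call for its successor sort_values; not tested against the original data):

import glob
import json
import pandas as pd

columns = ['eventName', 'sessionId', 'eventTime', 'items', 'currentPage', 'browserType']

frames = []
for file in glob.glob("*.log"):
    with open(file) as myfile:
        records = [json.loads(line) for line in myfile]
    frame = pd.DataFrame(records)
    # keep only the columns of interest
    frame = frame[[c for c in columns if c in frame.columns]]
    frames.append(frame)

# a single concat at the end instead of repeated df.append calls
df = pd.concat(frames, ignore_index=True)
df = df.sort_values(["sessionId", "eventTime"]).reset_index(drop=True)
df.to_pickle('data.pkl')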
chrisb