
I have a directory of JSON files to read, so I use the following code:

import os
import pandas as pd

test_filelist = os.listdir('myDir')
df_test_list = [pd.read_json(os.path.join('myDir', file), lines=True) for file in test_filelist if file.endswith('json')]
df_test = pd.concat(df_test_list)

The total size of the directory is 4.5 GB, but when I use top to check the memory my process uses, I see it using 30 GB once the read is done. Why does this happen? I only read 4.5 GB of JSON files, yet 30 GB of memory is used. How can I avoid this?

I also printed df_test.info(), and it says this dataframe only uses 177.7 MB of memory. Why?

nick_liu

2 Answers


It seems you are storing all of the individual data frames in df_test_list and then also saving the concatenated data frame in df_test. That way, you keep a lot of unnecessary data in memory: a list of big DataFrame objects is expensive.

Avoid saving the intermediate list:

df_test = pd.concat([pd.read_json( os.path.join('myDir',file),lines=True ) for file in test_filelist if file.endswith('json')])

or abstract that to a different scope, such as a function.
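
For example, a rough sketch of a helper function (the function name here is just illustrative) that builds the list inside its own scope, so the per-file frames can be released once it returns:

import os
import pandas as pd

def load_json_dir(path):
    # the per-file frames only live inside this function's scope
    frames = [pd.read_json(os.path.join(path, f), lines=True)
              for f in os.listdir(path) if f.endswith('json')]
    return pd.concat(frames)

df_test = load_json_dir('myDir')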

Either way you will still see a peak in memory consumption while reading, but the final memory usage will be lower than what you have now.

I would also recommend reading this answer, which has some insight into how pandas reports memory usage.
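
For instance, a quick way to compare the shallow figure that df.info() reports by default with a deep measurement that also counts the Python strings behind object columns:

# shallow: object columns only contribute the size of their pointers
print(df_test.memory_usage().sum())

# deep: also measures the string objects referenced by object columns
print(df_test.memory_usage(deep=True).sum())

# info() can report the deep figure too
df_test.info(memory_usage='deep')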

rafaelc
  • I have tried your method, but it still uses 30G in the end... Also, when I print df_test.info(), it tells me the memory is 177.7+ MB; I can't understand it – nick_liu Jul 27 '18 at 06:14
  • try `df_test.memory_usage(deep=True)`; object cols are ignored by default. – fordy Jul 27 '18 at 17:33

You can specify the types of the columns, which helps a lot with the memory footprint, particularly for categorical variables (which are generally loaded as the object dtype by default), so that duplicate values are mapped to the same object in memory.

You can specify types as follows:

import numpy as np
import pandas as pd

column_types = {'col_a': np.float64,
                'col_b': object,
                'col_c': 'category'}

pd.read_json("path/to/json", dtype=column_types)

For your code, you can also delete df_test_list once you have created df_test to free up memory, e.g.

del df_test_list
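
Putting both suggestions together with the code from the question might look roughly like this; the column names and dtypes are placeholders for whatever your JSON actually contains, and the category conversion is done after the concat so it does not depend on the JSON parser handling that dtype:

import gc
import os
import numpy as np
import pandas as pd

column_types = {'col_a': np.float64,   # placeholder numeric column
                'col_b': object}       # placeholder string column

test_filelist = os.listdir('myDir')
df_test_list = [pd.read_json(os.path.join('myDir', f), lines=True, dtype=column_types)
                for f in test_filelist if f.endswith('json')]
df_test = pd.concat(df_test_list)

# placeholder low-cardinality column: repeated strings now share one object
df_test['col_c'] = df_test['col_c'].astype('category')

del df_test_list   # drop the list of per-file frames
gc.collect()       # and ask Python to release that memory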
fordy