
Hello, I am trying to create a pandas DataFrame from a list of dicts (or a dict of dicts) with an eventual shape of 60,000 rows and ~10,000 columns.

The column values are 0 or 1, and the data is really sparse.

Creating the list/dict object is fast, but when I call from_dict or from_records I get memory errors. I also tried appending to a DataFrame periodically rather than all at once, and it still didn't work. I also tried setting individual cells one by one, to no avail.

By the way, I am building my Python object from about 100 JSON files that I parse.
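Roughly, my approach looks like this (a simplified sketch - the real file names and keys differ):

import glob
import json
import pandas as pd

records = []
for path in glob.glob("data/*.json"):  # ~100 files in my case
    with open(path) as f:
        # each file yields dicts of {column_name: 1} for the features that are present
        records.extend(json.load(f))

# this is the call that raises MemoryError
df = pd.DataFrame.from_records(records)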

How can I go from Python objects to a DataFrame? Or maybe I can use something else entirely; I eventually want to feed the result to a scikit-learn algorithm.

Kevin

2 Answers


If you have only 0 and 1 as values, you should use np.bool or np.int8 as the dtype - this will reduce your memory consumption by at least a factor of 4.

Here is a small demonstration:

In [34]: df = pd.DataFrame(np.random.randint(0, 2, (60000, 10000)))

In [35]: df.shape
Out[35]: (60000, 10000)

In [36]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 60000 entries, 0 to 59999
Columns: 10000 entries, 0 to 9999
dtypes: int32(10000)
memory usage: 2.2 GB

By default, pandas uses np.int32 (32 bits, or 4 bytes) for integers here.

Let's downcast it to np.int8:

In [39]: df_int8 = df.astype(np.int8)

In [40]: df_int8.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 60000 entries, 0 to 59999
Columns: 10000 entries, 0 to 9999
dtypes: int8(10000)
memory usage: 572.2 MB

It now consumes 572 MB instead of 2.2 GB (4 times less).

Or using np.bool:

In [41]: df_bool = df.astype(np.bool)

In [42]: df_bool.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 60000 entries, 0 to 59999
Columns: 10000 entries, 0 to 9999
dtypes: bool(10000)
memory usage: 572.2 MB
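
Applied to your case, a minimal sketch might look like this (assuming your parsed JSON ends up in a list of dicts called records - the name and the example keys are just illustrative):

import numpy as np
import pandas as pd

# hypothetical list of dicts, each mapping column name -> 1 for the features present
records = [{"feat_a": 1, "feat_c": 1}, {"feat_b": 1}]

# missing keys come out as NaN, so fill them with 0 before downcasting to int8
df = pd.DataFrame.from_records(records).fillna(0).astype(np.int8)

print(df.dtypes.unique())  # [dtype('int8')]

Note that the intermediate frame before .astype() is float64, so for 60,000 x 10,000 you still need a few GB temporarily; if that is too much, build and downcast a frame per JSON file (or per group of files) and pd.concat the int8 pieces.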
MaxU - stand with Ukraine
  • Thanks, that definitely helped. I was also wondering about the from_dict part - would it parse certain data structures faster than others? I know flatter data structures are generally more efficient than nested ones. – Kevin Jun 05 '16 at 12:14
  • 1
    @Kevin, you are very welcome! :) If you care of speed of reading data from disk to pandas then you may want to check this [answer](http://stackoverflow.com/questions/37010212/what-is-the-fastest-way-to-upload-a-big-csv-file-in-notebook-to-work-with-python/37012035#37012035) – MaxU - stand with Ukraine Jun 05 '16 at 14:00

Another thing you might try is to enable pyarrow.

spark.conf.set("spark.sql.execution.arrow.enabled", "true")

This sped up my calls to pd.DataFrame by an order of magnitude!

(Note that to use pyarrow, you must use pyspark>=3.0.0 if you use a newer pyarrow (e.g. pyarrow>=1.0.0). For pyspark==2.x, it's easiest to use pyarrow==0.15.x.)
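
For context, this setting lives on a SparkSession and pays off when you convert a Spark DataFrame to pandas. A minimal sketch (assuming pyspark is installed; the data here is just illustrative):

from pyspark.sql import SparkSession

# in a notebook the session usually already exists as `spark`
spark = SparkSession.builder.appName("arrow-demo").getOrCreate()

# enable Arrow-based columnar transfers between the JVM and Python
# (on Spark 3.x the key is spark.sql.execution.arrow.pyspark.enabled)
spark.conf.set("spark.sql.execution.arrow.enabled", "true")

sdf = spark.createDataFrame([(0, 1), (1, 0)], ["a", "b"])
pdf = sdf.toPandas()  # the Arrow path speeds up this conversion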

K.S.