Efficiently construct Pandas DataFrame from large list of tuples/rows

Question

I've inherited a data file saved in the Stata .dta format. I can load it with the scikits.statsmodels genfromdta() function. This puts my data into a 1-dimensional NumPy array, where each entry is a row of data stored as a 24-tuple.

In [2]: st_time = time.time(); initialload = sm.iolib.genfromdta("/home/myfile.dta"); ed_time = time.time(); print (ed_time - st_time)
666.523324013

In [3]: type(initialload)
Out[3]: numpy.ndarray

In [4]: initialload.shape
Out[4]: (4809584,)

In [5]: initialload[0]
Out[5]: (19901130.0, 289.0, 1990.0, 12.0, 19901231.0, 18.0, 40301000.0, 'GB', 18242.0, -2.368063, 1.0, 1.7783716290878204, 4379.355, 66.17669677734375, -999.0, -999.0, -0.60000002, -999.0, -999.0, -999.0, -999.0, -999.0, 0.2, 371.0)

I am curious if there's an efficient way to arrange this into a Pandas DataFrame. From what I've read, building up a DataFrame row-by-row seems quite inefficient... but what are my options?

I've written a pretty slow first-pass that just reads each tuple as a single-row DataFrame and appends it. Just wondering if anything else is known to be better.

Does `pandas.DataFrame(initialload)` return what you are searching for? — eumiro, Jul 10 '12 at 14:38
Wow. Almost. It goofed up some column names, but I can easily fix that. Crazy. Thank you, I would never have guessed that even after reading the Pandas docs. Sorry this was so simple. — ely, Jul 10 '12 at 14:41

score 21 · Accepted Answer · edited Jan 21 '19 at 21:22


pandas.DataFrame(initialload, columns=list_of_column_names)
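To illustrate the one-liner above, here is a minimal sketch. The sample rows and the column names (`date`, `id`, `country`, `value`) are made up for the example and do not come from the original file. Passing every row to the constructor in a single call lets pandas build the frame in one step instead of growing it row by row.

```python
import numpy as np
import pandas as pd

# Hypothetical sample rows standing in for the 24-tuples in the question.
rows = [
    (19901130.0, 289.0, "GB", -2.368063),
    (19901231.0, 18.0, "US", 0.2),
    (19910131.0, 371.0, "GB", -999.0),
]

# Hypothetical column names; substitute the real variable names.
column_names = ["date", "id", "country", "value"]

# Build the whole frame in one call from a plain list of tuples.
df = pd.DataFrame(rows, columns=column_names)

# A structured NumPy array (one record per row) can also be passed directly;
# its field names become the column labels.
records = np.array(
    rows,
    dtype=[("f0", "f8"), ("f1", "f8"), ("f2", "U2"), ("f3", "f8")],
)
df_from_records = pd.DataFrame(records)

# If the inferred labels are not the ones you want, overwrite them afterwards.
df_from_records.columns = column_names
```

Both frames end up with the same shape and column labels; which input form is more convenient depends on what the loader hands back.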

answered Jul 10 '12 at 14:44 by eumiro · edited by cs95

score 3 · Answer 2 · answered Sep 09 '13 at 03:23

pandas version 0.12 onwards supports loading the Stata format directly.

From the documentation:

The top-level function read_stata will read a dta format file and return a DataFrame:

 pd.read_stata('stata.dta')

The class StataReader will read the header of the given dta file at initialization. Its method data() will read the observations, converting them to a DataFrame which is returned.
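As a rough, self-contained sketch of how this might look in practice: the example below writes a small made-up frame to a temporary .dta file (the file name, column names, and values are all hypothetical) and reads it back. It also assumes a pandas version recent enough that `read_stata` accepts `chunksize`, which can help when the file is too large to load comfortably in one go.

```python
import os
import tempfile

import pandas as pd

# Hypothetical sample data, written to a temporary .dta file so the
# example does not depend on any existing file.
sample = pd.DataFrame(
    {
        "year": [1990.0, 1990.0, 1991.0],
        "country": ["GB", "US", "GB"],
        "value": [-2.368063, 0.2, 371.0],
    }
)
path = os.path.join(tempfile.mkdtemp(), "sample.dta")
sample.to_stata(path, write_index=False)

# Read the whole file straight into a DataFrame.
df = pd.read_stata(path)

# For a very large file, iterate over it in chunks instead;
# each chunk is itself a DataFrame.
with pd.read_stata(path, chunksize=2) as reader:
    chunk_sizes = [len(chunk) for chunk in reader]
```

With a real file, replace `path` with the location of your own .dta file and skip the `to_stata` step.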
