
I am loading data of a size comparable to my memory limit, so I am conscious of efficient indexing and of not making copies. I need to work on columns 3:8 and 9: (also labeled by name), but combining ranges does not seem to work. Rearranging the columns in the underlying data is needlessly costly (an IO operation). Referencing two dataframes and combining them also sounds like something that would make copies. What is an efficient way to do this?

import numpy as np
import pandas as pd

data = pd.read_stata('S:/data/controls/lasso.dta')
X = pd.concat([data.iloc[:, 3:8], data.iloc[:, 9:888]], axis=1)

By the way, if I could read in only half of my data (a random half, even), that would help, again I would not open the original data and save another, smaller copy just for this.

László

1 Answer

import numpy as np
import pandas as pd

data = pd.read_stata('S:/data/controls/lasso.dta')
cols = np.zeros(len(data.columns), dtype=bool)  # boolean mask over columns
cols[3:8] = True
cols[9:888] = True
X = data.iloc[:, cols]  # iloc accepts a boolean array
del data                # drop the reference so the full frame can be freed

This still makes a copy (but just one...). It does not seem to be possible to return a view instead of a copy for this sort of shape (source).
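An alternative to the boolean mask, under the same caveat that one copy is still made, is `np.r_`, which fuses several slice ranges into a single integer index array that `iloc` accepts directly. A minimal sketch with a small stand-in frame (the column names and shape here are illustrative, not from the original data):

```python
import numpy as np
import pandas as pd

# Small frame standing in for the Stata data.
data = pd.DataFrame(np.arange(20).reshape(2, 10),
                    columns=[f"c{i}" for i in range(10)])

# np.r_ concatenates the two ranges into one integer array: [3, 4, 5, 6, 7, 9]
idx = np.r_[3:8, 9:10]
X = data.iloc[:, idx]  # one combined selection; still one copy, like the mask
```

This reads closest to the `3:8` / `9:888` slice notation in the question, so it may be the most direct fix for "combining ranges does not seem to work."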

Another suggestion is converting the .dta file to a .csv file (howto). Pandas read_csv is much more flexible: you can specify the columns you are interested in (usecols), and how many rows you would like to read (nrows). Unfortunately this requires a file copy.
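The `read_csv` route also covers the "random half" wish from the question: `skiprows` accepts a callable that is invoked per row index and skips the row when it returns True. A sketch using an in-memory CSV as a stand-in for the converted file (the column names and data are illustrative):

```python
import io
import random
import pandas as pd

# In-memory CSV standing in for the converted .dta file.
csv = io.StringIO("a,b,c\n" + "\n".join(f"{i},{i},{i}" for i in range(1000)))

random.seed(0)
# usecols restricts which columns are parsed at all; the skiprows callable
# drops roughly half of the data rows at read time (row 0 is the header).
half = pd.read_csv(csv, usecols=["a", "c"],
                   skiprows=lambda i: i > 0 and random.random() < 0.5)
```

Because the filtering happens during parsing, the rejected rows and unused columns are never materialized in memory, unlike subsetting after a full `read_stata`.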

user41047