
I have a CSV file with more than 60M rows. I am only interested in a subset of these rows and would like to load them into a DataFrame.

Here is the code I am using:

import pandas as pd

# Read the file in chunks of 1000 rows and keep only the rows mentioning "Canada"
iter_csv = pd.read_csv('/Users/xxxx/Documents/globqa-pgurlbymrkt-Report.csv', iterator=True, chunksize=1000)
df = pd.concat([chunk[chunk['Site Market (evar13)'].str.contains("Canada", na=False)] for chunk in iter_csv])

It is based on the answer here: "pandas: filter lines on load in read_csv".

I get the following error:

AttributeError: Can only use .str accessor with string values, which use np.object_ dtype in pandas

I can't figure out what's wrong and would appreciate some guidance.

0xsegfault

1 Answer


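Try verifying first that the data in that column really is a string. What does the last chunk, the one you expect to call .str.contains() on, return? It seems the data may be missing, and if so it would not be a string: a chunk in which every value of the column is missing is read as float64 rather than object, and the .str accessor then raises this error.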
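The check suggested above can be sketched as follows. This is a minimal illustration, not the asker's actual data: the CSV text, the `Page` column, and the helper names `chunk_dtypes` and `filter_canada` are made up, and the chunk size is 2 instead of 1000 so that one chunk ends up with only missing values. It first lists the dtype pandas infers for the column in each chunk, then shows one possible fix, which is to pass `dtype` to `read_csv` so the column is read as strings in every chunk.

```python
import io
import pandas as pd

COL = "Site Market (evar13)"

# Made-up sample data standing in for the real report; the second chunk
# (rows c and d) has no value at all in the column of interest.
CSV_TEXT = (
    "Page,Site Market (evar13)\n"
    "a,Canada\n"
    "b,US\n"
    "c,\n"
    "d,\n"
    "e,Canada (FR)\n"
)

def chunk_dtypes(csv_text, chunksize=2):
    """Return the dtype pandas infers for COL in each chunk."""
    reader = pd.read_csv(io.StringIO(csv_text), chunksize=chunksize)
    return [str(chunk[COL].dtype) for chunk in reader]

def filter_canada(csv_text, chunksize=2):
    """Keep only rows whose COL contains 'Canada', forcing COL to be read as str."""
    reader = pd.read_csv(
        io.StringIO(csv_text), chunksize=chunksize, dtype={COL: str}
    )
    parts = [chunk[chunk[COL].str.contains("Canada", na=False)] for chunk in reader]
    return pd.concat(parts, ignore_index=True)

dtypes = chunk_dtypes(CSV_TEXT)   # the all-missing chunk is inferred as float64
result = filter_canada(CSV_TEXT)  # rows a and e
```

With the real file, the same `dtype={'Site Market (evar13)': str}` argument could be added to the `read_csv` call from the question; whether missing values are really the cause there is an assumption until the per-chunk dtypes have been inspected.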

  • If this is the case, is there a dropna equivalent for the chunk? I am aware that it's not a data frame and might not have this – 0xsegfault Jun 30 '17 at 15:23
  • Not that I am aware of, I'm afraid; my experience with dropna is very limited. It appears you can still request arbitrary samples from each frame using https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.dropna.html – Jimmie Hansson Jun 30 '17 at 15:31
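On the dropna question in the comments: as far as I can tell, each chunk yielded by `read_csv` with `chunksize` is an ordinary DataFrame, so `DataFrame.dropna` can be called on it directly. A minimal sketch with made-up data (the `Page` column and the `filter_chunk` helper are hypothetical). Dropping the missing rows does not by itself change the column's dtype, so the cast to `str` is kept as a safeguard for a chunk that was inferred as float64.

```python
import io
import pandas as pd

COL = "Site Market (evar13)"

# Made-up sample; rows c and d have no value in COL.
CSV_TEXT = "Page,Site Market (evar13)\na,Canada\nb,US\nc,\nd,\ne,Canada (FR)\n"

def filter_chunk(chunk):
    """Drop rows with a missing COL, then keep those containing 'Canada'."""
    present = chunk.dropna(subset=[COL])
    # astype(str) guards against a chunk whose column was inferred as float64;
    # astype(bool) keeps the mask boolean even when the chunk is empty.
    mask = present[COL].astype(str).str.contains("Canada").astype(bool)
    return present.loc[mask]

reader = pd.read_csv(io.StringIO(CSV_TEXT), chunksize=2)
kept = pd.concat([filter_chunk(chunk) for chunk in reader], ignore_index=True)
```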