I have a CSV file with 12 million rows and 9 columns. I'm getting a KeyError, not a MemoryError, so this isn't a duplicate of the out-of-memory questions. I need to read the file and get the second-lowest rate for each zipcode.
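To make the goal concrete, here is a toy example with made-up numbers (the real file is far too big to paste):

import pandas as pd

df = pd.DataFrame({
    'zipcode': ['10001', '10001', '10001', '10002', '10002'],
    'rate': [12.0, 10.0, 11.0, 20.0, 18.0],
})

# two lowest rates per zipcode, then keep the larger of the two
# note: a zipcode with only one row would just return its only rate
second_lowest = (
    df.groupby('zipcode')['rate']
      .nsmallest(2)                      # ascending within each zipcode
      .groupby(level='zipcode').last()   # last of the two = second lowest
)
print(second_lowest)  # 10001 -> 11.0, 10002 -> 20.0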
I've read that to work with big datasets from CSV files, you need to read them in chunks and apply your code to each chunk.
Here is what I have so far:
import pandas as pd

for df in pd.read_csv('slcsp/new_df.csv', sep='\t', iterator=True, chunksize=1000):
    df.groupby('zipcode').rate.nsmallest(2).reset_index().drop('level_1', 1) \
      .drop_duplicates(subset=['zipcode'], keep='last')
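For clarity, the complete plan looks like the sketch below: reduce each chunk to its two lowest rates per zipcode, then combine the partial results and reduce once more. The final combine step is my own assumption about how a chunked computation is supposed to be finished, and none of it runs yet because the loop itself raises:

import pandas as pd

parts = []
for chunk in pd.read_csv('slcsp/new_df.csv', sep='\t', chunksize=1000):
    # keep only this chunk's two lowest rates per zipcode
    two_lowest = (chunk.groupby('zipcode')['rate']
                       .nsmallest(2)
                       .reset_index()[['zipcode', 'rate']])
    parts.append(two_lowest)

# the global two lowest per zipcode must be among the per-chunk survivors,
# so one more pass over the much smaller combined frame finishes the job
combined = pd.concat(parts, ignore_index=True)
answer = (combined.groupby('zipcode')['rate']
                  .nsmallest(2)
                  .groupby(level='zipcode').last())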
But I'm getting this error:
KeyError: 'zipcode'

I've checked, and there is a column named zipcode.
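Roughly how I verified it (a minimal sketch of the check; the nrows=0 probe is just a way to confirm what the parser itself sees, not part of the failing script):

import pandas as pd

# raw header line straight from the file -- 'zipcode' is in there
with open('slcsp/new_df.csv') as f:
    print(f.readline())

# column names exactly as pandas parses them with sep='\t'
print(pd.read_csv('slcsp/new_df.csv', sep='\t', nrows=0).columns.tolist())

Full traceback: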
Traceback (most recent call last):
File "slcsp/slcsp.py", line 19, in <module>
df.loc[df.groupby('zipcode').rate.rank(method='first').eq(2),['zipcode','rate']]
File "D:\virtual_envs\web_scrapping\lib\site-packages\pandas\core\generic.py", line 7632, in groupby
observed=observed, **kwargs)
File "D:\virtual_envs\web_scrapping\lib\site-packages\pandas\core\groupby\groupby.py", line 2110, in groupby
return klass(obj, by, **kwds)
File "D:\virtual_envs\web_scrapping\lib\site-packages\pandas\core\groupby\groupby.py", line 360, in __init__
mutated=self.mutated)
File "D:\virtual_envs\web_scrapping\lib\site-packages\pandas\core\groupby\grouper.py", line 578, in _get_grouper
raise KeyError(gpr)
KeyError: 'zipcode'