I have a large file (2 GB) of categorical data (mostly NaN, but populated here and there with actual values) that is too large to read into a single data frame. I had a rather difficult time coming up with an object to store all the unique values for each column (which is my goal; eventually I need to factorize the data for modeling).
What I ended up doing was reading the file into a dataframe in chunks and then collecting the unique values of each column in a list of lists. My solution works, but it seems most un-Pythonic. Is there a cleaner way to accomplish this in Python (3.5)? I do know the number of columns (~2100).
import pandas as pd

# large file of comma-separated text data
data = pd.read_csv("./myratherlargefile.csv", chunksize=100000, dtype=str)

collist = []
master = []
i = 0
initialize = 0
for chunk in data:
    # the first time through I have to build the "master" list
    if initialize == 0:
        for col in chunk:
            # thinking about this, I should have just dropped this column
            if col == 'Id':
                continue
            else:
                # use pd.unique as a built-in solution to get unique values
                collist = chunk[col][chunk[col].notnull()].unique().tolist()
                master.append(collist)
                i = i + 1
    # but after the first chunk, just extend each element of the
    # master list with that column's new unique values
    if initialize == 1:
        for col in chunk:
            if col == 'Id':
                continue
            else:
                collist = chunk[col][chunk[col].notnull()].unique().tolist()
                master[i] = master[i] + collist
                i = i + 1
    initialize = 1
    i = 0
After that, my final task for all the unique values is as follows:
i = 0
names = chunk.columns.tolist()
for item in master:
    # deduplicate the values accumulated across the chunks
    master[i] = list(set(item))
    # append the column name (list.append mutates in place, so no
    # reassignment is needed; names[0] is 'Id', which was skipped)
    master[i].append(names[i + 1])
    i = i + 1
Thus master[i] gives me the unique values for each column, with the column name appended at the end. Crude, but it does work; my main concern is whether there is a "better" way to build the list, if possible.
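For what it's worth, the direction I suspect is cleaner is a dict of sets keyed by column name, which would remove the initialize flag and the index bookkeeping entirely. This is only a sketch of the idea (reusing the same hypothetical file name and Id column as above), not something I have profiled:

import pandas as pd

# column name -> set of unique values, accumulated across chunks
uniques = {}

for chunk in pd.read_csv("./myratherlargefile.csv", chunksize=100000, dtype=str):
    for col in chunk.columns:
        if col == 'Id':
            continue
        # dropna() skips the NaN entries; set.update() unions in place,
        # so duplicates across chunks never accumulate
        uniques.setdefault(col, set()).update(chunk[col].dropna().unique())

Since the sets stay deduplicated as the chunks arrive, no final set(...) pass would be needed, and the factorize step could presumably consume uniques directly.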