
I have a large file (2GB) of categorical data (mostly "NaN"--but populated here and there with actual values) that is too large to read into a single data frame. I had a rather difficult time coming up with an object to store all the unique values for each column (which is my goal--eventually I need to factorize them for modeling).

What I ended up doing was reading the file in chunks into a dataframe, then getting the unique values of each column and storing them in a list of lists. My solution works, but seems most un-Pythonic--is there a cleaner way to accomplish this in Python (ver. 3.5)? I do know the number of columns (~2100).

import pandas as pd
#large file of csv separated text data
data=pd.read_csv("./myratherlargefile.csv",chunksize=100000, dtype=str)

collist=[]
master=[]
i=0
initialize=0
for chunk in data:
    #so the first time through I have to make the "master" list
    if initialize==0:
        for col in chunk:
            #thinking about this, I should have just dropped this col
            if col=='Id':
                continue
            else:
                #use pd.unique as a built-in solution to get unique values
                collist=chunk[col][chunk[col].notnull()].unique().tolist()
                master.append(collist)
                i=i+1
    #but after first loop just append to the master-list at
    #each master-list element
    if initialize==1:
        for col in chunk:
            if col=='Id':
                continue
            else:
                collist=chunk[col][chunk[col].notnull()].unique().tolist()
                #extend this column's running list (duplicates are removed later with set())
                master[i]=master[i]+collist
                i=i+1
    initialize=1
    i=0 

After that, my final task for all the unique values is as follows:

i=0
names=chunk.columns.tolist()
for item in master:
     master[i]=list(set(item))
     master[i].append(names[i+1])  #append() mutates in place and returns None, so don't reassign
     i=i+1

Thus each element of master gives me a column's unique values together with its column name--crude, but it does work--my main concern is building the list in a "better" way if possible.
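
To make the shape of the result concrete: for a hypothetical file with columns Id, color and size (values invented for the example), master ends up looking something like this:

master = [
    ['red', 'blue', 'green', 'color'],  # uniques of "color", column name appended last
    ['S', 'M', 'L', 'size'],            # uniques of "size", column name appended last
]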

RDS
  • Have you considered using a generator to lazily read the file? (look up the yield keyword) – salparadise Sep 27 '16 at 05:34
  • That might do the trick. I am not good with iterators and generators yet, but a cursory glance at the yield keyword as suggested seems to be in the right direction. – RDS Sep 27 '16 at 05:41
  • This is actually using a generator under the hood. The chunksize is doing exactly that. Out of curiosity, are you running on a 32 or 64 bit machine/Python? `import sys; print(sys.maxsize)` should work so long as you are running Python 2.6 – Luke Exton Sep 27 '16 at 05:48
  • I'm guessing you could probably use something like Apache Spark to parallelize it if speed is your concern... – Joran Beasley Sep 27 '16 at 06:12
  • @Luke This was 64-bit Python 3.5 (8GB mem). I may have been able to force it into memory, but as I am working to learn "big data" the important thing is to be able to read something bigger than my memory. This file probably doesn't qualify, but it is a good prototype on my machine at least. @Joran Speed wasn't so much a concern; it was more "clean code"--my main motivation was to understand the advanced workings of Python. This takes about 10 mins for my file, which is really a one-time operation. Thank you both for your comments – RDS Sep 27 '16 at 18:29
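
For reference, the chunksize argument already gives this kind of lazy iteration; a hand-rolled equivalent using a generator (purely illustrative, the function and argument names are made up) would look roughly like:

def read_in_chunks(path, chunk_lines=100000):
    """Lazily yield blocks of lines so the whole file never sits in memory."""
    with open(path) as f:
        header = f.readline()  # consume the header so it stays out of the data blocks
        block = []
        for line in f:
            block.append(line)
            if len(block) == chunk_lines:
                yield block
                block = []
        if block:  # final, possibly shorter, block
            yield block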

1 Answer


Instead of a list of lists, I would suggest using a collections.defaultdict(set).

Say you start with

uniques = collections.defaultdict(set)

Now the loop can become something like this:

for chunk in data: 
    for col in chunk:
        uniques[col] = uniques[col].union(chunk[col].unique())

Note that:

  1. defaultdict always has a set for uniques[col] (that's what it's there for), so you can skip initialize and the related bookkeeping (see the short snippet after these notes).

  2. For a given col, you simply update the entry with the union of the current set (which initially is empty, but it doesn't matter) and the new unique elements.
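
A quick illustration of point 1 (a throwaway snippet, not part of the solution itself):

import collections

uniques = collections.defaultdict(set)
uniques['color'].update(['red', 'blue'])   # no need to check whether 'color' exists yet
uniques['color'].update(['red', 'green'])  # duplicates are absorbed by the set
print(uniques['color'])                    # {'red', 'blue', 'green'} (set order may vary)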

Edit

As Raymond Hettinger notes (thanks!), it is better to use

       uniques[col].update(chunk[col].unique())
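
Putting it together, a minimal end-to-end sketch (the file name, chunk size and the 'Id' skip are taken from the question; dropna() plays the role of the original notnull() filter):

import collections
import pandas as pd

uniques = collections.defaultdict(set)
for chunk in pd.read_csv("./myratherlargefile.csv", chunksize=100000, dtype=str):
    for col in chunk:
        if col == 'Id':
            continue
        uniques[col].update(chunk[col].dropna().unique())

# uniques now maps each column name to the set of its unique non-null values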
Ami Tavory