
There is a huge CSV file that is being read with pd.read_table('file.csv', chunksize=50000). Currently, on each loop iteration I get the value counts for the current chunk using the df.col.value_counts() method and then merge them into a running total. I got that working through loops and tricks with numpy, but I'm wondering if there is a cleaner way to do this using pandas?

Code:

import pandas as pd

prev = None

# LOOP CHUNK DATA
for imdb_basics in pd.read_table(
    'data/imdb.title.basics.tsv',
    dtype={'tconst':str,'originalTitle':str,'startYear':str }, 
    usecols=['tconst','originalTitle','startYear'],
    chunksize=50000,
    sep='\t'
):
    # REMOVE NULL DATA & CONVERT TO NUMBER
    imdb_basics.startYear = imdb_basics.startYear.replace( "\\N", 0 )
    imdb_basics.startYear = pd.to_numeric( imdb_basics.startYear )

    # --- loops and tricks --- !

    tmp = imdb_basics.startYear.value_counts( sort=False )

    current = { 
        'year': list( tmp.keys() ),
        'count': list( tmp.values )
    }
    if prev is None : 
        prev = current
    else:
        for i in range( len( prev['year'] ) ):
            for j in range( len( current['year'] ) ):
                if prev['year'][i] == current['year'][j]:
                    prev['count'][i] += current['count'][j]
        for i in range( len( current['year'] ) ):
            if not ( current['year'][i] in prev['year'] ):
                prev['year'].append( current['year'][i] )
                prev['count'].append( current['count'][i] )

EDIT: I'm working with a large data file, and the remote machine I'm currently using has a very limited amount of memory, so dropping the chunking in pandas is not an option.

Vlad
  • `imdb_basics.startYear = imdb_basics.startYear.replace( "\\N", 0 )`...a couple of things: 1) don't assign columns with the dot-accessor; 2) you should set the values to null via the `na_values` option in `read_table` and fill them with zeros via the `DataFrame.fillna` method. – Paul H Feb 13 '18 at 15:33
  • lastly, you don't need the inner loop. just set your initial `tmp` variable to `None`. If it's `None`, create it as you are now. Then for the next chunk, when it's not none, just use the `DataFrame.add` method, setting the `axis` and `fill_value` parameters properly. The whole point of pandas is that it takes care of data alignment for you – Paul H Feb 13 '18 at 15:36
  • Didn't see `na_values` in the docs. I've now set `na_values='\\N'`, and using your suggested method lets me convert my data to numeric values straight away (I used float as the dtype for startYear, since there is a known issue with integer support for NA values - https://stackoverflow.com/questions/21287624/convert-pandas-column-containing-nans-to-dtype-int). – Vlad Feb 13 '18 at 15:44
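
For reference, here is a minimal sketch of what the comments above suggest, assuming the same file layout and column names as in the question (the float dtype for startYear follows the linked workaround for NA values in integer columns); it replaces both the string replacement and the nested merging loops:

import pandas as pd

total = None  # running value counts across all chunks

for chunk in pd.read_table(
    'data/imdb.title.basics.tsv',
    usecols=['tconst', 'originalTitle', 'startYear'],
    dtype={'tconst': str, 'originalTitle': str, 'startYear': float},
    na_values='\\N',   # parse \N as NaN while reading instead of patching it afterwards
    chunksize=50000,
    sep='\t'
):
    # fill missing years with 0, as the original code did, then count
    counts = chunk['startYear'].fillna(0).value_counts()
    # pandas aligns the year index for us; fill_value covers years that
    # appear in only one of the two Series
    total = counts if total is None else total.add(counts, fill_value=0)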

2 Answers


Like I said in my comments, you don't need to worry about key management. Pandas can do all of that for you. Consider this trivial example with some mock data that has a year column and one other column:

from io import StringIO
import numpy
import pandas
numpy.random.seed(0)

# years to choose from
years = numpy.arange(2000, 2017)

# relative probabilities of a year being selected (2000 has zero weight, so it should be absent)
weights = numpy.linspace(0.0, 0.7, num=len(years))
weights /= weights.sum()

# fake dataframe turned into a fake CSV
x = numpy.random.choice(years, size=200, p=weights)
text = pandas.DataFrame({
    'year': x,
    'value': True
}).to_csv()

Since this is a small file, we can read it all at once to get the "correct" answer:

pandas.read_csv(StringIO(text))['year'].value_counts().sort_index()
2001     1
2002     6
2003     2
2004     6
2005     6
2006    11
2007     9
2008    12
2009    13
2010     9
2011    18
2012    16
2013    29
2014    20
2015    21
2016    21
Name: year, dtype: int64

OK, so now let's try a chunking approach, using pandas methods:

result = None
for chunk in pandas.read_csv(StringIO(text), chunksize=25):
    tmp = chunk['year'].value_counts()
    if result is None:  # first chunk
        result = tmp.copy()
    else:  # all other chunks
        result = result.add(tmp, fill_value=0).astype(int)

final = result.sort_index()
final
2001     1
2002     6
2003     2
2004     6
2005     6
2006    11
2007     9
2008    12
2009    13
2010     9
2011    18
2012    16
2013    29
2014    20
2015    21
2016    21
Name: year, dtype: int64

So it works. Pandas will align and fill the index during basic operations.
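
To make the alignment point concrete, here is a tiny standalone illustration (my addition, not part of the original answer) of how `Series.add` with `fill_value` lines up the labels; it also shows why the loop above needs the `.astype(int)`, since filled additions come back as floats:

a = pandas.Series({2001: 3, 2002: 5})
b = pandas.Series({2002: 2, 2003: 7})

# labels are aligned automatically; fill_value=0 stands in for labels
# that are missing from one of the two Series
a.add(b, fill_value=0)
2001    3.0
2002    7.0
2003    7.0
dtype: float64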

Paul H

You could try dask.dataframe. It's underused because it offers only a subset of the pandas functionality. But if the problem is the ugly syntax that chunking forces on you, you could try this:

import dask.dataframe as dd

df = dd.read_csv('my_big_file.csv')     # lazy: nothing is read yet
counts = df['col'].value_counts()       # still lazy

counts.compute()                        # triggers the chunked computation

Internally, dask deals with chunking, aggregation, etc.
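
As a rough sketch (my adaptation, not part of the original answer), the same idea applied to the question's TSV might look like the following; `dd.read_csv` forwards keyword arguments such as `sep`, `usecols`, `dtype` and `na_values` to pandas, and `blocksize` controls how much of the file each partition reads:

import dask.dataframe as dd

# lazily read the TSV in ~64 MB partitions instead of pandas chunks
imdb = dd.read_csv(
    'data/imdb.title.basics.tsv',
    sep='\t',
    usecols=['tconst', 'originalTitle', 'startYear'],
    dtype={'tconst': str, 'originalTitle': str, 'startYear': float},
    na_values='\\N',
    blocksize='64MB',
)

# per-partition counts are combined when compute() runs
year_counts = imdb['startYear'].value_counts().compute()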

jpp
  • Forgot to mention that I'm working with quite a large file, and the remote machine I'm currently using has a very limited amount of memory – Vlad Feb 13 '18 at 15:34
  • @Vlad, the point is `dask` has almost no memory overhead. It does everything lazily. – jpp Feb 13 '18 at 16:00
  • Just tried that approach on my code and it worked, thanks a lot. Although it doesn't match my original question on how to add up two different dataframes, this seems to be the cleaner way to handle this particular problem. – Vlad Feb 13 '18 at 16:57