I have a huge CSV file that is being read with pd.read_table('file.csv', chunksize=50000). On each loop iteration I compute the value counts for the current chunk using the df.col.value_counts() method and then merge them into a running total. I got it working with loops and some numpy tricks, but I'm wondering if there is a cleaner way to do this using pandas?
Code:
import pandas as pd

prev = None

# LOOP OVER THE FILE IN CHUNKS
for imdb_basics in pd.read_table(
        'data/imdb.title.basics.tsv',
        dtype={'tconst': str, 'originalTitle': str, 'startYear': str},
        usecols=['tconst', 'originalTitle', 'startYear'],
        chunksize=50000,
        sep='\t'
):
    # REPLACE THE "\N" NULL MARKER WITH 0 & CONVERT TO NUMBER
    imdb_basics.startYear = imdb_basics.startYear.replace("\\N", 0)
    imdb_basics.startYear = pd.to_numeric(imdb_basics.startYear)

    # --- loops and tricks --- !
    # VALUE COUNTS FOR THE CURRENT CHUNK
    tmp = imdb_basics.startYear.value_counts(sort=False)
    current = {
        'year': list(tmp.keys()),
        'count': list(tmp.values)
    }
    if prev is None:
        prev = current
    else:
        # ADD COUNTS FOR YEARS ALREADY SEEN IN EARLIER CHUNKS
        for i in range(len(prev['year'])):
            for j in range(len(current['year'])):
                if prev['year'][i] == current['year'][j]:
                    prev['count'][i] += current['count'][j]
        # APPEND YEARS THAT APPEAR FOR THE FIRST TIME IN THIS CHUNK
        for i in range(len(current['year'])):
            if current['year'][i] not in prev['year']:
                prev['year'].append(current['year'][i])
                prev['count'].append(current['count'][i])
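
For reference, the kind of cleaner version I was imagining would keep the running total as a Series and let pandas align the years itself instead of merging lists by hand. This is just an untested sketch of that idea, assuming Series.add with fill_value=0 merges the two indexes the way I expect:

import pandas as pd

total_counts = None
for chunk in pd.read_table(
        'data/imdb.title.basics.tsv',
        dtype={'tconst': str, 'originalTitle': str, 'startYear': str},
        usecols=['tconst', 'originalTitle', 'startYear'],
        chunksize=50000,
        sep='\t'
):
    # same cleaning as above
    chunk.startYear = pd.to_numeric(chunk.startYear.replace("\\N", 0))
    # per-chunk counts, indexed by year
    counts = chunk.startYear.value_counts(sort=False)
    if total_counts is None:
        total_counts = counts
    else:
        # align on year, treat years missing from either side as 0, then add
        total_counts = total_counts.add(counts, fill_value=0)

# fill_value promotes the dtype to float, so cast back to int at the end
total_counts = total_counts.astype(int)

If something like this is the idiomatic way, it would already be much nicer than the nested loops above.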
EDIT: I'm working with a large data file, and the remote machine I'm currently using has a very limited amount of memory, so removing chunking in pandas is not an option.