
My task is to count the number of unique values in the second column that correspond to each unique value in the first column. For example, if I have:

A  B
1  a
1  a
1  b
2  a
2  a
2  a

I want to get something like this:

{1: 2, 2: 1}

But I have a huge CSV file and can't read it all at once, so I use chunksize. How can I do it in a chunk loop?
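For reference, on a small frame that fits in memory I can produce the result I want directly (a minimal sketch, assuming pandas):

import pandas as pd

# small in-memory example reproducing the desired output
df = pd.DataFrame({'A': [1, 1, 1, 2, 2, 2],
                   'B': ['a', 'a', 'b', 'a', 'a', 'a']})
print(df.groupby('A')['B'].nunique().to_dict())   # {1: 2, 2: 1}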

3 Answers


I would try to do it in the following way:

import pandas as pd

df = pd.DataFrame()
chunksize = 10**5

# read only the two relevant columns, dropping duplicates as we go
for t in pd.read_csv(filename, usecols=['A', 'B'], chunksize=chunksize):
    df = pd.concat([df, t.drop_duplicates()], ignore_index=True).drop_duplicates()

print(df.groupby(['A'])['B'].nunique())

or if you need a dictionary:

print(df.groupby(['A'])['B'].nunique().to_dict())

PS: I'm afraid you can't calculate it in separate chunks, because of possible duplicates across chunks. So the best idea I currently have is to collect all your data, dropping duplicates at each step - this might help to reduce the amount of data a bit.
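If it helps, the same per-chunk deduplication can also be folded into a plain dict of sets instead of concatenating DataFrames (a sketch, assuming pandas; `filename` and the column names are as in the question):

import pandas as pd
from collections import defaultdict

seen = defaultdict(set)          # one set of B values per A value
for chunk in pd.read_csv(filename, usecols=['A', 'B'], chunksize=10**5):
    # deduplicate within the chunk, then fold the pairs into the per-key sets
    for a, b in chunk.drop_duplicates().itertuples(index=False):
        seen[a].add(b)

print({a: len(bs) for a, bs in seen.items()})   # e.g. {1: 2, 2: 1}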

PPS: If your resulting deduplicated DF doesn't fit into memory, then I would recommend having a look at the Apache Spark SQL project, where you can process your data frames on a cluster in a distributed manner.
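For example, the same aggregation could be expressed in Spark roughly like this (a sketch, assuming PySpark; `filename` is as in the question, and `header=True` assumes the file keeps its header row):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Spark reads and aggregates the file in a distributed fashion
sdf = spark.read.csv(filename, header=True)
sdf.groupBy('A').agg(F.countDistinct('B').alias('unique_B')).show()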

MaxU - stand with Ukraine
  • @MaxU, the OP still hasn't stated the size of the file... Everyone thinks their files are large and special. They might end up causing breaks in future code using the above example. – Merlin Jun 05 '16 at 19:14
  • @Merlin, I've extended my answer... Reading CSVs in chunks might help if the OP has many duplicates – MaxU - stand with Ukraine Jun 05 '16 at 19:23
  • @Merlin, the point is that dropping duplicates won't change the result set, since the OP wants to count the __unique__ values, but it might reduce memory consumption _dramatically_ if there are lots of duplicates – MaxU - stand with Ukraine Jun 05 '16 at 19:30
  • @MaxU Thanks! It seems to be true. It took about 1.5 hours to calculate, but I tried it on a small file and the result is right – OleksandraK Jun 05 '16 at 20:33
  • @OleksandraK, you are very welcome! I would also suggest changing your storage from CSV to HDF5 if possible - it will be [much faster](http://stackoverflow.com/questions/37010212/what-is-the-fastest-way-to-upload-a-big-csv-file-in-notebook-to-work-with-python/37012035#37012035) (see the sketch after these comments). And as Merlin said, if all your files can fit into memory, then you don't need chunking – MaxU - stand with Ukraine Jun 05 '16 at 20:36
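A rough illustration of the HDF5 suggestion from the comments (a sketch, assuming pandas with PyTables installed; the 'store.h5' path and 'data' key are made up for the example):

import pandas as pd

# one-time conversion: append the CSV chunk by chunk into an HDF5 store in table format
for chunk in pd.read_csv(filename, usecols=['A', 'B'], chunksize=10**5):
    chunk.to_hdf('store.h5', key='data', format='table', append=True)

# later reads can also be chunked, but are much faster than re-parsing the CSV
for chunk in pd.read_hdf('store.h5', key='data', chunksize=10**5):
    ...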

You can do it with a defaultdict as follows:

from collections import defaultdict

col_d = defaultdict(list)
with open('myfile', 'r') as infile:
    for line in infile:
        if 'A' in line or 'B' in line:      # skip the header row
            continue
        cols = line.strip().split('  ')     # columns are separated by two spaces in the sample
        if cols[1] not in col_d[cols[0]]:   # only record values not yet seen for this key
            col_d[cols[0]].append(cols[1])

for key, value in col_d.items():
    print('{0}\t{1}'.format(key, len(value)))
Seekheart

If the number of unique values in B is not too big, you could work with a defaultdict of sets, which would look something like this:

from collections import defaultdict

col_d = defaultdict(set)
with open('file', 'r') as f:
    next(f)                             # skip the header row
    for line in f:
        cols = line.strip().split('  ')
        col_d[cols[0]].add(cols[1])     # sets drop duplicate B values automatically

# replace each set with the number of unique values it holds
for key in col_d:
    col_d[key] = len(col_d[key])
RainbowRevenge