
My task is to count the number of unique values in the second column that correspond to each unique value in the first column. For example, if I have:

A  B
1  a
1  a
1  b
2  a
2  a
2  a

I want to get something like this:

{1: 2, 2: 1}

But I have a huge CSV file and can't read it all at once, so I use chunksize. How can I do it in a chunk loop?
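For reference, on a small frame that fits in memory I can produce the result I want directly (a minimal sketch, assuming pandas):

import pandas as pd

# small in-memory example reproducing the desired output
df = pd.DataFrame({'A': [1, 1, 1, 2, 2, 2],
                   'B': ['a', 'a', 'b', 'a', 'a', 'a']})
print(df.groupby('A')['B'].nunique().to_dict())   # {1: 2, 2: 1}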

3 Answers


I would try to do it in the following way:

import pandas as pd

df = pd.DataFrame()
chunksize = 10**5

# read only the two relevant columns, dropping duplicates as we go
for t in pd.read_csv(filename, usecols=['A', 'B'], chunksize=chunksize):
    df = pd.concat([df, t.drop_duplicates()], ignore_index=True).drop_duplicates()

print(df.groupby(['A'])['B'].nunique())

or if you need a dictionary:

print(df.groupby(['A'])['B'].nunique().to_dict())

PS: I'm afraid you can't calculate it in separate chunks, because of possible duplicates across chunks. So the best idea I currently have is to collect all your data, dropping duplicates at each step - this might help to reduce the amount of data a bit.
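If it helps, the same per-chunk deduplication can also be folded into a plain dict of sets instead of concatenating DataFrames (a sketch, assuming pandas; `filename` and the column names are as in the question):

import pandas as pd
from collections import defaultdict

seen = defaultdict(set)          # one set of B values per A value
for chunk in pd.read_csv(filename, usecols=['A', 'B'], chunksize=10**5):
    # deduplicate within the chunk, then fold the pairs into the per-key sets
    for a, b in chunk.drop_duplicates().itertuples(index=False):
        seen[a].add(b)

print({a: len(bs) for a, bs in seen.items()})   # e.g. {1: 2, 2: 1}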

PPS: If your resulting deduplicated DF doesn't fit into memory, then I would recommend having a look at the Apache Spark SQL project, where you can process your data frames on a cluster in a distributed manner.
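For example, the same aggregation could be expressed in Spark roughly like this (a sketch, assuming PySpark; `filename` is as in the question, and `header=True` assumes the file keeps its header row):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Spark reads and aggregates the file in a distributed fashion
sdf = spark.read.csv(filename, header=True)
sdf.groupBy('A').agg(F.countDistinct('B').alias('unique_B')).show()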

MaxU - stand with Ukraine
  • @MaxU, the OP still hasn't stated the size of the file... Everyone thinks their files are large and special. They might end up causing breaks in future code using the above example. – Merlin Jun 05 '16 at 19:14
  • @Merlin, I've extended my answer... Reading CSVs in chunks might help if the OP has many duplicates – MaxU - stand with Ukraine Jun 05 '16 at 19:23
  • @Merlin, the point is that dropping duplicates won't change the result set, since the OP wants to count the __unique__ values, but it might reduce memory consumption _dramatically_ if there are lots of duplicates – MaxU - stand with Ukraine Jun 05 '16 at 19:30
  • @MaxU Thanks! It seems to be true. It took about 1.5 hours to calculate, but I tried it on a small file and the result is right – OleksandraK Jun 05 '16 at 20:33
  • @OleksandraK, you are very welcome! I would also suggest changing your storage from CSV to HDF5 if possible - it will be [much faster](http://stackoverflow.com/questions/37010212/what-is-the-fastest-way-to-upload-a-big-csv-file-in-notebook-to-work-with-python/37012035#37012035) (see the sketch after these comments). And as Merlin said, if all your files can fit into memory, then you don't need chunking – MaxU - stand with Ukraine Jun 05 '16 at 20:36
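A rough illustration of the HDF5 suggestion from the comments (a sketch, assuming pandas with PyTables installed; the 'store.h5' path and 'data' key are made up for the example):

import pandas as pd

# one-time conversion: append the CSV chunk by chunk into an HDF5 store in table format
for chunk in pd.read_csv(filename, usecols=['A', 'B'], chunksize=10**5):
    chunk.to_hdf('store.h5', key='data', format='table', append=True)

# later reads can also be chunked, but are much faster than re-parsing the CSV
for chunk in pd.read_hdf('store.h5', key='data', chunksize=10**5):
    ...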

You can do it with a defaultdict as follows:

from collections import defaultdict

col_d = defaultdict(list)
with open('myfile', 'r') as infile:
    for line in infile:
        if 'A' in line or 'B' in line:      # skip the header row
            continue
        cols = line.strip().split('  ')     # columns are separated by two spaces in the sample
        if cols[1] not in col_d[cols[0]]:   # only record values not yet seen for this key
            col_d[cols[0]].append(cols[1])

for key, value in col_d.items():
    print('{0}\t{1}'.format(key, len(value)))
Seekheart

If the number of unique values in B is not too big, you could work with a defaultdict of sets, which would look something like this:

from collections import defaultdict

col_d = defaultdict(set)
with open('file', 'r') as f:
    next(f)                             # skip the header row
    for line in f:
        cols = line.strip().split('  ')
        col_d[cols[0]].add(cols[1])     # sets drop duplicate B values automatically

# replace each set with the number of unique values it holds
for key in col_d:
    col_d[key] = len(col_d[key])
RainbowRevenge