Best way to count unique values from CSV in Python?

Question

I need a quick way to count the unique values in a CSV (it's a really big file (>100 MB) that can't be opened in Excel, for example), so I thought of writing a Python script.

The CSV looks like this:

I just need the script to return how many different values are in the file. E.g. for the sample above, the desired output would be:

6

So far this is what I have:

import csv
input_file = open(r'C:\Users\guill\Downloads\uu.csv')
csv_reader = csv.reader(input_file, delimiter=',')
thisdict = {
  "UserId": 1
}

for row in csv_reader:
    if row[0] not in thisdict:
        thisdict[row[0]] = 1

print(len(thisdict)-1)

Seems to be working fine, but I wonder if there's a better/more efficient/elegant way to do this?

If the file consists purely of numbers, as you’ve shown, what is the `UserId`? — Abhijit Sarkar, Nov 16 '20 at 12:12
@AbhijitSarkar, you can ignore that. The file used to have 'UserId' as a header but not anymore, so it's legacy now. I'll remove that to avoid confusion. That's why there's also a -1 on the output. — Guillermo Gruschka, Nov 16 '20 at 12:15
`print(len(set(open('path/to/file.csv'))))` should do the trick. It prints the number of unique lines in the file. It also is memory-efficient by not reading everything at once but instead reading the file line by line and adding the current line to the set. — Niklas Mertsch, Nov 16 '20 at 12:16
@NiklasMertsch it's discouraged to use `open`, especially just for the sake of a one-liner. — gmdev, Nov 16 '20 at 12:22
@gmdev Who discourages this, and why? If I just want to iterate over the lines of a file in a lazy (i.e. memory-efficient) way, what better option is there? After the line is done, the garbage collector will close the file stream, as there is no reference to the file stream anymore. — Niklas Mertsch, Nov 16 '20 at 12:23
Using `with/open` generally improves readability and overall ease-of-use. It is also safer, as you do not have to explicitly call `file.close()` after you are done - it closes the file automatically. Also, if there is more that needs to be done with a file before it's closed, it can improve performance by reducing the number of times you have to open the file. — gmdev, Nov 16 '20 at 12:26
Sure, the line using `open` results in some object having a reference to the file handle returned by `open`, that's a problem and I agree that a context manager (`with open(...) as fp: [do stuff]`) is the way to go. But that's not the case in the line I wrote above. `set(...)` consumes the file stream and as soon as the set is constructed, there is no reference to the file handle, so the garbage collector closes it. — Niklas Mertsch, Nov 16 '20 at 12:35
Does `print(len(set(open('path/to/file.csv'))))` keep the file open? The file handle is anonymous. — yosukesabai, Nov 16 '20 at 12:35
@yosukesabai exactly. The file is not kept open, thus I don't know why there should be any problem. — Niklas Mertsch, Nov 16 '20 at 12:36
@NiklasMertsch, you should make it an answer; I like it better, at least. — yosukesabai, Nov 16 '20 at 19:42
@yosukesabai A magic one-liner is nice for people who know how to do it. They are usually not a good solution for people who need to ask for solutions. That's why I left it as a comment. Step-by-step solutions help inexperienced users much more. — Niklas Mertsch, Nov 16 '20 at 20:08
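
To make the trade-off discussed in these comments concrete, here is a minimal sketch (the function name `count_unique_lines` is hypothetical, not from the thread). It counts distinct lines like the one-liner does, but closes the file through a context manager and strips line endings, so a last line without a trailing newline is not counted as a separate value. It compares whole lines, so it matches the question only if the file has a single column.

```python
def count_unique_lines(path):
    """Return the number of distinct lines in a text file."""
    # The context manager closes the file even if an error occurs,
    # without relying on the interpreter's garbage collection.
    with open(path) as f:
        # Iterating over the file reads one line at a time; only the
        # distinct lines are kept in memory.
        return len({line.rstrip("\r\n") for line in f})
```

In CPython the one-liner's file object is typically closed as soon as its reference count drops to zero, but other Python implementations do not guarantee when that happens, which is one reason the `with` form is often preferred.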

score 2 · Accepted Answer · answered Nov 16 '20 at 12:33

A set is more tailor-made for this problem than a dictionary:

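import csv

# Keep the reader inside the with block so the file is still open while iterating
with open(r'C:\Users\guill\Downloads\uu.csv', newline='') as f:
    csv_reader = csv.reader(f, delimiter=',')
    uniqueIds = set()

    for row in csv_reader:
        uniqueIds.add(row[0])  # the first column holds the ID

print(len(uniqueIds))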
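
The same set-based idea can be wrapped in a small reusable function. This is only a sketch: the name `count_unique_column` and its `column` and `has_header` parameters are assumptions made for illustration and do not appear in the question. It skips blank lines, which would otherwise raise an `IndexError` on `row[0]`, and can optionally skip a header row such as the old `UserId` line.

```python
import csv


def count_unique_column(path, column=0, has_header=False):
    """Count distinct values in one column of a CSV file."""
    # newline="" is what the csv module documentation recommends for files
    # that are passed to csv.reader.
    with open(path, newline="") as f:
        reader = csv.reader(f, delimiter=",")
        if has_header:
            next(reader, None)  # discard the header row, if there is one
        # csv.reader yields an empty list for a blank line, so the length
        # check skips rows that do not have the requested column.
        return len({row[column] for row in reader if len(row) > column})
```

For the file in the question, `print(count_unique_column(r'C:\Users\guill\Downloads\uu.csv'))` would print the number of distinct IDs. Memory use grows with the number of distinct values, not with the size of the file.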

score 0 · Answer 2 · answered Nov 16 '20 at 12:35

Use a set instead of a dict, like this:

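import csv

# Open the file in a with block so it is closed automatically
with open(r'C:\Users\guill\Downloads\uu.csv') as input_file:
    csv_reader = csv.reader(input_file, delimiter=',')
    aa = set()  # unique values seen in the first column
    for row in csv_reader:
        aa.add(row[0])
print(len(aa))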

answered by jackie zhong
