I need a quick way of counting unique values in a CSV (it's a really big file, >100 MB, that can't be opened in Excel, for example), and I thought of creating a Python script.

The CSV looks like this:

431231
3412123
321231
1234321
12312431
634534

I just need the script to return how many different values are in the file. E.g. for the sample above, the desired output would be:

6

So far this is what I have:

import csv
input_file = open(r'C:\Users\guill\Downloads\uu.csv')
csv_reader = csv.reader(input_file, delimiter=',')
thisdict = {
  "UserId": 1
}

for row in csv_reader:
    if row[0] not in thisdict:
        thisdict[row[0]] = 1

print(len(thisdict)-1)

Seems to be working fine, but I wonder if there's a better/more efficient/elegant way to do this?

Guillermo Gruschka
  • Yes. Use a set instead of a dict. – Pedro Rodrigues Nov 16 '20 at 12:12
  • If the file consists purely of numbers, as you’ve shown, what is the `UserId`? – Abhijit Sarkar Nov 16 '20 at 12:12
  • @AbhijitSarkar, you can ignore that. The file used to have 'UserId' as a header but not anymore, so it's legacy now. I'll remove that to avoid confusion. That's why there's also a -1 on the output. – Guillermo Gruschka Nov 16 '20 at 12:15
  • `print(len(set(open('path/to/file.csv'))))` should do the trick. It prints the number of unique lines in the file. It also is memory-efficient by not reading everything at once but instead reading the file line by line and adding the current line to the set. – Niklas Mertsch Nov 16 '20 at 12:16
  • @NiklasMertsch it's discouraged to use `open`, especially just for the sake of a one-liner. – gmdev Nov 16 '20 at 12:22
  • @gmdev Who discourages this, and why? If I just want to iterate over the lines of a file in a lazy (i.e. memory-efficient) way, what better option is there? After the line is done, the garbage collector will close the file stream, as there is no reference to the file stream anymore. – Niklas Mertsch Nov 16 '20 at 12:23
  • Using `with/open` generally improves readability and overall ease of use. It is also safer, as you do not have to explicitly call `file.close()` after you are done: it closes the file automatically. Also, if there is more that needs to be done with a file before it's closed, it can improve performance by reducing the number of times you have to open the file. – gmdev Nov 16 '20 at 12:26
  • Sure, the line using `open` results in some object having a reference to the file handle returned by `open`, that's a problem and I agree that a context manager (`with open(...) as fp: [do stuff]`) is the way to go. But that's not the case in the line I wrote above. `set(...)` consumes the file stream and as soon as the set is constructed, there is no reference to the file handle, so the garbage collector closes it. – Niklas Mertsch Nov 16 '20 at 12:35
  • `print(len(set(open('path/to/file.csv'))))` Does this keep the file open? The file handle is anonymous. – yosukesabai Nov 16 '20 at 12:35
  • @yosukesabai exactly. The file is not kept open, thus I don't know why there should be any problem. – Niklas Mertsch Nov 16 '20 at 12:36
  • @NiklasMertsch, you should make it an answer. I like it better, at least. – yosukesabai Nov 16 '20 at 19:42
  • @yosukesabai A magic one-liner is nice for people who know how to do it. They are usually not a good solution for people who need to ask for solutions. That's why I left it as a comment. Step-by-step solutions help inexperienced users much more. – Niklas Mertsch Nov 16 '20 at 20:08
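
For readers who want the one-liner from the comments in a more explicit form, here is a minimal sketch. It wraps the file in a context manager so it is closed deterministically, and it strips each line so that a final line without a trailing newline is not counted as a separate value. The function name `count_unique_lines` and the decision to skip blank lines are assumptions made for illustration, not something the commenters proposed.

```python
def count_unique_lines(path):
    """Count distinct non-empty lines in a text file, reading it lazily."""
    with open(path) as f:
        # The set comprehension consumes the file one line at a time,
        # so the whole file is never held in memory as a list of lines.
        return len({line.strip() for line in f if line.strip()})
```

Note that this treats each whole line as one value, so it only matches the CSV-based approaches when the file has a single column.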

2 Answers

A set is more tailor-made for this problem than a dictionary:

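import csv

uniqueIds = set()

# Read the rows inside the with block, while the file is still open
with open(r'C:\Users\guill\Downloads\uu.csv', newline='') as f:
    csv_reader = csv.reader(f, delimiter=',')
    for row in csv_reader:
        if row:  # csv.reader yields an empty list for blank lines
            uniqueIds.add(row[0])

print(len(uniqueIds))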
James Shapiro

Use a set instead of a dict, like this:

import csv

aa = set()
# The with block closes the file once reading is done
with open(r'C:\Users\guill\Downloads\uu.csv', newline='') as input_file:
    csv_reader = csv.reader(input_file, delimiter=',')
    for row in csv_reader:
        aa.add(row[0])
print(len(aa))
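
The set-based approach in both answers could also be folded into a small reusable function. This is only a sketch under stated assumptions: the name `count_unique_first_column` and the `has_header` parameter are hypothetical, added to cover the legacy `UserId` header mentioned in the question's comments, and blank lines are skipped because `csv.reader` yields an empty list for them.

```python
import csv

def count_unique_first_column(path, has_header=False):
    """Return the number of distinct values in the first CSV column."""
    unique_values = set()
    with open(path, newline="") as f:
        reader = csv.reader(f, delimiter=",")
        if has_header:
            next(reader, None)  # discard the header row, if there is one
        for row in reader:
            if row:  # skip blank lines, which csv.reader returns as []
                unique_values.add(row[0])
    return len(unique_values)
```

It would be called as `print(count_unique_first_column(r'C:\Users\guill\Downloads\uu.csv'))`. Because rows are read one at a time, memory use grows with the number of distinct values rather than with the file size.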