
I have a large Excel file where column A has a random number in each cell, across all 1 million+ rows of the sheet. I am trying to print the number of occurrences of each of those values into column B. Usually I do this in Excel itself, by sorting the data first and then using the COUNTIF formula. But with so many rows (1 million+), copy-pasting the formula into every row of column B doesn't work: Excel takes forever to compute and hangs frequently. I now want to try doing this with Python.

Any ideas to get me started would be very much appreciated!

Update: Here's what I tried:

import csv
import collections
with open ('test.csv','rb') as f:
    reader = csv.reader(f)
    my_list = list(reader)

#print my_list[1000]
counter = collections.Counter(my_list)
print counter

But I get TypeError: unhashable type: 'list'

Can anyone help?


4 Answers


First a bit of advice: your question has all the right content, but the phrasing is quite poor. I am answering it because of the former, but I feel the need to point out the latter so you can avoid getting so many close votes in the future. "Any ideas to get me started would be very much appreciated!" and "Can anyone help?" are not valid questions for SO. The problem here is that they are fluff that detracts from the real question, to the point that most reviewers will see them as trigger phrases. In your case, you actually have a good clear problem statement, a coding attempt that is nearly spot-on, and all you need is help with a specific exception. Next time, phrase your question to be about your error or actual problem, and stay away from vagueness like "can you help?".

Enough of that.

A CSV reader is an iterable over the rows of the CSV. Each row is a list. Therefore, when you do list(reader), you are actually creating a list of lists. In your case, each list contains only one element, but that is irrelevant to the Counter: lists can't be dictionary keys, so you get your exception. Literally all you need to change is to extract the first element of each row before you pass it to the Counter. Replace my_list = list(reader) with any of the following:

my_list = list(r[0] for r in reader)

OR

my_list = [r[0] for r in reader]

OR

counter = collections.Counter(r[0] for r in reader)

The last one passes a generator expression to the Counter, which evaluates it lazily. It is probably your best option for a very large input, since it never holds the entire data set in memory, only the running counts.

Because the generator is evaluated lazily, you cannot build the Counter outside the with block: by then the file has already been closed, and the generator will raise an error on the first iteration.
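A minimal sketch of that pitfall (the tiny generated test.csv here is just a stand-in for the real file):

```python
import csv
import collections

# Build a tiny sample file standing in for the real test.csv.
with open('test.csv', 'w', newline='') as f:
    csv.writer(f).writerows([['5'], ['3'], ['5']])

with open('test.csv', newline='') as f:
    gen = (row[0] for row in csv.reader(f))  # lazy: nothing read yet

# The with block has exited, so the file is closed and the
# generator fails on its first real iteration.
try:
    collections.Counter(gen)
except ValueError:
    print("too late: the file is already closed")

# The fix: consume the generator while the file is still open.
with open('test.csv', newline='') as f:
    counter = collections.Counter(row[0] for row in csv.reader(f))
print(counter)  # Counter({'5': 2, '3': 1})
```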

You might get a slight speed boost by using operator.itemgetter instead of an explicit r[0] in any of the expressions above. All combined, the example below is pretty close to what you already have:

import csv
from collections import Counter
from operator import itemgetter

with open('test.csv', newline='') as f:   # text mode with newline='' for csv in Python 3
    reader = csv.reader(f)
    g = itemgetter(0)
    counter = Counter(g(r) for r in reader)
print(counter)

Consider using pandas, which is simple to use and optimized for large datasets.

Given

import csv
import random

import pandas as pd

For demonstration, here is a csv file with a single column of random numbers:

random.seed(123)
data = [random.randint(0, 100) for _ in range(25)]

# Write data to csv
filename = "discard.csv"
with open(filename, "w", newline="") as f:
    writer = csv.writer(f)
    for row in data:
        writer.writerow([str(row)])

Code

# Read and count
s = pd.read_csv(filename, header=None, index_col=False)[0]
s.value_counts()

Output

34    2
20    2
6     2
71    2
43    2
42    2
98    1
11    1
99    1
4     1
13    1
31    1
48    1
17    1
52    1
55    1
68    1
89    1
0     1
Name: 0, dtype: int64

Apply the same code to your full dataset.


You could use the same strategy in Python: read the entire sequence of numbers into a list, sort the list and count the number of duplicates.
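A minimal sketch of that strategy (assuming the column values are already in a plain Python list), using itertools.groupby on the sorted data:

```python
from itertools import groupby

data = [42, 7, 42, 13, 7, 42]  # stand-in for the column A values

# Sorting brings duplicates together; groupby then counts each run.
counts = {value: sum(1 for _ in group)
          for value, group in groupby(sorted(data))}
print(counts)  # {7: 2, 13: 1, 42: 3}
```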


The pandas package is a simple way to load Excel data. Then you can use the value_counts() member function of the resulting dataframe. For example,

import pandas as pd
xl = pd.ExcelFile("C:\\Temp\\test.xlsx") # or whatever your filename is
df = xl.parse("Sheet1", header=None)
answer = df[0].value_counts()
print(answer)
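As a one-step alternative sketch, pd.read_excel wraps the ExcelFile/parse pair above (the tiny generated workbook here is just a stand-in for the real file, and writing xlsx assumes openpyxl, pandas' default engine):

```python
import pandas as pd

# Build a tiny workbook standing in for the real test.xlsx.
pd.DataFrame({0: [5, 3, 5, 8]}).to_excel(
    "test.xlsx", sheet_name="Sheet1", header=False, index=False)

# One-step equivalent of ExcelFile("...").parse("Sheet1", header=None).
df = pd.read_excel("test.xlsx", sheet_name="Sheet1", header=None)
print(df[0].value_counts())
```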
    Given the part about millions of lines, it is probably better not to invoke the additional overhead. OP's suggested solution is much more lightweight than using pandas. – Mad Physicist Mar 21 '18 at 04:32