-1

I'm trying to do something very simple: I have a recurring csv file that may have repetitions of emails and I need to find how many times each email is repeated, so I did as follows:

file = open('click.csv')
reader = csv.reader(file)

for row in reader:
    email = row[0]
    print (email) # just to check which value is parsing
    counter = 1
    for row in reader:
        if email == row[0]:
            counter += 1
            print (counter) # just to check if it counted correctly

and it only prints out:

firstemailaddress

2

Indeed there are 2 occurrences of the first email, but somehow this stops after the first email in the csv file.

So I simplified it to

for row in reader:
   email = row[0]
   print (email)

and this indeed prints out all the Email addresses in the csv file

This is a simple nested loop, so what's the deal here?

Of course, just checking occurrences could be done without a script, but I then have to process those emails and their related data and merge them with another csv file, which is why I'm scripting it.

Many thanks,

7 Answers

1

As answered already, the problem is that reader is an iterator, so it is only good for a single pass. You can just put all the items in a container, like a list.

However, you only need a single pass to count things. Using a dict, the most basic approach is:

counts = {}
for row in reader:
    email = row[0]
    if email in counts:
        # seen before: increment the existing count
        counts[email] += 1
    else:
        # first occurrence: start counting at 1
        counts[email] = 1

There are even cleaner ways. For example, using a collections.Counter object, which is just a dict specialized for counting:

import collections
counts = collections.Counter(row[0] for row in reader)

Or even:

counts = collections.Counter(email for email, *_ in reader)
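
As a rough, self-contained sketch of the Counter approach: the in-memory sample below is hypothetical data standing in for click.csv (io.StringIO is used only so the snippet runs without a file on disk).

```python
import collections
import csv
import io

# Hypothetical sample data standing in for click.csv
sample = io.StringIO(
    "a@example.com,1\n"
    "b@example.com,2\n"
    "a@example.com,3\n"
)

reader = csv.reader(sample)

# Single pass: count the first column of every row
counts = collections.Counter(row[0] for row in reader)

# most_common() yields (email, count) pairs, highest count first
for email, count in counts.most_common():
    print(email, count)
```

A Counter returns 0 for keys it has never seen, so looking up an unknown email does not raise a KeyError.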
juanpa.arrivillaga
  • 88,713
  • 10
  • 131
  • 172
  • `def email_counter(fpath="click.csv"): with open(fpath) as reader: return collections.Counter(reader.readlines())` Would do it then - thanks for this nice solution! – Gwang-Jin Kim Dec 04 '20 at 10:19
  • @Gwan it would not for this question, as you need to get the email from a line that contains more than 1 thing (it's a csv). – Patrick Artner Dec 04 '20 at 10:20
  • 1
    @Gwang-JinKim no, it wouldn't, `csv.reader` objects don't have a `readlines()` method. And as explained above, the op needs the first item from each csv row. Even if the op just wanted to count lines, `collections.Counter(file)` would suffice, no need for `readlines()` (there almost never is; idiomatically, you just use `list(file)` instead of `file.readlines()`, a method which is really only kept around for historical reasons). – juanpa.arrivillaga Dec 04 '20 at 10:27
1

The problem with your first snippet comes down to a misunderstanding of iterators, or how csv.reader works.

Your reader object is an iterator. That means it yields rows and, similar to a generator object, keeps a certain "state" between iterations. Every time you advance it, you "consume" the next available row, until you've consumed all rows and the iterator is entirely exhausted. Here's an example of a different kind of iterator being exhausted:

Imagine you have a text file, file.txt with these lines:

hello
world
this
is
a
test

Then this code:

with open("file.txt", "r") as file:
    
    print("printing all lines for the first time:")
    
    for line in file:
        # strip the trailing newline character
        print(line.rstrip())

    print("printing all lines for the second time:")

    for line in file:
        # strip the trailing newline character
        print(line.rstrip())

    print("Done!")

Output:

printing all lines for the first time:
hello
world
this
is
a
test
printing all lines for the second time:
Done!

If this output surprises you, it's because you've misunderstood how iterators work. In this case, file is an iterator that yields lines. The first for-loop exhausts all available lines in the file iterator. This means the iterator is already exhausted by the time we reach the second for-loop, and there are no lines left to print.

The same thing is true for your reader. You're consuming rows from your csv-file for every iteration of your outer for-loop, and then consuming another row from the inner for-loop. You can expect to have your code behave strangely when you consume your rows in this way.
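
To make this concrete for csv.reader, here is a small sketch that records which rows each loop actually receives. The sample data and the names outer_seen and inner_seen are made up for illustration; io.StringIO stands in for the real file.

```python
import csv
import io

# Hypothetical data: the first email appears twice, the second once
sample = io.StringIO("a@example.com\nb@example.com\na@example.com\n")
reader = csv.reader(sample)

outer_seen = []
inner_seen = []

for row in reader:            # outer loop takes the first row
    outer_seen.append(row[0])
    for inner in reader:      # inner loop drains every remaining row
        inner_seen.append(inner[0])

print(outer_seen)  # rows that reached the outer loop
print(inner_seen)  # rows that were consumed by the inner loop
```

Because both loops pull from the same iterator, the outer loop finds nothing left after the inner loop finishes, which matches the single email printed in the question.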

Paul M.
  • 10,481
  • 2
  • 9
  • 15
1

You cannot use the reader that way: it is stream-based and cannot be "wound back" the way you are trying to. You also never close your file.

Reading the file multiple times is not needed: you can get all the information in one pass through your file, using a dictionary to count the email addresses:

# create demo file
with open("click.csv", "w") as f:
    f.write("email@somewhere, other , value, in , csv\n" * 4)
    f.write("otheremail@somewhere, other , value, in , csv\n" * 2)

Process demo file:

from collections import defaultdict
import csv

emails = defaultdict(int)


with open('click.csv') as file:
    reader = csv.reader(file)
    
    for row in reader:
        email = row[0]
        print (email) # just to check which value is parsing
        emails[email] += 1

for adr,count in emails.items():
    print(adr, count)

Output:

email@somewhere 4
otheremail@somewhere 2


Patrick Artner
  • 50,409
  • 9
  • 43
  • 69
0

I'm not sure I understand your question, but if you want a count per email, you should not use nested loops; use a single loop with a dictionary, like:

cnt_mails[email] = cnt_mails.get(email, 0) + 1

This stores the count. Your code is not working because you have two loops over the same iterator.
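
Put into a full loop, that line might look like the sketch below. The sample rows are hypothetical and io.StringIO stands in for click.csv; only cnt_mails comes from the line above.

```python
import csv
import io

# Hypothetical rows standing in for click.csv
sample = io.StringIO("a@example.com,x\nb@example.com,y\na@example.com,z\n")

cnt_mails = {}
for row in csv.reader(sample):
    email = row[0]
    # get() returns 0 the first time an email is seen
    cnt_mails[email] = cnt_mails.get(email, 0) + 1

print(cnt_mails)
```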

flaxel
  • 4,173
  • 4
  • 17
  • 30
Epykure
  • 1
  • 2
0

The problem is that reader is a stream over the file: you can walk through it in only one direction and cannot go back, similar to how generators are "consumed" by walking through them once. Your nested for-loops would need to iterate over the rows again and again, which is not possible here and is inefficient anyway, because you do not want to recount rows that you have already counted.

So for your purpose, the best approach is to create a dictionary and go row by row: if there is no entry in the dictionary for the email yet, create a new key for it with a count of 1; otherwise increase its count.

import csv

email_counts = {}  # initialize dictionary

# the with-block closes the file once the loop is done
with open('click.csv') as file:
    reader = csv.reader(file)
    for row in reader:
        email_counts[row[0]] = email_counts.get(row[0], 0) + 1

That's it!

email_counts[row[0]] = assigns a new value for that particular email in the dictionary.

The whole trick is in email_counts.get(row[0], 0) + 1. email_counts.get(row[0]) works like email_counts[row[0]], except that it does not raise a KeyError for a missing key; the additional argument 0 is the default value returned in that case. So this means: check whether row[0] has an entry in email_counts. If yes, return its value. Otherwise, if this is the first occurrence, return 0. Whatever is returned, increase it by 1. This handles the membership check and correctly increases the count for the entry.
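
A quick sketch of how dict.get treats present and missing keys, compared with plain indexing. The dictionary contents and the variable names are illustrative only.

```python
email_counts = {"a@example.com": 2}

# Key present: get() returns the stored value
present = email_counts.get("a@example.com", 0)

# Key missing: get() returns the supplied default instead of failing
missing = email_counts.get("b@example.com", 0)

# Without a default, get() returns None for a missing key
no_default = email_counts.get("b@example.com")

# Plain indexing raises KeyError for a missing key
try:
    email_counts["b@example.com"]
    raised = False
except KeyError:
    raised = True

print(present, missing, no_default, raised)
```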

Finally email_counts will give you all entries with their counts. And the best: Just by going once through the file!

Gwang-Jin Kim
  • 9,303
  • 17
  • 30
0

Try collecting your email addresses in a list, then do this:

import pandas as pd
email_list = ["abc@something.com","xyz@something.com","abc@something.com"]
series = pd.Series(email_list)
print(series.value_counts())

You will get output like:

abc@something.com    2
xyz@something.com    1
dtype: int64
astrick
  • 190
  • 1
  • 9
-1

The problem is that reader is an iterator and you are depleting it with your second loop.

If you did something like:

with open('click.csv') as file:
   lines = list(csv.reader(file))

for row in lines:
    email = row[0]
    print (email) # just to check which value is parsing
    counter = 0  # the inner loop also visits this row itself, so start from 0
    for row in lines:
        if email == row[0]:
            counter += 1
            print (counter) # just to check if it counted correctly

You should get what you are looking for.

A simpler implementation:

import csv
from collections import defaultdict

counter = defaultdict(int)
with open('click.csv') as file:
   reader = csv.reader(file)
   
   for row in reader:
      counter[row[0]] += 1

# counter is now a dictionary mapping each email address to the number of times it was seen.
saquintes
  • 1,074
  • 3
  • 11