Convert 2-column counter-like csv file to Python collections.Counter?

Question

I have a comma-separated (,), tab-delimited (\t) file:

68,"phrase"\t
485,"another phrase"\t
43, "phrase 3"\t

Is there a simple approach to throw it into a Python Counter?

I mean, there is no built-in method for your text-file specifically, but it should be pretty straightforward to parse... — juanpa.arrivillaga, Dec 06 '18 at 00:33
I have tried it out and my solution isn't as trivial as I hoped. I'm thinking maybe someone here has a better attempt. Also, if you search "convert csv columns to counter" or something similar nothing comes up on Google. I thought I'd put something up that people could find. I'll post my answer if nothing better shows up. — GollyJer, Dec 06 '18 at 00:40
That is not really a Counter; it is a list over tuples with a string and an int. Is that what you want? Is the Counter to be dynamic? — dawg, Dec 06 '18 at 02:08

Dani Mesejo · Answer 1 · 2018-12-06T02:20:02.943

1

You could use a dictionary comprehension, which is considered more Pythonic and can be marginally faster:

import csv
from collections import Counter


def convert_counter_like_csv_to_counter(file_to_convert):
    with file_to_convert.open(encoding="utf-8") as f:
        csv_reader = csv.DictReader(f, delimiter="\t", fieldnames=["count", "title"])
        the_counter = Counter({row["title"]: int(float(row["count"])) for row in csv_reader})
    return the_counter
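Because the function calls file_to_convert.open(encoding=...), it appears to expect a pathlib.Path rather than a plain string. The sketch below is a minimal usage example under that assumption; the sample data and the file name counts.tsv are made up for illustration, and the function is repeated only so the snippet runs on its own.

```python
import csv
import tempfile
from collections import Counter
from pathlib import Path


def convert_counter_like_csv_to_counter(file_to_convert):
    # Same comprehension-based approach, repeated so the sketch is self-contained.
    with file_to_convert.open(encoding="utf-8", newline="") as f:
        csv_reader = csv.DictReader(f, delimiter="\t", fieldnames=["count", "title"])
        return Counter({row["title"]: int(float(row["count"])) for row in csv_reader})


# Hypothetical sample data: a count, a tab, then the phrase.
sample = "68\tphrase\n485\tanother phrase\n4.3e1\tphrase 3\n"

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "counts.tsv"  # illustrative file name
    path.write_text(sample, encoding="utf-8")
    counter = convert_counter_like_csv_to_counter(path)

print(counter.most_common(1))  # [('another phrase', 485)]
```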

edited Dec 06 '18 at 02:20

answered Dec 06 '18 at 02:13

Dani Mesejo


Interesting... this is consistently about 10% faster. ➕1 – GollyJer Dec 06 '18 at 03:14

GollyJer · Accepted Answer · 2018-12-06T04:54:37.040

I couldn't let this go and stumbled on what I think is the winner.

In testing it was clear that looping through the rows of the csv.DictReader was the slowest part, taking about 30 of the 40 seconds.
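A rough way to reproduce that kind of measurement is to time bare iteration over each reader type. This is only a sketch on synthetic in-memory data (100,000 made-up rows, not the 4 million row file), so absolute numbers will differ from those quoted here.

```python
import csv
import io
import timeit

# Synthetic tab-delimited data: "<count>\t<phrase>" per line.
data = "".join(f"{i}\tphrase {i}\n" for i in range(100_000))


def iterate_dictreader():
    # Each row is built as a dict keyed by the given field names.
    reader = csv.DictReader(io.StringIO(data), delimiter="\t", fieldnames=["count", "title"])
    return sum(1 for _ in reader)


def iterate_reader():
    # Each row is a plain list of strings.
    reader = csv.reader(io.StringIO(data), delimiter="\t")
    return sum(1 for _ in reader)


dict_time = timeit.timeit(iterate_dictreader, number=3)
list_time = timeit.timeit(iterate_reader, number=3)
print(f"csv.DictReader: {dict_time:.3f}s, csv.reader: {list_time:.3f}s")
```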

I switched it to simple csv.reader to see what I would get. This resulted in rows of lists. I wrapped this in a dict to see if it directly converted. It did!

Then I could loop through a native dictionary instead of a csv.DictReader.

The result... done with 4 million rows in 3 seconds!

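import csv
from collections import Counter


def convert_counter_like_csv_to_counter(file_to_convert):
    with file_to_convert.open(encoding="utf-8") as f:
        csv_reader = csv.reader(f, delimiter="\t")
        # Key the intermediate dict by phrase; dict(csv_reader) alone would key on
        # the count, so phrases sharing a count would overwrite each other.
        d = {phrase: count for count, phrase in csv_reader}
        the_counter = Counter({phrase: int(float(count)) for phrase, count in d.items()})

    return the_counter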
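With any dict-based variant it is worth checking which column ends up as the key, since dict() applied to two-item rows uses the first item. The sketch below uses a made-up in-memory sample in which two phrases share a count, to show the difference between keying on the count and keying on the phrase.

```python
import csv
import io
from collections import Counter

# Made-up sample in which two phrases share the count 68.
sample = "68\tphrase\n485\tanother phrase\n68\tphrase 3\n"

rows = list(csv.reader(io.StringIO(sample), delimiter="\t"))

# dict() over [count, phrase] rows keys on the count, so the two "68" rows collide.
by_count = dict(rows)
print(len(by_count))  # 2

# Unpacking each row keys on the phrase instead and keeps all three entries.
by_phrase = Counter({phrase: int(float(count)) for count, phrase in rows})
print(len(by_phrase))  # 3
```

If the counts in a real file are all distinct the two forms agree, but that is an assumption about the data rather than something the csv module guarantees.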

GollyJer · Answer 3 · 2018-12-06T02:01:50.143

0

Here's my best attempt. It works but isn't the fastest.
~~Takes about 1.5 minutes to run on a 4 million line input file.~~
Now takes about 40 seconds on a 4 million line input file after the suggestion by Daniel Mesejo.

Note: the count value in the csv can be in scientific notation and needs conversion, hence the int(float(...)) cast.

import csv
from collections import Counter

def convert_counter_like_csv_to_counter(file_to_convert):

    the_counter = Counter()
    with file_to_convert.open(encoding="utf-8") as f:
        csv_reader = csv.DictReader(f, delimiter="\t", fieldnames=["count", "title"])
        for row in csv_reader:
            the_counter[row["title"]] = int(float(row["count"]))

    return the_counter
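The int(float(...)) cast mentioned in the note can be checked in isolation. A minimal sketch, using an illustrative value rather than one taken from the actual file:

```python
value = "4.3e1"  # illustrative count written in scientific notation

try:
    int(value)
except ValueError as exc:
    # int() only parses integer literals, so the exponent form is rejected.
    print(f"int() rejects it: {exc}")

count = int(float(value))
print(count)  # 43
```

Going through float means very large counts (above 2**53) may not round-trip exactly, which is unlikely to matter for typical phrase counts.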

edited Dec 06 '18 at 02:01

answered Dec 06 '18 at 01:42

GollyJer


1

Why not simply `the_counter[row["title"]] = int(float(row["count"]))`? Also, are you willing to use pandas? – Dani Mesejo Dec 06 '18 at 01:45
Cool. Updated my answer to reflect your suggestion. That cut the time in half. Thanks! I'd prefer not to use pandas but it can't hurt to have some more answers for people to vote on. – GollyJer Dec 06 '18 at 02:03
Also, in the previous comment I missed that you can do `the_counter[row["title"]] = int(row["count"])`. No need for float. – Dani Mesejo Dec 06 '18 at 02:05
No, [you have to cast to float first](https://stackoverflow.com/q/32861429/25197). – GollyJer Dec 06 '18 at 02:07
Are the numbers in scientific notation? – Dani Mesejo Dec 06 '18 at 02:08
