1

I have a comma-separated (,), tab-delimited (\t) file:

68,"phrase"\t
485,"another phrase"\t
43, "phrase 3"\t

Is there a simple approach to throw it into a Python Counter?

GollyJer
  • I mean, there is no built-in method for your text-file specifically, but it should be pretty straightforward to parse... – juanpa.arrivillaga Dec 06 '18 at 00:33
  • Should be trivial to code up, why not try it out? – wim Dec 06 '18 at 00:35
  • I have tried it out and my solution isn't as trivial as I hoped. I'm thinking maybe someone here has a better attempt. Also, if you search "convert csv columns to counter" or something similar nothing comes up on Google. I thought I'd put something up that people could find. I'll post my answer if nothing better shows up. – GollyJer Dec 06 '18 at 00:40
  • That is not really a Counter; it is a list over tuples with a string and an int. Is that what you want? Is the Counter to be dynamic? – dawg Dec 06 '18 at 02:08
  • Good point `dawg`. I removed it from the question. – GollyJer Dec 06 '18 at 02:10

3 Answers

1

You could use a dictionary comprehension, which is considered more Pythonic and can be marginally faster:

import csv
from collections import Counter


def convert_counter_like_csv_to_counter(file_to_convert):
    with file_to_convert.open(encoding="utf-8") as f:
        csv_reader = csv.DictReader(f, delimiter="\t", fieldnames=["count", "title"])
        the_counter = Counter({row["title"]: int(float(row["count"])) for row in csv_reader})
    return the_counter
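
As a quick check of the function above, the following sketch writes a small tab-separated sample to a temporary file and loads it. The sample rows and the two-column layout (count, then phrase, separated by a tab) are assumptions inferred from the `fieldnames` used above, not the asker's actual data, and `demo` is a made-up helper name.

```python
import csv
import tempfile
from collections import Counter
from pathlib import Path


def convert_counter_like_csv_to_counter(file_to_convert):
    # Same approach as above: build the Counter from a dict comprehension.
    with file_to_convert.open(encoding="utf-8", newline="") as f:
        csv_reader = csv.DictReader(f, delimiter="\t", fieldnames=["count", "title"])
        return Counter({row["title"]: int(float(row["count"])) for row in csv_reader})


def demo():
    # Hypothetical sample data: count<TAB>phrase, one pair per line.
    sample = '68\t"phrase"\n485\t"another phrase"\n4.3e1\t"phrase 3"\n'
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "counts.tsv"
        path.write_text(sample, encoding="utf-8")
        return convert_counter_like_csv_to_counter(path)


if __name__ == "__main__":
    # Print the two phrases with the highest counts.
    print(demo().most_common(2))
```

Because the result is a regular `Counter`, methods such as `most_common()` work on it directly.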
Dani Mesejo
1

I couldn't let this go and stumbled on what I think is the winner.

In testing it was clear that looping through the rows of the csv.DictReader was the slowest part; taking about 30 of the 40 seconds.

I switched to a plain csv.reader to see what I would get. Each row came back as a list. I wrapped the reader in dict() to see if it would convert directly. It did!

Then I could loop through a native dictionary instead of a csv.DictReader.

The result... done with 4 million rows in 3 seconds!

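import csv
from collections import Counter


def convert_counter_like_csv_to_counter(file_to_convert):
    with file_to_convert.open(encoding="utf-8") as f:
        # csv.reader yields each row as a [count, phrase] list.
        csv_reader = csv.reader(f, delimiter="\t")
        # Caveat: dict() keys on the count column, so rows that share a
        # count overwrite each other; every row must also have exactly two fields.
        d = dict(csv_reader)
        the_counter = Counter({phrase: int(float(count)) for count, phrase in d.items()})

    return the_counter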
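
If several phrases can share the same count, a dictionary keyed on the count column keeps only the last of them. A possible variant, sketched below, unpacks each row straight from the reader so the phrase is the only key involved. It assumes every row has exactly two tab-separated fields; `rows_to_counter`, `demo`, and the sample data are made up for illustration, and the variant has not been timed against the version above.

```python
import csv
import io
from collections import Counter


def rows_to_counter(lines):
    # csv.reader accepts any iterable of strings; each row is [count, phrase].
    reader = csv.reader(lines, delimiter="\t")
    # Key on the phrase so that repeated counts do not collide.
    return Counter({phrase: int(float(count)) for count, phrase in reader})


def demo():
    # Two phrases that deliberately share the same count.
    sample = io.StringIO('68\t"phrase"\n68\t"another phrase"\n', newline="")
    return rows_to_counter(sample)


if __name__ == "__main__":
    print(demo())
```

Passing an open file object instead of the `StringIO` works the same way, since the reader only needs an iterable of lines.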
GollyJer
0

Here's my best attempt. It works but isn't the fastest.
It originally took about 1.5 minutes to run on a 4-million-line input file. After the suggestion by Daniel Mesejo, it takes about 40 seconds on the same file.

Note: the count value in the CSV can be in scientific notation and needs conversion, hence the int(float(...)) cast.

import csv
from collections import Counter

def convert_counter_like_csv_to_counter(file_to_convert):

    the_counter = Counter()
    with file_to_convert.open(encoding="utf-8") as f:
        csv_reader = csv.DictReader(f, delimiter="\t", fieldnames=["count", "title"])
        for row in csv_reader:
            the_counter[row["title"]] = int(float(row["count"]))

    return the_counter
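
The `int(float(...))` cast from the note can be checked on its own. `int()` rejects a string such as `"4.3e1"`, while `float()` parses it, so going through `float` first handles both plain and scientific notation. The helper names and sample strings below are illustrative only. Note that `int()` truncates any fractional part, and very large counts may lose precision in the float step.

```python
def parse_count(text):
    # float() accepts scientific notation; int() then drops any fractional part.
    return int(float(text))


def int_rejects(text):
    # Returns True when int() alone cannot parse the string.
    try:
        int(text)
    except ValueError:
        return True
    return False


def demo():
    return [parse_count(s) for s in ["68", "4.85e2", "4.3e1"]]


if __name__ == "__main__":
    print(demo())
    print(int_rejects("4.3e1"))
```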
GollyJer