1

I am trying to read data from a CSV file (A), extract data, and write that to a different CSV file (B). In the new file B, I want to have two columns. Column 1 to list names of column 1 in file A and column 2 to list the count of column 1 in file A. So for example, if the file A looks like this without ':' (they are lined up in two columns):

Animal: Gender
Rabbit: Male
Dog: Male
Rabbit: Female
Cat: Male
Cat: Male
Dog: Female
Dog: Male
Turtle: Male

I want the output in file B to look like this (actually in different columns without ':' again):

Animal: Count
Cat: 2
Dog: 3
Rabbit: 2
Turtle: 1

This is the very first time I am doing anything like this and this is what I have so far, but I am failing in having the data printed out in file B and have the "count" done correctly. Could anybody please help me with this?

import csv
ReadData=csv.reader(open('C:\Users\..\FileA.csv','rb'), delimiter=',')

def column(ReadData, i):
    return [row[i] for row in ReadData]

for line in ReadData:
   WriteData=csv.writer(open('C:\Users\..\FileB.csv','wb'),
                        delimiter=' ', quotechar=':', quoting=csv.QUOTE_ALL)
   print column(ReadData,1)

Thank you for your help in advance!

owl
  • 1,841
  • 6
  • 20
  • 30
  • This [link](http://stackoverflow.com/editing-help) shows you how to edit/post with markup – Levon Jul 25 '12 at 22:30
  • Thank you for your quick reply! I have been checking this link but I am having trouble putting spaces... I might be missing something... – owl Jul 25 '12 at 22:31
  • For code just (1) paste it in, (2) highlight/select the block of code and then hit (3) Control-K .. it will shift it to the right (I think 4 columns) and make it appear correctly as code. – Levon Jul 25 '12 at 22:32
  • 2
    Thank you all. I think somebody edited the format when I was trying to fix it. No, this is not a homework. I am learning Python from scratch for my research. – owl Jul 25 '12 at 22:36
  • If I were trying to solve this, I'd do all the reading first, probably using a dictionary to keep a running total, and after reading the whole input, write out my dictionary . . . – ernie Jul 25 '12 at 22:38
  • Thanks, I would probably just write it out as a list or dictionary if the data is small but the actual data is really huge that it is not practical to go through the data... I am such a beginner that I am sorry if I am not understanding what you wrote to me correctly... – owl Jul 25 '12 at 22:43
  • Do either of your files contain any actual commas? Why are you treating them as CSV? – Wooble Jul 25 '12 at 22:48
  • To get totals, you're going to have to read all the data . . . I'd suggest opening the input file, reading through it and keeping tallies somewhere (e.g. a dictionary), and then when you've read every row, write out the dictionary. – ernie Jul 25 '12 at 22:48
  • Two things. First, those aren't CSV files in your example; they look like python dicts, maybe? Second, if your question is about efficiently processing large data sets, you might want to describe what size data and its characteristics in the question; reading into a dict would be reasonable for many applications. – Mark Tozzi Jul 25 '12 at 22:49
  • > WoobleActual data contains no comma but I was treating it as CSV because that was the format I got and was recommended for using. – owl Jul 25 '12 at 22:52
  • > ernie What I know about "Dictionary"(http://docs.python.org/library/stdtypes.html#dict) is this. The text I used had only a few sets that I could write them all. Is there a way to write out the dictionary without actually printing it out in Python and just get the output in a file? – owl Jul 25 '12 at 22:54
  • > Mark Tozzi The size of the data is ~30MB in the original text file. I could not paste them all in Excel because it reached 65536th row and so right now I am just taking part of the data to see if it does what I want it to. – owl Jul 25 '12 at 22:56
  • I am currently looking up the way to "read into a dictionary". – owl Jul 25 '12 at 22:57

4 Answers4

2

To do the counting in Python >=2.7, see this example for collections.Counter. With a collections.defaultdict, see here.

In your call to csv.writer, the quotechar=':' is probably an error (this would make WriteData.writerow(['Hello World', 12345]) emit ":Hello World: 12345", as if the colon was a quotation mark.

Also note that your function column(ReadData, i) consumes ReadData; subsequent calls to ReadData will probably return an empty list (not tested). This is not a problem for your code (at least not now).

This is a solution without the CSV module (after all, these files do not look too much like CSV):

import collections

inputfile = file("A")

counts = collections.Counter()

for line in inputfile:
    animal = line.split(':')[0]
    counts[animal] += 1

for animal, count in counts.iteritems():
    print '%s: %s' % (animal, count)
Community
  • 1
  • 1
tiwo
  • 3,238
  • 1
  • 20
  • 33
  • 1
    Better written as `animals = (line.split(':')[0] for line in inputfile); counts = collections.Counter(animals)` – Jon Clements Jul 25 '12 at 22:54
  • 1
    Thank you so much for all the sources! I will try out that! – owl Jul 25 '12 at 23:03
  • Just to be clear, I wanted to put the example of data using ':' above (although it does not exist in the real data) because I could not line the text with spaces or tabs... I am sorry about that. – owl Jul 25 '12 at 23:12
  • @owl, then `line.split()` will split at any whitespace character. – tiwo Jul 25 '12 at 23:20
  • Thanks! In the actual data, I have multiple columns. Is there a way to specify which column I want to look at? – owl Jul 25 '12 at 23:23
  • If they are separated by any number of spaces (or tabs), then the split() splits the line correctly. The subscript [0] selects the first column, and you may use different subscripts (for example, use your `column(...)` function!). For real use, you will want to wrap the code in a function (maybe `def histogram(lines, column)`, since it is a histogram you are computing) and use `...[column]`. – tiwo Jul 25 '12 at 23:29
  • @tiwo, thank you! In the Powershell (I am using Windows), I got close to what I wanted. There are 16 columns and the column I want to count is the 2nd one. So I put animal=line.split()[1] but then I got an error saying "list index out of range". I tried different numbers but the only time it worked was when I entered '0'. So what I got was a list with 1st and 2nd column. – owl Jul 25 '12 at 23:33
  • @tiwo, To make the situation worse, the 2nd column I want to count has a name with a space, for example "white tiger". In the output I got, I only got "white". – owl Jul 25 '12 at 23:34
  • The "list index out of range" when your are requesting a specific column means that one ore more rows are too short - i.e. do not contain that column. I can't decide what you need to do then. -- If there are spaces in the names, you have two options: (1) Maybe you can get the data in some other format? For example, Excel can read fixed-width format and then write CSV. (2) You can also split fixed-width lines, for example `line[10:25]` returns the 15 characters starting with the 10th. – tiwo Jul 25 '12 at 23:43
  • Hmmm. The issue is that the actual data is too big that it cannot be in Excel (goes over 65536th row) and the length of the string in the first column is not constant. – owl Jul 25 '12 at 23:52
1

I will answer the counting part of your question, and perhaps you can combine it with the csv part of your question.

l = [
    ('Animal','Gender'),
    ('Rabbit','Male'),
    ('Dog','Male'),
    ('Rabbit','Female'),
    ('Cat','Male'),
    ('Cat','Male'),
    ('Dog','Female'),
    ('Dog','Male'),
    ('Turtle','Male')
    ]

d = {}
for k,v in l:
    if not k in d:
        d[k] = 1
    else:
        d[k] += 1

for k in d:
    print "%s: %d" % (k,d[k])

I didn't filter your header line, the output of this code is:

Turtle: 1
Cat: 2
Rabbit: 2
Animal: 1
Dog: 3

EDIT:

You can replace this:

if not k in d:
    d[k] = 1
else:
    d[k] += 1

With this:

d[k] = d.setdefault(k,0) + 1
ChipJust
  • 1,376
  • 12
  • 20
0

Have a look at the itertools module and the groupby function. For example:

from itertools import groupby

animals = [
    ('Rabbit', 'Male'),
    ('Dog', 'Male'),
    ('Rabbit', 'Female'),
    ('Cat', 'Male'),
    ('Cat', 'Male'),
    ('Dog', 'Female'),
    ('Dog', 'Male'),
    ('Turtle', 'Male')
    ]

def get_group_key(animal_data):
    return animal_data[0]

animals = sorted(animals, key=get_group_key)
animal_groups = groupby(animals, get_group_key)

grouped_animals = []
for animal_type in animal_groups:
    grouped_animals.append((animal_type[0], len(list(animal_type[1]))))

print grouped_animals

>>> [('Cat', 2), ('Dog', 3), ('Rabbit', 2), ('Turtle', 1)]
Mark Gemmill
  • 5,889
  • 2
  • 27
  • 22
  • 1
    If a group of animals is not entirely contiguous - this will produce incorrect results (see "Rabbit" in your results above). Note that `sum(1 for _ in iterable)` is one way to get the length of an iterator without materialising a list or other sequence – Jon Clements Jul 25 '12 at 22:59
  • 1
    Thank you for your help! I will try all of your suggestions one by one. – owl Jul 25 '12 at 23:03
  • @Jon Yeah, missed the `sort` on the data. Good point about not materialising a list. – Mark Gemmill Jul 25 '12 at 23:04
0

Depending on the size of data and complexity... you may want to consider using pandas - info at http://pandas.pydata.org/ and available on PyPi.

Note however, this is likely over-kill, but thought I'd throw it in to the mix.

from pandas import DataFrame

# rows is processed from string in the OP
rows = [['Rabbit', ' Male'], ['Dog', ' Male'], ['Rabbit', ' Female'], ['Cat', ' Male'], ['Cat', ' Male'], ['Dog', ' Female'], ['Dog', ' Male'], ['Turtle', ' Male']]

df = pandas.DataFrame(rows, columns=['animal', 'gender'])

>>> df.groupby('animal').agg(len)
        gender
animal        
Cat          2
Dog          3
Rabbit       2
Turtle       1

>>> df.groupby(['animal', 'gender']).agg(len)
animal  gender 
Cat      Male      2
Dog      Female    1
         Male      2
Rabbit   Female    1
         Male      1
Turtle   Male      1
Jon Clements
  • 138,671
  • 33
  • 247
  • 280
  • Thank you for sharing! Do you know if there is a way to overcome actually printing out the combination in "rows" in your code? Actual data I have has hundreds of "animals" with 16 columns... – owl Jul 25 '12 at 23:18
  • @owl Just assign the result to a variable... `pandas` is based on `numpy` arrays, so if you're familiar with that, you've got the ability to do numerical computations efficiently... Bit of a learning curve, but worth it... – Jon Clements Jul 25 '12 at 23:24
  • Thank you for introducing this! I am trying out all the answers I got and have not reached yours but I will try! – owl Jul 25 '12 at 23:35