I have a file with the following input data:

       IN   OUT
data1  2.3  1.3
data2  0.1  2.1
data3  1.5  2.8
dataX  ...  ...

There are thousands of such files, and each has the same rows data1, data2, data3, ..., dataX. I'd like to count how many times each value occurs for each data row and column across all files. Example:

In file 'data1-IN' (filename)

2.3 - 50    (times)
0.1 - 233   (times)
... - ...   (times)

In file 'data1-OUT' (filename)

2.1 - 1024 (times)
2.8 - 120  (times)
... - ...  (times)

In file 'data2-IN' (filename)

0.4 - 312    (times)
0.3 - 202   (times)
... - ...   (times)

In file 'data2-OUT' (filename)

1.1 - 124 (times)
3.8 - 451  (times)
... - ...  (times)

In file 'data3-IN' ...

Which Python data structure would be best for counting such data? I wanted to use a multidimensional dictionary, but I am struggling with KeyErrors etc.

pb100

2 Answers

3

You really want to use collections.Counter, perhaps contained in a collections.defaultdict:

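import collections

# One Counter per output name such as 'data1-IN' or 'data1-OUT'
counts = collections.defaultdict(collections.Counter)

for filename in files:  # files: an iterable of input file paths
    with open(filename) as f:
        next(f, None)  # skip the "IN OUT" header line
        for line in f:
            fields = line.split()  # columns are whitespace-separated
            if len(fields) != 3:
                continue  # skip blank or malformed lines
            name, in_value, out_value = fields
            counts[name + '-IN'][in_value] += 1
            counts[name + '-OUT'][out_value] += 1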
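
To make the idea concrete, here is a minimal self-contained sketch. It assumes whitespace-separated input files with a one-line header, as in the question, and Python 2.7 or later for collections.Counter. The helper name count_values, the sample contents, and the temporary directory are made up for illustration. Counters are keyed by row name plus column (for example 'data1-IN'), and each one is written to a file of that name.

```python
import collections
import os
import tempfile


def count_values(paths):
    """Count occurrences of each value per (row name, column) over all files."""
    counts = collections.defaultdict(collections.Counter)
    for path in paths:
        with open(path) as f:
            next(f, None)  # skip the "IN OUT" header line
            for line in f:
                fields = line.split()
                if len(fields) != 3:
                    continue  # ignore blank or malformed lines
                name, in_value, out_value = fields
                # Values stay as strings, so '2.3' is counted exactly as written.
                counts[name + '-IN'][in_value] += 1
                counts[name + '-OUT'][out_value] += 1
    return counts


# Hypothetical sample inputs standing in for the thousands of real files.
SAMPLES = [
    "       IN   OUT\ndata1  2.3  1.3\ndata2  0.1  2.1\n",
    "       IN   OUT\ndata1  2.3  1.7\ndata2  0.4  2.1\n",
]

workdir = tempfile.mkdtemp()
paths = []
for i, text in enumerate(SAMPLES):
    path = os.path.join(workdir, 'input%d.txt' % i)
    with open(path, 'w') as f:
        f.write(text)
    paths.append(path)

counts = count_values(paths)

# Write one output file per key, most frequent value first.
for key, counter in counts.items():
    with open(os.path.join(workdir, key), 'w') as out:
        for value, n in counter.most_common():
            out.write('%s - %d\n' % (value, n))
```

Because a missing key in the defaultdict creates an empty Counter, and a missing key in a Counter reads as 0, no KeyError can occur while counting.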
Martijn Pieters
  • Python 2.6.4 (r264:75706, Apr 2 2012, 20:24:27) [C] on sunos5: `>>> import collections`, then `>>> counts = collections.defaultdict(collections.Counter)` gives `AttributeError: 'module' object has no attribute 'Counter'` – pb100 Oct 03 '12 at 20:08
  • From http://stackoverflow.com/questions/5079790/python-how-to-update-value-of-key-value-pair-in-nested-dictionary?lq=1 : `dictionary = collections.defaultdict(lambda: collections.defaultdict(int))`. What is the difference between these two definitions? – pb100 Oct 07 '12 at 18:47
  • And http://docs.python.org/library/collections.html says that defaultdict has been available since 2.5 – pb100 Oct 07 '12 at 18:49
  • @przemol: A `Counter` offers more functionality than a `defaultdict` with an `int` value, such as retrieving the top counts, and combining multiple counters in various ways. Read the linked documentation for more details. – Martijn Pieters Oct 07 '12 at 19:13
  • Which version of Python do I need to be able to use collections.Counter? – pb100 Oct 07 '12 at 20:54
  • `collections.Counter` was added in Python 2.7. The backported version on ActiveState I linked to in these comments runs on 2.5 and 2.6 as well. – Martijn Pieters Oct 08 '12 at 07:09
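
The difference discussed in the comments above can be illustrated with a small sketch. The sample values and variable names are made up for illustration. Both structures produce the same counts; Counter (Python 2.7+) adds conveniences such as most_common and counter addition, while the nested defaultdict(int) form also works on Python 2.5 and 2.6.

```python
import collections

# Hypothetical values seen for 'data1-IN' across several files.
values = ['2.3', '0.1', '2.3', '2.3', '0.1', '1.5']

# Variant 1: nested defaultdict with int values (works on Python 2.5+).
nested = collections.defaultdict(lambda: collections.defaultdict(int))
for v in values:
    nested['data1-IN'][v] += 1

# Variant 2: defaultdict of Counter objects (needs Python 2.7+).
counters = collections.defaultdict(collections.Counter)
for v in values:
    counters['data1-IN'][v] += 1

# Both hold identical counts.
same = dict(nested['data1-IN']) == dict(counters['data1-IN'])

# Extras that only Counter provides directly:
top_two = counters['data1-IN'].most_common(2)
merged = counters['data1-IN'] + collections.Counter({'2.3': 10})

# The same "top N" with the nested dict needs a manual sort.
top_two_manual = sorted(nested['data1-IN'].items(),
                        key=lambda kv: kv[1], reverse=True)[:2]
```

On an interpreter without Counter, the first variant is a drop-in replacement for plain counting.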
1

I have recently started using the pandas DataFrame. pandas has a CSV reader and makes slicing and dicing data very simple.
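
As a rough sketch of how this could look for the question's data, assuming pandas is installed and the files are whitespace-separated with a one-line header: the column name 'name' and the in-memory sample contents are made up for illustration, and real code would pass file paths to read_csv instead of StringIO objects.

```python
import io

import pandas as pd

# Hypothetical sample inputs standing in for the real files.
SAMPLES = [
    "       IN   OUT\ndata1  2.3  1.3\ndata2  0.1  2.1\n",
    "       IN   OUT\ndata1  2.3  1.7\ndata2  0.4  2.1\n",
]

# Read each file, skipping its header line and naming the columns explicitly.
# dtype=str keeps values exactly as written, so '2.3' is not turned into a float.
frames = [
    pd.read_csv(io.StringIO(text), sep=r'\s+', skiprows=1,
                names=['name', 'IN', 'OUT'], dtype=str)
    for text in SAMPLES
]
df = pd.concat(frames, ignore_index=True)

# Count each value per row name, separately for the IN and OUT columns.
in_counts = df.groupby('name')['IN'].value_counts()
out_counts = df.groupby('name')['OUT'].value_counts()

print(in_counts)
```

Each result is a Series indexed by (name, value), so a single count can be looked up with, for example, in_counts.loc[('data1', '2.3')].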

Tooblippe