5

I am working with a csv file with 3 columns that looks like this:

timeStamp, value, label
15:22:57, 849, CPU pid=26298:percent
15:22:57, 461000, JMX MB
15:22:58, 28683, Disks I/O
15:22:58, 3369078, Memory pid=26298:unit=mb:resident
15:22:58, 0, JMX 31690:gc-time
15:22:58, 0, CPU pid=26298:percent
15:22:58, 503000, JMX MB

The label column contains distinct values (say a total of 5), which include spaces, colons and other special characters.

What I am trying to achieve is to plot time against each metric (either on the same plot or on separate ones). I can do this with matplotlib, but I first need to group the [timeStamps, value] pairs according to the 'label'.

I looked into the csv.DictReader to get the labels and the itertools.groupby to group by the 'label', but I am struggling to do this in a proper 'pythonic' way.

Any suggestion?

colidyre
Argyrios Tzakas
  • What's your expected output for your example? – waitingkuo Apr 25 '13 at 09:30
  • Well, given that my input samples are all mixed, I will probably have to separate the [times] and [values] for each [label]. My initial approach was to read the labels, then read each line and store each [time,value] pair based on the label. But this seems like a very inefficient way, which is why I wanted to look into dictionaries and iterators. – Argyrios Tzakas Apr 25 '13 at 09:45

3 Answers

6

You don't need groupby; you want to use collections.defaultdict to collect series of [timestamp, value] pairs keyed by label:

from collections import defaultdict
import csv

per_label = defaultdict(list)

with open(inputfilename, newline='') as inputfile:
    reader = csv.reader(inputfile)
    next(reader, None)  # skip the header row

    for timestamp, value, label in reader:
        per_label[label.strip()].append([timestamp.strip(), float(value)])

Now per_label is a dictionary with labels as keys, and a list of [timestamp, value] pairs as values; I've stripped off whitespace (your input sample has a lot of extra whitespace) and turned the value column into floats.

For your (limited) input sample that results in:

{'CPU pid=26298:percent': [['15:22:57', 849.0], ['15:22:58', 0.0]],
 'Disks I/O': [['15:22:58', 28683.0]],
 'JMX 31690:gc-time': [['15:22:58', 0.0]],
 'JMX MB': [['15:22:57', 461000.0], ['15:22:58', 503000.0]],
 'Memory pid=26298:unit=mb:resident': [['15:22:58', 3369078.0]]}
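For the plotting step from the question, a minimal matplotlib sketch building on the `per_label` dictionary above (the two sample series are hard-coded here for illustration; in practice you would use the dictionary produced by the loop):

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend; safe without a display
import matplotlib.pyplot as plt

# per_label as built above: label -> list of [timestamp, value] pairs
per_label = {
    'JMX MB': [['15:22:57', 461000.0], ['15:22:58', 503000.0]],
    'Disks I/O': [['15:22:58', 28683.0]],
}

for label, pairs in per_label.items():
    times = [t for t, _ in pairs]    # x-axis: timestamp strings
    values = [v for _, v in pairs]   # y-axis: metric values
    plt.plot(times, values, marker='o', label=label)

plt.xlabel('timeStamp')
plt.legend()
plt.savefig('metrics.png')  # or plt.show()
```

This draws every metric on one axes; to get separate plots instead, create a new figure inside the loop.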
Martijn Pieters
  • This worked like a charm! Thanks! My file may end up having more than 3 columns, but I think I can handle this. I am assuming I just need to add: for timestamp, value, label, in reader: – Argyrios Tzakas Apr 25 '13 at 10:17
  • @ArgyriosTzakas: You can also just do `for row in reader`, then use indices, `row[1]`, `row[5]`, etc. `csv.DictReader()` would replace the list with a dict so you can do `row['timeStamp']` instead of `row[0]`; leave out the `next()` call if you switch to `csv.DictReader`. – Martijn Pieters Apr 25 '13 at 10:28
  • Thanks Martijn. On a related topic, is there a clever way of splitting my dictionary based on the keys, without having to hard-code? For example, I can do this l1, l2, l3 = per_label.values() but what if I have more than 3 keys? – Argyrios Tzakas Apr 25 '13 at 11:52
  • @ArgyriosTzakas: Why do you need to unpack the values? By unpacking to 3 variables you are already hardcoding variable names. Just loop over the dictionary: `for key in per_label:` or `for value in per_label.values():`, etc., or store all values in a list: `values_list = per_label.values()` then loop over that somewhere else. – Martijn Pieters Apr 25 '13 at 13:48
5

You can try pandas, which provides a nice structure for dealing with data.

Read the csv into a DataFrame

In [123]: import pandas as pd

In [124]: df = pd.read_csv('test.csv', skipinitialspace=True)

In [125]: df
Out[125]: 
  timeStamp    value                              label
0  15:22:57      849              CPU pid=26298:percent
1  15:22:57   461000                             JMX MB
2  15:22:58    28683                          Disks I/O 
3  15:22:58  3369078  Memory pid=26298:unit=mb:resident
4  15:22:58        0                  JMX 31690:gc-time
5  15:22:58        0              CPU pid=26298:percent
6  15:22:58   503000                             JMX MB

Group the DataFrame by label

In [154]: g = df.groupby('label')

Now you can get what you want

In [155]: g.get_group('JMX MB')
Out[155]:
  timeStamp   value   label
1  15:22:57  461000  JMX MB
6  15:22:58  503000  JMX MB
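Iterating over the groupby object also gives you one `(label, sub-DataFrame)` pair per metric, which maps directly onto "one plot per metric". A small sketch with an inline sample frame (standing in for the `read_csv` result above):

```python
import pandas as pd

# Inline stand-in for df = pd.read_csv('test.csv', skipinitialspace=True)
df = pd.DataFrame({
    'timeStamp': ['15:22:57', '15:22:57', '15:22:58', '15:22:58'],
    'value': [849, 461000, 0, 503000],
    'label': ['CPU pid=26298:percent', 'JMX MB',
              'CPU pid=26298:percent', 'JMX MB'],
})

for label, group in df.groupby('label'):
    # group is a sub-DataFrame holding only this label's rows;
    # group['timeStamp'] and group['value'] are ready to hand to plt.plot
    print(label, group['value'].tolist())
```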
waitingkuo
  • Very interesting. I will have a look into Pandas as I will be doing many similar data manipulations. Thanks! – Argyrios Tzakas Apr 25 '13 at 10:29
  • Can you add how to write the grouped CSV to file? Not quite trivial: http://stackoverflow.com/q/10373660/1333493 – Nemo Nov 06 '15 at 15:13
2

You can use numpy.loadtxt:

import numpy as np
result = np.loadtxt('MYFILE', usecols=(0, 1, 2), 
          dtype=[('time', 'S8'), ('values', np.uint), ('label', 'S33')], 
          delimiter=', ')

This will load your table into a structured array, where time is saved as a string of 8 characters ('S8'), values as unsigned integers and the label as a string of at most 33 characters ('S33'; you may have to adjust this size). Then you can index your values by field name:

>>> print result['values']
[    849  461000   28683 3369078       0       0  503000]

And filter according to label, if you want:

>>> print result['values'][result['label'] == 'JMX MB']
[461000 503000]
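To get one series per metric without hard-coding the labels, you can loop over the distinct labels with `np.unique` (a sketch with a small inline array matching the dtype above; note that with the 'S' dtype, labels come back as bytes under Python 3):

```python
import numpy as np

# Small structured array standing in for the np.loadtxt result
result = np.array(
    [('15:22:57', 461000, 'JMX MB'),
     ('15:22:58', 28683, 'Disks I/O'),
     ('15:22:58', 503000, 'JMX MB')],
    dtype=[('time', 'S8'), ('values', np.uint), ('label', 'S33')])

for lbl in np.unique(result['label']):
    mask = result['label'] == lbl          # boolean mask for this label
    print(lbl, result['values'][mask])     # one series per metric
```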

To convert the times from strings to floats, you can use pylab's datestr2num and supply it as a converter to loadtxt:

import pylab
result = np.loadtxt('MYFILE', usecols=(0, 1, 2), 
           dtype=[('time', float), ('values', np.uint), ('label', 'S33')],
           delimiter=', ', converters={0: pylab.datestr2num})
tiago