1

I have data similar to the following in a file:

Name, Age, Sex, School, height, weight, id

Joe, 10, M, StThomas, 120, 20, 111

Jim, 9, M, StThomas, 126, 22, 123

Jack, 8, M, StFrancis, 110, 15, 145

Abel, 10, F, StFrancis, 128, 23, 166

The actual data might be 100 columns and a million rows.

What I am trying to do is create a dict in the following pattern:

school_data = {'StThomas': {'weight':[20,22], 'height': [120,126]},
               'StFrancis': {'weight':[15,23], 'height': [110,128]} }

Things I tried:

  1. Trial 1: (very expensive in terms of computation)

    school_names  = []
    for lines in read_data[1:]:
        data = lines.split('\t')
        school_names.append(data[3])
    
    school_names = set(school_names)
    
    for lines in read_data[1:]:
        for school in schools:
            if school in lines:
                print lines
    
  2. Trial 2:

    for lines in read_data[1:]:
        data = lines.split('\t')
        school_name = data[3]
        height = data[4]
        weight = data[5]
        id = data [6]
        x[id] = {school_name: (weight, height)}
    

The above two are methods in which I tried to proceed but did not get closer to the solution.

jonrsharpe
  • 115,751
  • 26
  • 228
  • 437
  • What are the other columns? Are they relevant for the computation or are you looking to do with those extra columns the same thing that you did with weight/height (group values by School)? – Cadu Oct 07 '16 at 14:04

1 Answers1

1

The easiest way to do this within the standard library is using existing tools, csv.DictReader and collections.defaultdict:

from collections import defaultdict
from csv import DictReader

data = defaultdict(lambda: defaultdict(list))  # *

with open(datafile) as file_:
    for row in DictReader(file_):
        data[row[' School'].strip()]['height'].append(int(row[' height']))
        data[row[' School'].strip()]['weight'].append(int(row[' weight']))

Note that the spaces in e.g. ' School' and the .strip() are necessary because of the spaces in the header row of your input file. Result:

>>> data
defaultdict(<function <lambda> at 0x10261c0c8>, {'StFrancis': defaultdict(<type 'list'>, {'weight': [15, 23], 'height': [110, 128]}), 'StThomas': defaultdict(<type 'list'>, {'weight': [20, 22], 'height': [120, 126]})})
>>> data['StThomas']['height']
[120, 126]

Alternatively, if you're planning to do further analysis, look into something like pandas and its DataFrame data structure.

* see Python defaultdict and lambda if this seems weird

Community
  • 1
  • 1
jonrsharpe
  • 115,751
  • 26
  • 228
  • 437