How to efficiently sort values from CSV file into multiple lists?

Question

I have a very large CSV file with the following structure:

category,value
A,1
A,4
B,2
A,1
B,3
...

What I need are two lists. The first list contains all values from category A, the second list contains all values from category B.

A working solution:

import csv

list_a = []
list_b = []

with open('my_file.csv', mode='r') as f:
    reader = csv.DictReader(f)

    for line in reader:
        if line['category'] == 'A':
            list_a.append(line['value'])
        if line['category'] == 'B':
            list_b.append(line['value'])

Since the CSV file is so large, I would like to avoid the expensive append calls. Is there a more efficient way?

In my experience, `pandas` eats memory for breakfast ... the CSV file is 400MB large and I am afraid the overhead of creating a `pandas.DataFrame` would be huge. — Elias Strehle, Oct 23 '19 at 15:39
you can process large data files in chunks if memory is an issue, see: https://stackoverflow.com/questions/25962114/how-to-read-a-6-gb-csv-file-with-pandas — Dan, Oct 23 '19 at 15:40
the pythonic way with least overhead https://stackoverflow.com/a/8010133/8560382, Consider using queue instead of list? — chrisckwong821, Oct 23 '19 at 16:32

score 2 · Answer 1 · answered Oct 23 '19 at 15:38


import pandas as pd
df = pd.read_csv('my_file.csv')
list_a = df.loc[df['category']=='A', 'value'].values.tolist()
list_b = df.loc[df['category']=='B', 'value'].values.tolist()
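
If memory is the concern raised in the comments, the same idea can be combined with chunked reading. This is only a sketch: the in-memory sample stands in for 'my_file.csv', the chunk size of 2 is purely illustrative (something much larger would be used in practice), and the `lists` dict is a name I made up for the example.

```python
import io
import pandas as pd

# Hypothetical stand-in for 'my_file.csv' so the example is self-contained.
sample = io.StringIO("category,value\nA,1\nA,4\nB,2\nA,1\nB,3\n")

lists = {}  # maps category -> list of values (illustrative name)

# With chunksize set, read_csv yields DataFrames of at most that many rows
# instead of loading the whole file at once.
for chunk in pd.read_csv(sample, usecols=["category", "value"], chunksize=2):
    for category, group in chunk.groupby("category"):
        lists.setdefault(category, []).extend(group["value"].tolist())

list_a = lists["A"]
list_b = lists["B"]
```

Only one chunk is held as a DataFrame at a time, but the resulting lists still have to fit in memory.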

answered Oct 23 '19 at 15:38

Suraj Motaparthy


score 0 · Answer 2 · answered Oct 23 '19 at 16:11

I would suggest using collections.defaultdict in your case.
It still makes one .append call per row (accumulating a list for each category), but it is a much more convenient container if there could be more than 2 categories. The dict lets you keep values for any number of categories:

from collections import defaultdict
import csv

with open('file.csv') as f:
    reader = csv.DictReader(f)
    category_dict = defaultdict(list)

    for line in reader:
        category_dict[line['category']].append(line['value'])

Sample output:

print(category_dict['A'])   # ['1', '4', '1']
print(category_dict['B'])   # ['2', '3']
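
As for the cost of append: list.append is amortized O(1), so it is unlikely to be the bottleneck here. If per-row overhead matters, one possible variation (a sketch I have not benchmarked against your file) is to use csv.reader instead of csv.DictReader, which yields a plain list per row rather than building a dict. It assumes the file has exactly the two columns shown and that every value is an integer; the in-memory sample stands in for the real file.

```python
import csv
import io
from collections import defaultdict

# Hypothetical stand-in for the real CSV file.
sample = io.StringIO("category,value\nA,1\nA,4\nB,2\nA,1\nB,3\n")

category_dict = defaultdict(list)

reader = csv.reader(sample)
next(reader)  # skip the 'category,value' header row

# Assumes exactly two columns per row; unpacking fails otherwise.
for category, value in reader:
    # Assumes integer values; drop int() to keep the raw strings.
    category_dict[category].append(int(value))

print(category_dict['A'])
print(category_dict['B'])
```

Converting to int while reading also avoids a second pass over the lists if you need numbers later.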
