I have a very large CSV file with the following structure:

category,value
A,1
A,4
B,2
A,1
B,3
...

What I need are two lists. The first list contains all values from category A; the second list contains all values from category B.

A working solution:

import csv

list_a = []
list_b = []

with open('my_file.csv', mode='r') as f:
    reader = csv.DictReader(f)

    for line in reader:
        if line['category'] == 'A':
            list_a.append(line['value'])
        if line['category'] == 'B':
            list_b.append(line['value'])

Since the CSV file is so large, I would like to avoid the expensive append calls. Is there a more efficient way?

martineau
Elias Strehle
  • have you tried using `pandas`? – Dan Oct 23 '19 at 15:36
  • In my experience, `pandas` eats memory for breakfast ... the CSV file is 400 MB, and I am afraid the overhead of creating a `pandas.DataFrame` would be huge. – Elias Strehle Oct 23 '19 at 15:39
  • You can process large data files in chunks if memory is an issue; see: https://stackoverflow.com/questions/25962114/how-to-read-a-6-gb-csv-file-with-pandas – Dan Oct 23 '19 at 15:40
  • The Pythonic way with the least overhead: https://stackoverflow.com/a/8010133/8560382. Consider using a queue instead of a list? – chrisckwong821 Oct 23 '19 at 16:32

2 Answers

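import pandas as pd

# Parse only the two columns that are needed
df = pd.read_csv('my_file.csv', usecols=['category', 'value'])
# Select each category's values with a boolean mask, then convert to a plain list
list_a = df.loc[df['category'] == 'A', 'value'].tolist()
list_b = df.loc[df['category'] == 'B', 'value'].tolist()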
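
If holding the whole file in a single DataFrame is a concern, as raised in the comments, the same selection can be done chunk by chunk. The following is only a sketch: the function name `split_by_category` and the chunk size of 100,000 rows are illustrative choices, not part of the original answer, and the best chunk size would need to be measured on the real file.

```python
import pandas as pd


def split_by_category(path, chunksize=100_000):
    """Collect the 'value' column for categories A and B, reading in chunks.

    The function name and the default chunksize are illustrative assumptions.
    """
    list_a, list_b = [], []
    # With chunksize set, read_csv returns an iterator of DataFrames,
    # so only one chunk is held in memory at a time.
    for chunk in pd.read_csv(path, usecols=['category', 'value'],
                             chunksize=chunksize):
        list_a.extend(chunk.loc[chunk['category'] == 'A', 'value'].tolist())
        list_b.extend(chunk.loc[chunk['category'] == 'B', 'value'].tolist())
    return list_a, list_b
```

Calling `split_by_category('my_file.csv')` would return the two lists. Note that pandas infers a numeric type for the value column, so the lists hold integers rather than the strings that the `csv` module returns.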
Suraj Motaparthy

I would suggest using collections.defaultdict here.
It still makes one .append call per row (accumulating a list for each category), but it is a much more convenient container if there could be more than two categories. The dict keeps the values for any number of categories:

from collections import defaultdict
import csv

with open('file.csv') as f:
    reader = csv.DictReader(f)
    category_dict = defaultdict(list)

    for line in reader:
        category_dict[line['category']].append(line['value'])

Sample output:

print(category_dict['A'])   # ['1', '4', '1']
print(category_dict['B'])   # ['2', '3']
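
As the sample output shows, the values come back as strings. On the cost of `.append` itself: appending to a Python list is amortized constant time, so building a dict for every row in `csv.DictReader` is likely the larger per-row cost, though this should be measured on the real file. A possible variant, sketched under the assumptions that every data row has exactly two fields and that every value is an integer, uses the plain `csv.reader` and converts while reading. The function name `group_values` is made up for illustration.

```python
import csv
from collections import defaultdict


def group_values(path):
    """Group the second CSV column by the first, converting values to int.

    Assumes a header row followed by rows of exactly two fields each.
    """
    groups = defaultdict(list)
    # newline='' is what the csv module documentation recommends for open()
    with open(path, newline='') as f:
        reader = csv.reader(f)
        next(reader, None)  # skip the 'category,value' header row
        for category, value in reader:
            groups[category].append(int(value))
    return groups
```

With the sample data from the question, `group_values('file.csv')['A']` would hold the integers 1, 4, 1 instead of their string forms.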
RomanPerekhrest