
I'm working on a function that takes the filename of a CSV, converts each line to a dictionary, and returns a list of the dictionaries created (so later functions can iterate through and organize them). I've gotten it to do what I want with the following, but I feel there's got to be a better way. Any suggestions for improvement?

import re

def import_incidents(filename):
    """Imports CSV and returns list of dictionaries for each incident"""
    with open(filename, 'r') as file:
        data = file.read()
        data = data.split('\n')
        list_of_data = []
        headers = True
        for line in data:
            line = line.split('","')
            if headers == True:
                #Skip header and set to false
                headers = False
            elif len(line) == 1 or line[3] == '':
                #File always ends with a blank line that splits to length 1, skip it.
                #Events can leave blank policies, skip those too.
                pass
            else:
                temp_dict = {}
                temp_dict['id'] = re.sub('"', '', line[0])
                temp_dict['time'] = re.sub('GMT-0600','',line[1])
                temp_dict['source'] = line[2]
                temp_dict['policy'] = line[3]
                temp_dict['destination'] = line[5]
                temp_dict['status'] = line[10]
                list_of_data.append(temp_dict)

    return list_of_data

print(import_incidents('Incidents (Yesterday Only).csv'))

Sample of CSV contents:

"ID","Incident Time","Source","Policies","Channel","Destination","Severity","Action","Maximum Matches","Transaction Size","Status",
"9511564","29 Dec. 2015, 08:33:59 AM GMT-0600","Doe, John","Encrypted files","HTTPS","blah.blah.com","Medium","Permitted","0","47.7 KB","Closed - Authorized",
"1848446","29 Dec. 2015, 08:23:36 AM GMT-0600","Smith, Joe","","HTTP","google.com","Low","Permitted","0","775 B","Closed"
user987654321

2 Answers


You have reinvented the csv.DictReader() class, I'm afraid:

import csv
import re

def import_incidents(filename):
    with open(filename, 'r', newline='') as file:
        reader = csv.DictReader(file)
        for row in reader:
            if not row or not row['Policies']:
                continue
            row['Incident Time'] = re.sub('GMT-0600', '', row['Incident Time'])
            yield row

This relies on the header row for the dictionary keys. You can define your own dictionary keys with the fieldnames argument to DictReader() (the names are matched, in order, to the columns in the file), but then the first row in the file is still read like any other row; use the next() function to skip it (see Skip the headers when editing a csv file using Python). Note that this version is a generator; wrap the call in list() if you need an actual list.
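
For example, a minimal sketch of that approach, using hypothetical short keys to mirror the original function (the fieldnames argument and the next() call to skip the header row are the point here):

import csv

def import_incidents(filename):
    # Hypothetical short keys, matched in order to the columns in the file.
    fields = ['id', 'time', 'source', 'policy', 'channel', 'destination',
              'severity', 'action', 'max_matches', 'size', 'status']
    with open(filename, 'r', newline='') as file:
        reader = csv.DictReader(file, fieldnames=fields)
        next(reader)  # the header row is now just another row, so skip it
        for row in reader:
            if not row['policy']:
                continue
            row['time'] = row['time'].replace('GMT-0600', '')
            yield row

incidents = list(import_incidents('Incidents (Yesterday Only).csv'))

The trailing comma in each data line produces one extra empty value; DictReader collects anything beyond the supplied fieldnames under the None key, so it can simply be ignored.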

Martijn Pieters

Original task: List of dicts

You can use pandas. It is fast and can do it in a few lines:

import pandas as pd
df = pd.read_csv('incidents.csv')
df['Incident Time'] = df['Incident Time'].str.replace('GMT-0600', '')
list_of_data = df.dropna(subset=['Policies']).to_dict(orient='records')

Now list_of_data contains:

[{'Action': 'Permitted',
  'Channel': 'HTTPS',
  'Destination': 'blah.blah.com',
  'ID': 9511564,
  'Incident Time': '29 Dec. 2015, 08:33:59 AM ',
  'Maximum Matches': 0,
  'Policies': 'Encrypted files',
  'Severity': 'Medium',
  'Source': 'Doe, John',
  'Status': 'Closed - Authorized',
  'Transaction Size': '47.7 KB',
  'Unnamed: 11': nan}]

The .dropna(subset=['Policies']) call removes the rows that have NaN in the Policies column, i.e. missing values (read_csv turns the empty "" fields into NaN).
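
A toy illustration of that behaviour (the column names here just mirror the incident file):

import pandas as pd

toy = pd.DataFrame({'ID': [1, 2], 'Policies': ['Encrypted files', None]})
print(toy.dropna(subset=['Policies']))
#    ID         Policies
# 0   1  Encrypted files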

Pandas-only solution:

If you don't want the list of dicts, keep the dataframe:

df = pd.read_csv('incidents.csv', parse_dates=[1]).dropna(subset=['Policies'])

This reads the Incident Time column as a very convenient datetime64[ns] object. The dataframe looks like this:

[screenshot of the resulting dataframe with Incident Time parsed as a datetime]
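
As a sketch of why that dtype is handy (assuming the dataframe parsed as described above), the .dt accessor exposes the date/time parts directly:

# Incidents from before noon, in chronological order
morning = df[df['Incident Time'].dt.hour < 12].sort_values('Incident Time')

# Count of incidents per calendar day
per_day = df.groupby(df['Incident Time'].dt.date).size()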

Mike Müller
  • Would there be any additional advantages to working with pandas? – user987654321 Dec 30 '15 at 22:19
  • Yes. You don't need to convert the dataframe into a list of dicts in the first place. It is way more powerful than a plain dictionary or list. It is also really fast when used in the right way. Lots of it is written in Cython for speed. – Mike Müller Dec 30 '15 at 22:22
  • What's the .dropna(subset=['Policies']) for? – user987654321 Feb 11 '16 at 20:59
  • Also... there's a strange character that gets picked up when reading the .csv: instead of "ID" I get \ufeff"ID" from the file I need to input. Some sort of special character inserted (my guess). Any way to have this stripped off? – user987654321 Feb 11 '16 at 21:04
  • Your file has a byteorder mark (BOM). Try to open the file with the encoding `utf-8-sig`: `pd.read_csv('incidents.csv', encoding='utf-8-sig')`. – Mike Müller Feb 14 '16 at 11:43
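
Putting that last suggestion together with the earlier pandas code, a minimal sketch of the full read (the filename and the utf-8-sig encoding are taken from the comments above; everything else mirrors the answer):

import pandas as pd

# utf-8-sig consumes the BOM, so the first column is named 'ID' again
df = (pd.read_csv('incidents.csv', encoding='utf-8-sig', parse_dates=[1])
        .dropna(subset=['Policies']))
list_of_data = df.to_dict(orient='records')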