
I'm working on a function that takes the filename of a CSV, converts each line to a dictionary, and returns a list of the dictionaries created (so later functions can iterate through and organize them). I've gotten it to do what I want with the following, but I feel there's got to be a better way. Any suggestions for improvement?

import re

def import_incidents(filename):
    """Imports CSV and returns list of dictionaries for each incident"""
    with open(filename, 'r') as file:
        data = file.read()
        data = data.split('\n')
        list_of_data = []
        headers = True
        for line in data:
            line = line.split('","')
            if headers == True:
                #Skip header and set to false
                headers = False
            elif len(line) == 1 or line[3] == '':
                #File always ends with a blank line that splits to length 1, skip it.
                #Events can leave blank policies, skip those too.
                pass
            else:
                temp_dict = {}
                temp_dict['id'] = re.sub('"', '', line[0])
                temp_dict['time'] = re.sub('GMT-0600','',line[1])
                temp_dict['source'] = line[2]
                temp_dict['policy'] = line[3]
                temp_dict['destination'] = line[5]
                temp_dict['status'] = line[10]
                list_of_data.append(temp_dict)

    return list_of_data

print(import_incidents('Incidents (Yesterday Only).csv'))

Sample of CSV contents:

"ID","Incident Time","Source","Policies","Channel","Destination","Severity","Action","Maximum Matches","Transaction Size","Status",
"9511564","29 Dec. 2015, 08:33:59 AM GMT-0600","Doe, John","Encrypted files","HTTPS","blah.blah.com","Medium","Permitted","0","47.7 KB","Closed - Authorized",
"1848446","29 Dec. 2015, 08:23:36 AM GMT-0600","Smith, Joe","","HTTP","google.com","Low","Permitted","0","775 B","Closed"
user987654321

2 Answers


You have reinvented the csv.DictReader() class, I'm afraid:

import csv
import re

def import_incidents(filename):
    with open(filename, 'r', newline='') as file:
        reader = csv.DictReader(file)
        for row in reader:
            if not row or not row['Policies']:
                continue
            row['Incident Time'] = re.sub('GMT-0600', '', row['Incident Time'])
            yield row

This relies on the header row for the dictionary keys. You can define your own dictionary keys with the fieldnames argument to DictReader() (the names are matched, in order, to the columns in the file), but then the first row in the file is still read like any other row; use the next() function to skip it (see Skip the headers when editing a csv file using Python). Note that this version is a generator; wrap the call in list() if you need an actual list.
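
For example, a minimal sketch of that approach, using hypothetical short keys to mirror the original function (the fieldnames argument and the next() call to skip the header row are the point here):

import csv

def import_incidents(filename):
    # Hypothetical short keys, matched in order to the columns in the file.
    fields = ['id', 'time', 'source', 'policy', 'channel', 'destination',
              'severity', 'action', 'max_matches', 'size', 'status']
    with open(filename, 'r', newline='') as file:
        reader = csv.DictReader(file, fieldnames=fields)
        next(reader)  # the header row is now just another row, so skip it
        for row in reader:
            if not row['policy']:
                continue
            row['time'] = row['time'].replace('GMT-0600', '')
            yield row

incidents = list(import_incidents('Incidents (Yesterday Only).csv'))

The trailing comma in each data line produces one extra empty value; DictReader collects anything beyond the supplied fieldnames under the None key, so it can simply be ignored.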

Martijn Pieters

Original task: List of dicts

You can use pandas. It is fast and can do it in a few lines:

import pandas as pd
df = pd.read_csv('incidents.csv')
df['Incident Time'] = df['Incident Time'].str.replace('GMT-0600', '')
list_of_data = df.dropna(subset=['Policies']).to_dict(orient='records')

Now list_of_data contains:

[{'Action': 'Permitted',
  'Channel': 'HTTPS',
  'Destination': 'blah.blah.com',
  'ID': 9511564,
  'Incident Time': '29 Dec. 2015, 08:33:59 AM ',
  'Maximum Matches': 0,
  'Policies': 'Encrypted files',
  'Severity': 'Medium',
  'Source': 'Doe, John',
  'Status': 'Closed - Authorized',
  'Transaction Size': '47.7 KB',
  'Unnamed: 11': nan}]

The .dropna(subset=['Policies']) call removes the rows that have NaN in the Policies column, i.e. missing values (read_csv turns the empty "" fields into NaN).
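
A toy illustration of that behaviour (the column names here just mirror the incident file):

import pandas as pd

toy = pd.DataFrame({'ID': [1, 2], 'Policies': ['Encrypted files', None]})
print(toy.dropna(subset=['Policies']))
#    ID         Policies
# 0   1  Encrypted files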

Pandas-only solution:

If you don't want the list of dicts, keep the dataframe:

df = pd.read_csv('incidents.csv', parse_dates=[1]).dropna(subset=['Policies'])

This reads the Incident Time column as a very convenient datetime64[ns] object. The dataframe looks like this:

[screenshot of the resulting dataframe with Incident Time parsed as a datetime]
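
As a sketch of why that dtype is handy (assuming the dataframe parsed as described above), the .dt accessor exposes the date/time parts directly:

# Incidents from before noon, in chronological order
morning = df[df['Incident Time'].dt.hour < 12].sort_values('Incident Time')

# Count of incidents per calendar day
per_day = df.groupby(df['Incident Time'].dt.date).size()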

Mike Müller
  • Would there be any additional advantages to working with pandas? – user987654321 Dec 30 '15 at 22:19
  • Yes. You don't need to convert the dataframe into a list of dicts in the first place. It is way more powerful than a plain dictionary or list. It is also really fast when used in the right way. Lots of it is written in Cython for speed. – Mike Müller Dec 30 '15 at 22:22
  • What's the .dropna(subset=['Policies']) for? – user987654321 Feb 11 '16 at 20:59
  • Also... there's a strange character that gets picked up when reading the .csv: instead of "ID" I get \ufeff"ID" from the file I need to input. Some sort of special character inserted (my guess). Any way to have this stripped off? – user987654321 Feb 11 '16 at 21:04
  • Your file has a byteorder mark (BOM). Try to open the file with the encoding `utf-8-sig`: `pd.read_csv('incidents.csv', encoding='utf-8-sig')`. – Mike Müller Feb 14 '16 at 11:43
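
Putting that last suggestion together with the earlier pandas code, a minimal sketch of the full read (the filename and the utf-8-sig encoding are taken from the comments above; everything else mirrors the answer):

import pandas as pd

# utf-8-sig consumes the BOM, so the first column is named 'ID' again
df = (pd.read_csv('incidents.csv', encoding='utf-8-sig', parse_dates=[1])
        .dropna(subset=['Policies']))
list_of_data = df.to_dict(orient='records')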