
I have a semicolon-separated file (from a third party) in which each line starts and ends with a space, the fields are quoted with double quotes, and the file ends with a line containing only a space.

 "first_name";"last_name" 
 "John";"Doe" 
 "Anita";"Doe" 

I try to read this with the following code.

import csv
import json

def read_csv(filename):
    result = []
    with open(filename, 'r', encoding='utf-8') as f:
        csv_reader = csv.reader(f, delimiter=';', quotechar='"')
        for line_index, line in enumerate(csv_reader):
            if line_index == 0:
                header = line
                continue
            result.append(dict(zip(header, line)))
    return result

if __name__ == '__main__':
    contents = read_csv('test.txt')
    print(json.dumps(contents, indent=4, sort_keys=True))

This is my expected result:

[
    {
        "first_name": "John",
        "last_name ": "Doe "
    },
    {
        "first_name": "Anita",
        "last_name ": "Doe "
    }
]

However, the reader always takes the double quotes as part of the field, because of the leading spaces, and it also parses the last, space-only line as a row. This is the result I get:

[
    {
        " \"first_name\"": " \"John\"",
        "last_name ": "Doe "
    },
    {
        " \"first_name\"": " \"Anita\"",
        "last_name ": "Doe "
    },
    {
        " \"first_name\"": " "
    }
]

How can I get rid of these leading and trailing spaces before the CSV is parsed? The answer here shows how to remove spaces from the fields after they are read, but that wouldn't be good here, since it's not the contents of the fields that I want to change, but how the fields themselves are parsed.

By the way: I am using Python 3.5.

EDIT

I am skipping empty lines now using the following code:

# Skip empty lines
line = [column.strip() for column in line]
if not any(line):
    continue
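For completeness, here is a sketch of how that skip can fit into the original `read_csv`, combined with `skipinitialspace=True` so the quotes parse correctly. Note that it deliberately does not overwrite `line` with the stripped values, so the trailing spaces in the header and data are preserved:

```python
import csv

def read_csv(filename):
    result = []
    header = None
    with open(filename, 'r', encoding='utf-8') as f:
        # skipinitialspace=True makes the reader ignore the whitespace
        # before each quoted field, so the quotes are recognised as quotes.
        csv_reader = csv.reader(f, delimiter=';', quotechar='"',
                                skipinitialspace=True)
        for line in csv_reader:
            # Skip lines whose fields are all empty after stripping,
            # without replacing the original (unstripped) values.
            if not any(column.strip() for column in line):
                continue
            if header is None:
                header = line
            else:
                result.append(dict(zip(header, line)))
    return result
```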

1 Answer


You can use skipinitialspace=True and a csv.DictReader (which assumes the first row is a header and builds a name->value dict for each row, instead of you zipping it manually), e.g.:

with open(filename) as fin:
    csvin = csv.DictReader(fin, delimiter=';', skipinitialspace=True)
    result = list(csvin)

Alternatively, if only rows with at least one value should be kept (i.e. the last row with no values, or even interim blank rows, should be filtered out), you can use:

result = [row for row in csvin if any(row.values())]

Which'll give you:

[{'first_name': 'John', 'last_name ': 'Doe '},
 {'first_name': 'Anita', 'last_name ': 'Doe '}]

And the result of that using json.dumps(result, indent=4, sort_keys=True) is:

[
    {
        "first_name": "John",
        "last_name ": "Doe "
    },
    {
        "first_name": "Anita",
        "last_name ": "Doe "
    }
]
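Putting both suggestions together, a minimal self-contained sketch (the function name `read_rows` is made up for illustration):

```python
import csv
import json

def read_rows(filename):
    # skipinitialspace=True lets the reader see the quotes that follow
    # the leading spaces, and the list comprehension drops rows whose
    # fields are all empty or None (such as the space-only last line).
    with open(filename, 'r', encoding='utf-8', newline='') as fin:
        csvin = csv.DictReader(fin, delimiter=';', skipinitialspace=True)
        return [row for row in csvin if any(row.values())]
```

`print(json.dumps(read_rows('test.txt'), indent=4, sort_keys=True))` then produces the expected JSON shown in the question.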
  • Thanks, that already helps with the doublequote inside the fields. It still gives me a third element with `first_name=""` and `last_name=None` though. – physicalattraction Jun 16 '17 at 14:23
  • That's probably because SO removes the last line of my sample CSV, which is empty except for a space. – physicalattraction Jun 16 '17 at 14:25
  • @physicalattraction if embedded new lines are not expected within quoted strings, then you can filter out blank lines prior to feeding it into the reader; otherwise, you'll need to filter on whether a row is valid or not using what info you have in the row itself. (or just ignore the last line...) – Jon Clements Jun 16 '17 at 14:28
  • I was afraid of that. I have added my solution to that in my question. – physicalattraction Jun 16 '17 at 14:32
  • @physicalattraction I've updated the answer to include a probably smarter way to handle blank rows. That can be trailing row(s) or genuinely blank rows in the middle which aren't part of having embedded new lines within quoted strings. – Jon Clements Jun 16 '17 at 14:36