
I have a slightly broken CSV file that I want to pre-process before reading it with pandas.read_csv(), i.e. do some search/replace on it.

I tried to open the file and do the pre-processing in a generator that I then hand over to read_csv():

    import re
    import pandas as pd

    def in_stream():
        with open("some.csv") as csvfile:
            for line in csvfile:
                l = re.sub(r'","', r',', line)  # do some fixing
                yield l

    df = pd.read_csv(in_stream())

Sadly, this just throws a

    ValueError: Invalid file path or buffer object type: <class 'generator'>

Although, looking at pandas' source, I'd expect it to be able to work on iterators, and thus on generators.

I only found this article, "Using a custom object in pandas.read_csv()", outlining how to wrap a generator in a file-like object, but it seems to work only on files in byte mode.
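
For reference, the wrapper pattern described there looks roughly like this (a minimal sketch; GenStream is my own name for it, and note that it produces bytes, not text):

    import io

    class GenStream(io.RawIOBase):
        """Byte-mode file-like wrapper around an iterable of strings (sketch)."""
        def __init__(self, iterable):
            self._iter = iter(iterable)
            self._leftover = b""

        def readable(self):
            return True

        def readinto(self, buf):
            try:
                # serve leftover bytes first, then pull the next line
                chunk = self._leftover or next(self._iter).encode("utf-8")
            except StopIteration:
                return 0  # signals EOF
            n = min(len(buf), len(chunk))
            buf[:n] = chunk[:n]
            self._leftover = chunk[n:]
            return n

    # hypothetical usage:
    # df = pd.read_csv(io.BufferedReader(GenStream(in_stream())))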

So in the end I'm looking for a pattern to build a pipeline that opens a file, reads it line-by-line, allows pre-processing and then feeds it into e.g. pandas.read_csv().

MerlinDE

2 Answers


After further investigation of pandas' source, it became apparent that it doesn't simply require an iterable; it also wants it to be a file, expressed by having a read method (see is_file_like() in inference.py).
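
For illustration, that check is roughly equivalent to the following (a paraphrase from memory; the exact implementation may differ between pandas versions):

    def is_file_like(obj):
        # "file-like" here means: has a read or write method AND is iterable
        if not (hasattr(obj, "read") or hasattr(obj, "write")):
            return False
        if not hasattr(obj, "__iter__"):
            return False
        return True

A plain generator is iterable but has no read method, which is why read_csv() rejects it.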

So, I built an iterator class the old-fashioned way:

    import re

    class InFile(object):
        def __init__(self, infile):
            self.infile = open(infile)

        def __next__(self):
            return self.next()

        def __iter__(self):
            return self

        def read(self, *args, **kwargs):
            # pandas only requires a read method; delegate to next()
            return self.__next__()

        def next(self):
            try:
                line: str = self.infile.readline()
                line = re.sub(r'","', r',', line)  # do some fixing
                return line
            except:
                self.infile.close()
                raise StopIteration

This works in pandas.read_csv():

    df = pd.read_csv(InFile("some.csv"))

To me this looks super complicated and I wonder if there is any better (→ more elegant) solution.
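
One arguably tidier pattern, still streaming line by line, keeps the preprocessing in a generator and exposes a real read(size) method on top of it. This is only a sketch: LineFixingReader is a name I made up, and it assumes pandas' parser accepts a text-mode object whose read() returns str (as it does for io.StringIO):

    import re
    import pandas as pd

    class LineFixingReader:
        """Text-mode file-like wrapper over a line-fixing generator (sketch)."""
        def __init__(self, path):
            self._lines = self._fixed_lines(path)
            self._buffer = ""

        @staticmethod
        def _fixed_lines(path):
            with open(path) as f:
                for line in f:
                    yield re.sub(r'","', r',', line)  # do some fixing

        def __iter__(self):
            return self._lines

        def read(self, size=-1):
            # pull lines until the request can be satisfied (or EOF)
            while size < 0 or len(self._buffer) < size:
                try:
                    self._buffer += next(self._lines)
                except StopIteration:
                    break
            if size < 0:
                size = len(self._buffer)
            out, self._buffer = self._buffer[:size], self._buffer[size:]
            return out

    df = pd.read_csv(LineFixingReader("some.csv"))

The buffering in read() matters because the parser may request chunks larger than a single line.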

MerlinDE

Here's a solution that will work for smaller CSV files. All lines are first read into memory, processed, and concatenated. This will probably perform badly for larger files.

    import re
    from io import StringIO
    import pandas as pd

    with open('file.csv') as file:
        lines = [re.sub(r'","', r',', line) for line in file]

    # each line already ends with '\n', so join without a separator
    df = pd.read_csv(StringIO(''.join(lines)))
Andrey Portnoy
  • Thank you for your suggestion. Alas, I'm looking for an approach that works on huge files, too. The idea is being able to build a complete processing pipeline, streaming data from a source through several processors to its final destination. – MerlinDE Sep 04 '18 at 05:59