I have a large piece of data that is problematic to load entirely into memory, so I decided to read it row by row: pick the desired data, make the transformations, etc., then clear the variables and move on to the next row.
This works fine when I use csv.reader:
import csv

source_file = open(path_to_source_file, encoding="utf8", newline='')
spamreader = csv.reader(source_file, delimiter=";", quotechar='"')
for id, row in enumerate(spamreader):
    variable1 = row[0]
    variable2 = row[1]
    variable3 = row[2]
    variable4 = row[3]
    ...
    variable25 = row[26]
    ...
    if id % 1000000 == 0: print(f"Number of processed rows: {id}")
However, I am afraid that in the future some columns may be added/removed/swapped, and addressing them by row[id] is not a "safe" option.
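For reference, I know the csv module can also address columns by header name via csv.DictReader; a minimal sketch of what I mean (the column names below are placeholders, since the real header isn't shown):

import csv

with open(path_to_source_file, encoding="utf8", newline='') as source_file:
    spamreader = csv.DictReader(source_file, delimiter=";", quotechar='"')
    for id, row in enumerate(spamreader):
        # rows come back as dicts keyed by the header names
        variable1 = row["column0"]   # placeholder column names
        variable2 = row["column1"]
        if id % 1000000 == 0:
            print(f"Number of processed rows: {id}")

Still, I was wondering whether I could load the data with pandas instead, because a DataFrame also allows addressing columns by name: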
for id, row in df.iterrows():
    variable1 = row['column0']
    variable2 = row['column1']
    variable3 = row['column2']
    variable4 = row['column3']
    ...
    variable25 = row['column25']
    ...
    if id % 1000000 == 0: print(f"Number of processed rows: {id}")
However, I wasn't able to make it work at a comparable speed.
I tried using chunksize=1, but even a bare pass took a very long time to execute:
import pandas as pd

with pd.read_csv(path_to_source_file, sep=";", quotechar='"', chunksize=1) as reader:
    for chunk in reader:
        pass
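A rough way to compare the bare loops (the path is a placeholder, and the timings obviously depend on the file) would be something like:

import csv
import time

import pandas as pd

path_to_source_file = "data.csv"  # placeholder path

# time the bare csv.reader loop
start = time.perf_counter()
with open(path_to_source_file, encoding="utf8", newline='') as f:
    for row in csv.reader(f, delimiter=";", quotechar='"'):
        pass
print(f"csv.reader: {time.perf_counter() - start:.1f} s")

# time the pandas chunksize=1 loop
start = time.perf_counter()
with pd.read_csv(path_to_source_file, sep=";", quotechar='"', chunksize=1) as reader:
    for chunk in reader:
        pass
print(f"pandas chunksize=1: {time.perf_counter() - start:.1f} s")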
I also tried nested loops, reading bigger chunks and then iterating over the rows:
with pd.read_csv(path_to_source_file, sep=";", quotechar='"', chunksize=10000) as reader:
    for chunk in reader:
        for id, row in chunk.iterrows():
            pass
but this is also really slow compared to the csv.reader version.
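I guess something like itertuples() might keep column-name access while being faster than iterrows(), but I haven't verified that on my data; just a sketch (placeholder column names again):

with pd.read_csv(path_to_source_file, sep=";", quotechar='"', chunksize=10000) as reader:
    for chunk in reader:
        # itertuples() yields namedtuples, so columns are still accessed by name
        for row in chunk.itertuples(index=False):
            variable1 = row.column0   # placeholder column names
            variable2 = row.column1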
So, to summarize my question: is there any way to load data from a CSV file with pandas row by row at a speed comparable to csv.reader?