Combine two csv-files by key in nested loops only works in the first iteration

Question

I have two csv-Files with a common key which is for file1 in the second column and for file2 in the first column.

I want to write a new csv-file that features columns from both files (not all columns from both files just the ones I need).

This is what I tried:

import numpy as np
import csv
import os

output_dir = "<output_dir>"
file1 = "f1.csv"
file2 = "f2.csv"

path1 = os.path.join(output_dir, file1)
path2 = os.path.join(output_dir, file2)

file3 = "fd_labels_processed.csv"
output_file = os.path.join(output_dir, file3)

with open(path1, 'r') as f1, open(path2, 'r') as f2, \
        open(output_file, 'w+', newline='') as f3:
    f1_reader = csv.reader(f1, delimiter=',')
    f2_reader = csv.reader(f2, delimiter=',')
    header_f1 = []
    header_f1 = next(f1_reader) # reading the next line after header of csv file.

    header_f2 = []
    header_f2 = next(f2_reader) # reading the next line after header of csv file.
    count = 0
    writer = csv.writer(f3, delimiter=',') #preparing the file f3 for writing the file.

    writer.writerow(["ATTRIBUTE_1", "ATTRIBUTE_2", "ATTRIBUTE_3", "ATTRIBUTE_4", "ATTRIBUTE_5", "ATTRIBUTE_6", "ATTRIBUTE_7", "ATTRIBUTE_8", "ATTRIBUTE_9"])

    for row_f1 in f1_reader: # looking each row from csv file f1
        for row_f2 in f2_reader: # looking for each row from csv file f2
            if row_f1[1] == row_f2[0]: #checking the condition; worse case Time complexity o(n2)
                if (row_f1[3] == row_f2[2] and row_f1[4] == row_f2[3] and row_f1[5] == row_f2[4] and row_f1[6] == row_f2[5]):
                    print(count)
                    writer = csv.writer(f3)
                    row_f2.append(row_f1[9])
                    row_f2.append(row_f1[11])
                    row_f2.append(row_f1[12])
                    writer.writerows([row_f2])
                    count +=1

But for some reason it takes the first line (after the header) from f1_reader, iterates through all lines from f2_reader once and then just stops after that. So the outer for-loop just stops after the first line.

Does this answer your question? [Why can't I iterate twice over the same iterator? How can I "reset" the iterator or reuse the data?](https://stackoverflow.com/questions/25336726/why-cant-i-iterate-twice-over-the-same-iterator-how-can-i-reset-the-iterator) — mkrieger1, Jul 01 '23 at 20:04
`f2_reader` is exhausted after the first line of `f1_reader`. — mkrieger1, Jul 01 '23 at 20:04
In case you want to iterate over the two files in parallel, not in nested loops, use `zip`. — mkrieger1, Jul 01 '23 at 20:05

Zach Young · Answer 1 · 2023-07-03T20:33:23.430

To join two CSVs on a common column you first need to read one CSV and store its rows by the common column, then read the second CSV and update the rows from the first read.

I mocked up these two CSVs:

input1.csv
==========
| ID | Name | Day |
|----|------|-----|
| 1  | Foo  | Sun |
| 2  | Bar  | Mon |
| 3  | Baz  | Wed |

input2.csv
==========
| ID | Week |
|----|------|
| 3  | 15   |
| 2  | 10   |
| 1  | 5    |

If I want a final, merged CSV (minus the Day column), like:

| ID | Name | Week |
|----|------|------|
| 1  | Foo  | 5    |
| 2  | Bar  | 10   |
| 3  | Baz  | 15   |

Then I need some code like the following:

import csv

rows: dict[str, dict[str, str]] = {}

with open("input1.csv", newline="", encoding="utf-8") as f:
    reader = csv.DictReader(f)

    for row in reader:
        id_ = row["ID"]
        rows[id_] = row


with open("input2.csv", newline="", encoding="utf-8") as f:
    reader = csv.DictReader(f)

    for row in reader:
        id_ = row["ID"]
        if id_ not in rows:
            continue
        rows[id_].update(row)

with open("output.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["ID", "Name", "Week"], extrasaction="ignore")
    writer.writeheader()
    writer.writerows(rows.values())

The data structure I store all information in is a dict with its keys as the common column values (ID, in my example) and the dicts values are themselves dicts that represent a complete row.

After reading input1.csv, rows looks like:

{
    "1": {"ID": "1", "Name": "Foo", "Day": "Sun"},
    "2": {"ID": "2", "Name": "Bar", "Day": "Mon"},
    "3": {"ID": "3", "Name": "Baz", "Day": "Wed"},
}

After reading input2.csv and updating the the previous rows with new values, rows looks like:

{
    "1": {"ID": "1", "Name": "Foo", "Day": "Sun", "Week": "5"},
    "2": {"ID": "2", "Name": "Bar", "Day": "Mon", "Week": "10"},
    "3": {"ID": "3", "Name": "Baz", "Day": "Wed", "Week": "15"},
}

Finally, I use the csv module's DictWriter to write back the values of the rows var. I tell the DictWriter I only want the fieldnames ID, Name, and Week, and then I tell it to ignore any extra keys/fieldnames it might find in the rows as it writes them out (i.e., "ignore the Day column"):

writer = csv.DictWriter(f, fieldnames=["ID", "Name", "Week"], extrasaction="ignore")

Which CSV you decide to read first might depend if one CSV has more rows/keys than the other:

Do you want just the intersection of both CSVs? Then read the smaller CSV first and leave the "if ID not in rows continue" logic.

Do you want all rows, with some rows having empty columns if they didn't exist in the other CSV? Like:

input2.csv
==========
| ID | Week |
|----|------|
| 4  | 20   |
| 3  | 15   |
| 2  | 10   |
| 1  | 5    |

Then:

...
with open("input2.csv", newline="", encoding="utf-8") as f:
reader = csv.DictReader(f)

for row in reader:
    id_ = row["ID"]
    if id_ not in rows:
        rows[id_] = row
    else:
        rows[id_].update(row)
...

And now output looks like:

| ID  | Name | Week |
| --- | ---- | ---- |
| 1   | Foo  | 5    |
| 2   | Bar  | 10   |
| 3   | Baz  | 15   |
| 4   |      | 20   |

If you don't need to write the program yourself and can use something that's already written and tested, I recommend GoCSV's join command. For your case file1 will be the left file and file2 will be the right file, and the command would look like:

gocsv join -c 2,1 file1 file2

to peform the default inner join on the second column from the left file and the first column of the right file.

Combine two csv-files by key in nested loops only works in the first iteration

1 Answers1