1

I have two csv-Files with a common key which is for file1 in the second column and for file2 in the first column.

I want to write a new csv-file that features columns from both files (not all columns from both files just the ones I need).

This is what I tried:

import numpy as np
import csv
import os

output_dir = "<output_dir>"
file1 = "f1.csv"
file2 = "f2.csv"

path1 = os.path.join(output_dir, file1)
path2 = os.path.join(output_dir, file2)

file3 = "fd_labels_processed.csv"
output_file = os.path.join(output_dir, file3)

with open(path1, 'r') as f1, open(path2, 'r') as f2, \
        open(output_file, 'w+', newline='') as f3:
    f1_reader = csv.reader(f1, delimiter=',')
    f2_reader = csv.reader(f2, delimiter=',')
    header_f1 = []
    header_f1 = next(f1_reader) # reading the next line after header of csv file.

    header_f2 = []
    header_f2 = next(f2_reader) # reading the next line after header of csv file.
    count = 0
    writer = csv.writer(f3, delimiter=',') #preparing the file f3 for writing the file.

    writer.writerow(["ATTRIBUTE_1", "ATTRIBUTE_2", "ATTRIBUTE_3", "ATTRIBUTE_4", "ATTRIBUTE_5", "ATTRIBUTE_6", "ATTRIBUTE_7", "ATTRIBUTE_8", "ATTRIBUTE_9"])

    for row_f1 in f1_reader: # looking each row from csv file f1
        for row_f2 in f2_reader: # looking for each row from csv file f2
            if row_f1[1] == row_f2[0]: #checking the condition; worse case Time complexity o(n2)
                if (row_f1[3] == row_f2[2] and row_f1[4] == row_f2[3] and row_f1[5] == row_f2[4] and row_f1[6] == row_f2[5]):
                    print(count)
                    writer = csv.writer(f3)
                    row_f2.append(row_f1[9])
                    row_f2.append(row_f1[11])
                    row_f2.append(row_f1[12])
                    writer.writerows([row_f2])
                    count +=1

But for some reason it takes the first line (after the header) from f1_reader, iterates through all lines from f2_reader once and then just stops after that. So the outer for-loop just stops after the first line.

mkrieger1
  • 19,194
  • 5
  • 54
  • 65
ZsYoR
  • 15
  • 3
  • 2
    Does this answer your question? [Why can't I iterate twice over the same iterator? How can I "reset" the iterator or reuse the data?](https://stackoverflow.com/questions/25336726/why-cant-i-iterate-twice-over-the-same-iterator-how-can-i-reset-the-iterator) – mkrieger1 Jul 01 '23 at 20:04
  • `f2_reader` is exhausted after the first line of `f1_reader`. – mkrieger1 Jul 01 '23 at 20:04
  • In case you want to iterate over the two files in parallel, not in nested loops, use `zip`. – mkrieger1 Jul 01 '23 at 20:05

1 Answers1

1

To join two CSVs on a common column you first need to read one CSV and store its rows by the common column, then read the second CSV and update the rows from the first read.

I mocked up these two CSVs:

input1.csv
==========
| ID | Name | Day |
|----|------|-----|
| 1  | Foo  | Sun |
| 2  | Bar  | Mon |
| 3  | Baz  | Wed |

input2.csv
==========
| ID | Week |
|----|------|
| 3  | 15   |
| 2  | 10   |
| 1  | 5    |

If I want a final, merged CSV (minus the Day column), like:

| ID | Name | Week |
|----|------|------|
| 1  | Foo  | 5    |
| 2  | Bar  | 10   |
| 3  | Baz  | 15   |

Then I need some code like the following:

import csv

rows: dict[str, dict[str, str]] = {}

with open("input1.csv", newline="", encoding="utf-8") as f:
    reader = csv.DictReader(f)

    for row in reader:
        id_ = row["ID"]
        rows[id_] = row


with open("input2.csv", newline="", encoding="utf-8") as f:
    reader = csv.DictReader(f)

    for row in reader:
        id_ = row["ID"]
        if id_ not in rows:
            continue
        rows[id_].update(row)

with open("output.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["ID", "Name", "Week"], extrasaction="ignore")
    writer.writeheader()
    writer.writerows(rows.values())

The data structure I store all information in is a dict with its keys as the common column values (ID, in my example) and the dicts values are themselves dicts that represent a complete row.

After reading input1.csv, rows looks like:

{
    "1": {"ID": "1", "Name": "Foo", "Day": "Sun"},
    "2": {"ID": "2", "Name": "Bar", "Day": "Mon"},
    "3": {"ID": "3", "Name": "Baz", "Day": "Wed"},
}

After reading input2.csv and updating the the previous rows with new values, rows looks like:

{
    "1": {"ID": "1", "Name": "Foo", "Day": "Sun", "Week": "5"},
    "2": {"ID": "2", "Name": "Bar", "Day": "Mon", "Week": "10"},
    "3": {"ID": "3", "Name": "Baz", "Day": "Wed", "Week": "15"},
}

Finally, I use the csv module's DictWriter to write back the values of the rows var. I tell the DictWriter I only want the fieldnames ID, Name, and Week, and then I tell it to ignore any extra keys/fieldnames it might find in the rows as it writes them out (i.e., "ignore the Day column"):

writer = csv.DictWriter(f, fieldnames=["ID", "Name", "Week"], extrasaction="ignore")

Which CSV you decide to read first might depend if one CSV has more rows/keys than the other:

  • Do you want just the intersection of both CSVs? Then read the smaller CSV first and leave the "if ID not in rows continue" logic.

  • Do you want all rows, with some rows having empty columns if they didn't exist in the other CSV? Like:

    input2.csv
    ==========
    | ID | Week |
    |----|------|
    | 4  | 20   |
    | 3  | 15   |
    | 2  | 10   |
    | 1  | 5    |
    

    Then:

    ...
    with open("input2.csv", newline="", encoding="utf-8") as f:
    reader = csv.DictReader(f)
    
    for row in reader:
        id_ = row["ID"]
        if id_ not in rows:
            rows[id_] = row
        else:
            rows[id_].update(row)
    ...
    

    And now output looks like:

    | ID  | Name | Week |
    | --- | ---- | ---- |
    | 1   | Foo  | 5    |
    | 2   | Bar  | 10   |
    | 3   | Baz  | 15   |
    | 4   |      | 20   |
    

If you don't need to write the program yourself and can use something that's already written and tested, I recommend GoCSV's join command. For your case file1 will be the left file and file2 will be the right file, and the command would look like:

gocsv join -c 2,1 file1 file2

to peform the default inner join on the second column from the left file and the first column of the right file.

Zach Young
  • 10,137
  • 4
  • 32
  • 53