If you want to just join each CSV row based on ID, then don't use a DictReader. Dictionary keys must be unique, but you would be producing rows with multiple EXECUTION_STATUS, RELEASE, etc. columns. Moreover, how will you handle IDs for which one or two of the input CSV files have no row?
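To see why this matters, here is a minimal sketch (with made-up row values) of what happens when rows sharing column names are merged into one dictionary:
# minimal sketch with invented values: dictionary keys must be unique,
# so same-named columns from different files overwrite each other
row_fr11 = {"RELEASE": "FR1.1", "EXECUTION_STATUS": "EXECUTION_PASSED"}
row_fr20 = {"RELEASE": "FR2.0", "EXECUTION_STATUS": "EXECUTION_FAILED"}

merged = dict(row_fr11)
merged.update(row_fr20)  # the FR2.0 values silently replace the FR1.1 values
print merged             # only one RELEASE and one EXECUTION_STATUS survive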
Use regular readers and store each row keyed by filename. Make fieldnames a list as well:
import csv
from collections import defaultdict

result = defaultdict(dict)
filenames = ("FR1.1.csv", "FR2.0.csv", "FR2.5.csv")
lengths = {}
fieldnames = ["TEST_ID"]

for csvfile in filenames:
    with open(csvfile, 'rb') as infile:
        reader = csv.reader(infile)
        headers = next(reader, [])           # read first line, headers
        fieldnames.extend(headers[1:])       # all but the first column name
        lengths[csvfile] = len(headers) - 1  # keep track of how many items to backfill
        for row in reader:
            result[row[0]][csvfile] = row[1:]  # all but the first column

with open("out.csv", "wb") as outfile:
    writer = csv.writer(outfile)
    writer.writerow(fieldnames)
    for id_ in sorted(result):
        row = [id_]
        data = result[id_]
        for filename in filenames:
            # columns from this file, or blanks if it had no row for this id
            row.extend(data.get(filename) or [''] * lengths[filename])
        writer.writerow(row)
This code stores the row data per filename, so you can later build a whole output row from each file and still fill in blanks wherever a file has no row for that ID.
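To make that concrete, after the reading loop result has roughly the shape below (the FC/B_022.config entry is an invented example of an ID that is missing from FR2.5.csv):
# illustrative only: per-ID, per-file column lists; a missing file key is
# backfilled with lengths[filename] empty strings in the output loop
result = {
    "FC/B_019.config": {
        "FR1.1.csv": ["FR1.1", "COMPILE_PASSED", "EXECUTION_PASSED"],
        "FR2.0.csv": ["FR2.0", "COMPILE_PASSED", "EXECUTION_PASSED"],
        "FR2.5.csv": ["FR2.5", "COMPILE_PASSED", "EXECUTION_PASSED"],
    },
    "FC/B_022.config": {  # hypothetical ID with no row in FR2.5.csv
        "FR1.1.csv": ["FR1.1", "COMPILE_FAILED", "EXECUTION_FAILED"],
        "FR2.0.csv": ["FR2.0", "COMPILE_FAILED", "EXECUTION_FAILED"],
    },
}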
The alternative would be to make the column names unique by appending a number or the filename to each; that way your DictReader approach could work too (a rough sketch of that follows the sample output below).
The above gives:
TEST_ID, RELEASE , COMPILE_STATUS , EXECUTION_STATUS, RELEASE , COMPILE_STATUS , EXECUTION_STATUS, RELEASE , COMPILE_STATUS , EXECUTION_STATUS
FC/B_019.config , FR1.1 , COMPILE_PASSED , EXECUTION_PASSED, FR2.0 , COMPILE_PASSED , EXECUTION_PASSED, FR2.5 , COMPILE_PASSED , EXECUTION_PASSED
FC/B_020.config , FR1.1 , COMPILE_PASSED , EXECUTION_PASSED, FR2.0 , COMPILE_PASSED , EXECUTION_PASSED, FR2.5 , COMPILE_PASSED , EXECUTION_PASSED
FC/B_021.config , FR1.1 , COMPILE_FAILED , EXECUTION_FAILED, FR2.0 , COMPILE_FAILED , EXECUTION_FAILED, FR2.5 , COMPILE_FAILED , EXECUTION_FAILED
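Coming back to the DictReader alternative: a possible sketch of that renaming could look like this. The "name (filename)" naming scheme and the TEST_ID key column are assumptions based on your headers, not a fixed recipe:
import csv

filenames = ("FR1.1.csv", "FR2.0.csv", "FR2.5.csv")
merged = {}
out_fields = ["TEST_ID"]

for csvfile in filenames:
    with open(csvfile, 'rb') as infile:
        reader = csv.DictReader(infile)
        # make the non-ID column names unique per file, e.g. "RELEASE (FR1.1.csv)"
        rename = dict((name, "%s (%s)" % (name, csvfile))
                      for name in reader.fieldnames[1:])
        out_fields.extend(rename[name] for name in reader.fieldnames[1:])
        for row in reader:
            target = merged.setdefault(row["TEST_ID"], {"TEST_ID": row["TEST_ID"]})
            for name, new_name in rename.items():
                target[new_name] = row[name]

with open("out.csv", "wb") as outfile:
    # restval='' backfills any column a given TEST_ID is missing
    writer = csv.DictWriter(outfile, out_fields, restval='')
    writer.writeheader()
    for id_ in sorted(merged):
        writer.writerow(merged[id_])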
If you need to base your output on one of the input files, then omit that input file from the first reading loop; instead, read that file in the loop that writes the output and use its first column to look up the data from the other files:
import csv
from collections import defaultdict

result = defaultdict(dict)
filenames = ("FR2.0.csv", "FR2.5.csv")
lengths = {}
fieldnames = []

for csvfile in filenames:
    with open(csvfile, 'rb') as infile:
        reader = csv.reader(infile)
        headers = next(reader, [])           # read first line, headers
        fieldnames.extend(headers[1:])       # all but the first column name
        lengths[csvfile] = len(headers) - 1  # keep track of how many items to backfill
        for row in reader:
            result[row[0]][csvfile] = row[1:]  # all but the first column

with open("FR1.1.csv", "rb") as infile, open("out.csv", "wb") as outfile:
    reader = csv.reader(infile)
    headers = next(reader, [])  # read first line, headers
    writer = csv.writer(outfile)
    writer.writerow(headers + fieldnames)
    for row in sorted(reader):
        data = result[row[0]]
        for filename in filenames:
            row.extend(data.get(filename) or [''] * lengths[filename])
        writer.writerow(row)
This does mean that any extra TEST_ID values in the other two files are ignored. If you wanted to preserve all TEST_IDs, then I'd use collections.OrderedDict(); new TEST_IDs found in the later files will be tacked onto the end:
import csv
from collections import OrderedDict

result = OrderedDict()
filenames = ("FR1.1.csv", "FR2.0.csv", "FR2.5.csv")
lengths = {}
fieldnames = ["TEST_ID"]

for csvfile in filenames:
    with open(csvfile, 'rb') as infile:
        reader = csv.reader(infile)
        headers = next(reader, [])           # read first line, headers
        fieldnames.extend(headers[1:])       # all but the first column name
        lengths[csvfile] = len(headers) - 1  # keep track of how many items to backfill
        for row in reader:
            if row[0] not in result:
                result[row[0]] = {}
            result[row[0]][csvfile] = row[1:]  # all but the first column

with open("out.csv", "wb") as outfile:
    writer = csv.writer(outfile)
    writer.writerow(fieldnames)
    for id_ in result:
        row = [id_]
        data = result[id_]
        for filename in filenames:
            row.extend(data.get(filename) or [''] * lengths[filename])
        writer.writerow(row)
The OrderedDict maintains entries in insertion order; FR1.1.csv sets the order for all keys, but any FR2.0.csv IDs not found in the first file are appended to the dictionary at the end, and so on.
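A tiny illustration of that ordering behaviour (the FC/B_099.config key is made up for this example):
from collections import OrderedDict

d = OrderedDict()
d["FC/B_019.config"] = {}               # FR1.1.csv sets the initial order
d["FC/B_020.config"] = {}
d["FC/B_019.config"]["FR2.0.csv"] = []  # updating an existing key keeps its position
d["FC/B_099.config"] = {}               # an ID only seen in a later file lands at the end
print list(d)  # ['FC/B_019.config', 'FC/B_020.config', 'FC/B_099.config']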
For Python versions < 2.7, either install a backport (see OrderedDict for older versions of python) or track the ID order manually with:
import csv
from collections import defaultdict

result = defaultdict(dict)
filenames = ("FR1.1.csv", "FR2.0.csv", "FR2.5.csv")
lengths = {}
fieldnames = ["TEST_ID"]
ids, seen = [], set()

for csvfile in filenames:
    with open(csvfile, 'rb') as infile:
        reader = csv.reader(infile)
        headers = next(reader, [])           # read first line, headers
        fieldnames.extend(headers[1:])       # all but the first column name
        lengths[csvfile] = len(headers) - 1  # keep track of how many items to backfill
        for row in reader:
            id_ = row[0]
            # track ordering
            if id_ not in seen:
                seen.add(id_)
                ids.append(id_)
            result[id_][csvfile] = row[1:]  # all but the first column

with open("out.csv", "wb") as outfile:
    writer = csv.writer(outfile)
    writer.writerow(fieldnames)
    for id_ in ids:
        row = [id_]
        data = result[id_]
        for filename in filenames:
            row.extend(data.get(filename) or [''] * lengths[filename])
        writer.writerow(row)
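And if you go with the backport route mentioned above instead of tracking the order manually, the import could look something like this (assuming the ordereddict package from PyPI is what you install; the OrderedDict version above then works unchanged):
try:
    from collections import OrderedDict  # Python 2.7 and up
except ImportError:
    from ordereddict import OrderedDict  # PyPI backport package, assumed installed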