
I am trying to read three CSV files and want to write the output to a single CSV file, using the first column as an ID that is not repeated, since it is common to all the input files. I have written some code, but it gives errors, and I am not sure it is the best way to perform this task.

Code:

#! /usr/bin/python
import csv
from collections import defaultdict

result = defaultdict(dict)
fieldnames = set(["ID"])  # a set, so .add() below works

for csvfile in ("FR1.1.csv", "FR2.0.csv", "FR2.5.csv"):
    with open(csvfile, 'rb') as infile:
        reader = csv.DictReader(infile)
        for row in reader:
            id = row.pop("ID")
            for key in row:
                fieldnames.add(key)
                result[id][key] = row[key]

with open("out.csv", "w") as outfile:
    writer = csv.DictWriter(outfile, sorted(fieldnames))
    writer.writeheader()
    for item in result:
        result[item]["ID"] = item
        writer.writerow(result[item])

The input CSV files are listed below:

FR1.1.csv-->

TEST_Id , RELEASE , COMPILE_STATUS , EXECUTION_STATUS
FC/B_019.config , FR1.1 , COMPILE_PASSED , EXECUTION_PASSED
FC/B_020.config , FR1.1 , COMPILE_PASSED , EXECUTION_PASSED
FC/B_021.config , FR1.1 , COMPILE_FAILED , EXECUTION_FAILED

FR2.0.csv-->

TEST_Id , RELEASE , COMPILE_STATUS , EXECUTION_STATUS
FC/B_019.config , FR2.0 , COMPILE_PASSED , EXECUTION_PASSED
FC/B_020.config , FR2.0 , COMPILE_PASSED , EXECUTION_PASSED
FC/B_021.config , FR2.0 , COMPILE_FAILED , EXECUTION_FAILED

FR2.5.csv-->

TEST_Id , RELEASE , COMPILE_STATUS , EXECUTION_STATUS
FC/B_019.config , FR2.5 , COMPILE_PASSED , EXECUTION_PASSED
FC/B_020.config , FR2.5 , COMPILE_PASSED , EXECUTION_PASSED
FC/B_021.config , FR2.5 , COMPILE_FAILED , EXECUTION_FAILED

out.csv (required)-->

TEST_Id , RELEASE , COMPILE_STATUS , EXECUTION_STATUS , RELEASE , COMPILE_STATUS , EXECUTION_STATUS , RELEASE , COMPILE_STATUS , EXECUTION_STATUS
FC/B_019.config , FR1.1 , COMPILE_PASSED , EXECUTION_PASSED, FR2.0 , COMPILE_PASSED , EXECUTION_PASSED, FR2.5 , COMPILE_PASSED , EXECUTION_PASSED
FC/B_020.config , FR1.1 , COMPILE_PASSED , EXECUTION_PASSED, FR2.0 , COMPILE_PASSED , EXECUTION_PASSED, FR2.5 , COMPILE_PASSED , EXECUTION_PASSED
FC/B_021.config , FR1.1 , COMPILE_FAILED , EXECUTION_FAILED, FR2.0 , COMPILE_PASSED , EXECUTION_PASSED, FR2.5 , COMPILE_PASSED , EXECUTION_PASSED

Thanks in advance for posting the best method to achieve the above result.

  • And what errors are you seeing? Please do include the full traceback. – Martijn Pieters Oct 11 '13 at 06:53
  • And the indentation for your code sample is incorrect; presumably the second `with` statement is not indented that far? – Martijn Pieters Oct 11 '13 at 06:53
  • Sorting the fieldnames is probably not what you wanted to do; and you should only add fieldnames for the *first* row of each CSV file. – Martijn Pieters Oct 11 '13 at 06:54
  • @MartijnPieters Kindly requesting you to check my updated code and requirement and suggest a way to achieve it. – Ram More Oct 23 '13 at 09:21
  • Instead of expanding your question to cover new problems, ask a *new question* instead. That way far more people get to see it too. I've reverted your edit; this specific question has already been answered. – Martijn Pieters Oct 23 '13 at 09:24
  • Could you please not vandalize posts like that? We can try and anonymize information here, but there is *hardly* anything useful or confidential here, is there? – Martijn Pieters Oct 28 '13 at 11:29

1 Answer

If you want to just join each CSV row based on ID, then don't use a DictReader. Dictionary keys must be unique, but you are producing rows with multiple EXECUTION_STATUS and RELEASE, etc. columns.
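You can see the collapse in an interactive session; when keys repeat, only the last value survives:

>>> row = dict([('RELEASE', 'FR1.1'), ('RELEASE', 'FR2.0'), ('RELEASE', 'FR2.5')])
>>> row
{'RELEASE': 'FR2.5'}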

Moreover, how will you handle ids for which one or two of the input CSV files have no row?

Use regular readers and store each row keyed by filename. Make fieldnames a list as well:

import csv
from collections import defaultdict

result = defaultdict(dict)
filenames = ("FR1.1.csv", "FR2.0.csv", "FR2.5.csv")
lengths = {}
fieldnames = ["TEST_ID"]

for csvfile in filenames:
    with open(csvfile, 'rb') as infile:
        reader = csv.reader(infile)
        headers = next(reader, [])  # read first line, headers
        fieldnames.extend(headers[1:])  # all but the first column name
        lengths[csvfile] = len(headers) - 1  # keep track of how many items to backfill
        for row in reader:
            result[row[0]][csvfile] = row[1:]  # all but the first column

with open("out.csv", "wb") as outfile:
    writer = csv.writer(outfile)
    writer.writerow(fieldnames)
    for id_ in sorted(result):
        row = [id_]
        data = result[id_]
        for filename in filenames:
            row.extend(data.get(filename) or [''] * lengths[filename])
        writer.writerow(row)

This code stores rows per filename, so that you can later build a whole row from each file but still fill in blanks if the row was missing in that file.
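The `data.get(filename) or [''] * lengths[filename]` expression is what does the blank-filling; when a file had no row for an id, the lookup returns None and the `or` substitutes the right number of empty columns:

>>> lengths = {'FR2.0.csv': 3}
>>> data = {}  # this id was missing from FR2.0.csv
>>> data.get('FR2.0.csv') or [''] * lengths['FR2.0.csv']
['', '', '']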

The alternative would be to make column names unique by appending a number or filename to each; that way your DictReader approach could work too.
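A minimal sketch of that alternative (not from the original answer; the `RELEASE_FR1.1`-style suffixed column names are just one possible renaming scheme):

import csv
import os
from collections import defaultdict

result = defaultdict(dict)
filenames = ("FR1.1.csv", "FR2.0.csv", "FR2.5.csv")
fieldnames = ["TEST_ID"]

for csvfile in filenames:
    suffix = os.path.splitext(csvfile)[0]  # "FR1.1", "FR2.0", ...
    with open(csvfile, 'rb') as infile:
        headers = next(csv.reader(infile), [])  # consume the header line
        # rename all but the first column so the keys stay unique per file
        renamed = ['%s_%s' % (h.strip(), suffix) for h in headers[1:]]
        fieldnames.extend(renamed)
        reader = csv.DictReader(infile, fieldnames=['TEST_ID'] + renamed)
        for row in reader:
            result[row.pop('TEST_ID')].update(row)

with open("out.csv", "wb") as outfile:
    # restval='' (the DictWriter default) backfills columns missing for an id
    writer = csv.DictWriter(outfile, fieldnames)
    writer.writeheader()
    for id_ in sorted(result):
        writer.writerow(dict(result[id_], TEST_ID=id_))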

The above gives:

TEST_ID, RELEASE , COMPILE_STATUS , EXECUTION_STATUS, RELEASE , COMPILE_STATUS , EXECUTION_STATUS, RELEASE , COMPILE_STATUS , EXECUTION_STATUS
FC/B_019.config , FR1.1 , COMPILE_PASSED , EXECUTION_PASSED, FR2.0 , COMPILE_PASSED , EXECUTION_PASSED, FR2.5 , COMPILE_PASSED , EXECUTION_PASSED
FC/B_020.config , FR1.1 , COMPILE_PASSED , EXECUTION_PASSED, FR2.0 , COMPILE_PASSED , EXECUTION_PASSED, FR2.5 , COMPILE_PASSED , EXECUTION_PASSED
FC/B_021.config , FR1.1 , COMPILE_FAILED , EXECUTION_FAILED, FR2.0 , COMPILE_FAILED , EXECUTION_FAILED, FR2.5 , COMPILE_FAILED , EXECUTION_FAILED

If you need to base your order on one of the input files, then omit that input file from the first reading loop; instead, read that file while writing the output, and use its first column to look up the other files' data:

import csv
from collections import defaultdict

result = defaultdict(dict)
filenames = ("FR2.0.csv", "FR2.5.csv")
lengths = {}
fieldnames = []

for csvfile in filenames:
    with open(csvfile, 'rb') as infile:
        reader = csv.reader(infile)
        headers = next(reader, [])  # read first line, headers
        fieldnames.extend(headers[1:])  # all but the first column name
        lengths[csvfile] = len(headers) - 1  # keep track of how many items to backfill
        for row in reader:
            result[row[0]][csvfile] = row[1:]  # all but the first column

with open("FR1.1.csv", "rb") as infile, open("out.csv", "wb") as outfile:
    reader = csv.reader(infile)
    headers = next(reader, [])  # read first line, headers

    writer = csv.writer(outfile)
    writer.writerow(headers + fieldnames)

    for row in sorted(reader):
        data = result[row[0]]
        for filename in filenames:
            row.extend(data.get(filename) or [''] * lengths[filename])
        writer.writerow(row)

This does mean that any extra TEST_ID values in the other two files are ignored.

If you wanted to preserve all TEST_IDs then I'd use collections.OrderedDict(); new TEST_IDs found in the later files will be tacked onto the end:

import csv
from collections import OrderedDict

result = OrderedDict()
filenames = ("FR1.1.csv", "FR2.0.csv", "FR2.5.csv")
lengths = {}
fieldnames = ["TEST_ID"]

for csvfile in filenames:
    with open(csvfile, 'rb') as infile:
        reader = csv.reader(infile)
        headers = next(reader, [])  # read first line, headers
        fieldnames.extend(headers[1:])  # all but the first column name
        lengths[csvfile] = len(headers) - 1  # keep track of how many items to backfill
        for row in reader:
            if row[0] not in result:
                result[row[0]] = {}
            result[row[0]][csvfile] = row[1:]  # all but the first column

with open("out.csv", "wb") as outfile:
    writer = csv.writer(outfile)
    writer.writerow(fieldnames)
    for id_ in result:
        row = [id_]
        data = result[id_]
        for filename in filenames:
            row.extend(data.get(filename) or [''] * lengths[filename])
        writer.writerow(row)

The OrderedDict maintains entries in insertion order; so FR1.1.csv sets the order for all keys, but any FR2.0.csv ids not found in the first file are appended to the dictionary at the end, and so on.
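A quick illustration of that insertion-order behaviour:

>>> from collections import OrderedDict
>>> d = OrderedDict()
>>> d['B_019'] = 1; d['B_020'] = 2  # ids from the first file set the order
>>> d['B_005'] = 3                  # a new id from a later file is appended
>>> list(d)
['B_019', 'B_020', 'B_005']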

For Python versions < 2.7, either install a backport (see OrderedDict for older versions of python) or track the ID order manually with:

import csv
from collections import defaultdict

result = defaultdict(dict)
filenames = ("FR1.1.csv", "FR2.0.csv", "FR2.5.csv")
lengths = {}
fieldnames = ["TEST_ID"]
ids, seen = [], set()

for csvfile in filenames:
    with open(csvfile, 'rb') as infile:
        reader = csv.reader(infile)
        headers = next(reader, [])  # read first line, headers
        fieldnames.extend(headers[1:])  # all but the first column name
        lengths[csvfile] = len(headers) - 1  # keep track of how many items to backfill
        for row in reader:
            id_ = row[0]
            # track ordering
            if id_ not in seen:
                seen.add(id_)
                ids.append(id_)
            result[id_][csvfile] = row[1:]  # all but the first column

with open("out.csv", "wb") as outfile:
    writer = csv.writer(outfile)
    writer.writerow(fieldnames)
    for id_ in ids:
        row = [id_]
        data = result[id_]
        for filename in filenames:
            row.extend(data.get(filename) or [''] * lengths[filename])
        writer.writerow(row)
Martijn Pieters
  • @Pieters I have tried your code; the header row is printed very much the way I want, only the first value is wrong: it should be TEST_ID. Also, not all row values are printed from all files, only one row from each file, and TEST_ID is not printed in the out.csv file. Here is the output of the above code: .ID, RELEASE , COMPILE_STATUS , EXECUTION_STATUS, RELEASE , COMPILE_STATUS , EXECUTION_STATUS, RELEASE , COMPILE_STATUS , EXECUTION_STATUS , FR1.1 , COMPILE_FAILED , EXECUTION_FAILED, FR2.0 , COMPILE_FAILED , EXECUTION_FAILED, FR2.5 , COMPILE_FAILED , EXECUTION_FAILED – Ram More Oct 11 '13 at 08:57
  • @Pieters your edited code gives the output correctly. The only issue is that it prints ID as the first column header; it should be TEST_ID. Is it not possible to print TEST_ID? – Ram More Oct 11 '13 at 09:10
  • Also, one problem is that the out file's ID sequence does not match the infile's. The infile sequence is B_019, B_020 and B_021, but the out file prints B_021, B_019 and B_020: the last row is printed first, the first row is in the middle, and the second row is printed last. Can we keep the sequence the input files have? Thanks!!! – Ram More Oct 11 '13 at 09:25
  • @RamMore: My code writes `ID` because your original code did. Simply replace `fieldnames = ['ID']` with the desired column name. – Martijn Pieters Oct 11 '13 at 09:26
  • @RamMore: What should the order be? Sorted on id, or based on the order of one of the CSV input files? Note that the latter is tricky if ids are missing from one of the files; those could end up at the end of the file in that case. – Martijn Pieters Oct 11 '13 at 09:27
  • @RamMore: updated to output the file sorted on `TEST_ID` now. – Martijn Pieters Oct 11 '13 at 09:29
  • @Pieters Thanks for your patient reply. The sorting order should be based on the first input CSV file (in this case FR1.1.csv). Can you please let me know how to add that type of sorting? – Ram More Oct 11 '13 at 10:07
  • @RamMore: and what happens to ids *not* present in the first input file? Your options are to ignore these or to write them at the end. – Martijn Pieters Oct 11 '13 at 10:07
  • @Pieters IDs will be the same in all files most of the time, so we will sort based on the list present in the first file. If the second file has more entries than the first, the first file's columns in out.csv should be left blank and the second file's values returned in the proper order; if the first file doesn't have an entry, it should be blank in out.csv and then all values printed per the second file's entries, in the sorted order present in that file. If all these combinations are not possible, keep at least the first file's sorting; hopefully the entries in all files will be the same all the time. – Ram More Oct 11 '13 at 10:29
  • Even the above new code gives me an error: `with open("FR1.1.csv", "rb") as infile, open("sortout.csv", "wb") as outfile:` raises `SyntaxError: invalid syntax`. I am using Python 2.6. – Ram More Oct 11 '13 at 10:41
  • @RamMore: then nest the `with` statements (see the sketch after these comments). Indent the rest of the code to match. `with open(..) as ...:` -> indent -> `with open(..) as ...:` -> indent. – Martijn Pieters Oct 11 '13 at 10:44
  • @Pieters for the new code I am getting the error below: `from collections import OrderedDict` raises `ImportError: cannot import name OrderedDict` – Ram More Oct 11 '13 at 10:50
  • @RamMore: [OrderedDict for older versions of python](http://stackoverflow.com/q/1617078) – Martijn Pieters Oct 11 '13 at 10:54
  • @Pieters I am using Red Hat Linux and get `bash: pip: command not found`. I will google for other ways, if there are any; if you know one, please tell me. Thanks a lot for your help. – Ram More Oct 11 '13 at 11:00
  • At some point I run out of time to retool this; retooled the ordereddict version to track ordering manually instead. – Martijn Pieters Oct 11 '13 at 11:07
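For Python 2.6, where the multi-file `with` syntax from the second example is unavailable, a nested form along these lines should work (a sketch of the nesting suggested in the comments, not part of the original answer):

with open("FR1.1.csv", "rb") as infile:
    with open("out.csv", "wb") as outfile:
        reader = csv.reader(infile)
        headers = next(reader, [])  # read first line, headers

        writer = csv.writer(outfile)
        writer.writerow(headers + fieldnames)

        for row in sorted(reader):
            data = result[row[0]]
            for filename in filenames:
                row.extend(data.get(filename) or [''] * lengths[filename])
            writer.writerow(row)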