-1

I have the following sample set of rows. One of the rows contains values taken from say file1. The other row having the same first cell value has its values extracted from file2. I had generated two output files separately and merged the two files into a single csv file. Now the problem lies in merging these matching rows. I need to merge them into a single row such that the values from both the rows are taken into consideration. The example is shown below.

After merging file1 and file2 into a single output.csv file:

Name, Age, Height, Place

ABC, 18, -, XYZ

DEF, -, 170, LMN

ABC, -, 165, -

Desired Output:

ABC, 18, 165, XYZ (Fill in missing values after merging)

I would like to approach this using plain python (not pandas or any similar libraries.)

  • What is the issue, exactly? What is your question? – AMC Jan 26 '20 at 20:56
  • 1
    @stovfl Nope. I have already merged the two files into a single file. I need to merge rows containing same first cell values and ensure that the single row contains information from both the rows (if missing). – Karthik Nayak Jan 27 '20 at 06:07
  • @AMC Merging two rows containing same first cell value into a single row so it contains information from both the rows. – Karthik Nayak Jan 27 '20 at 06:08

2 Answers2

0

If memory allows it, you should read in your second CSV file and import it into a hash of tuples, i.e. you would have, in your example,

{
    ...
    ABC: ( 'ABC', '-', 165, '-' )
    ...
}

Then when reading the first CSV file, you would check whether the desired key (here ABC) existed in the dictionary; if it did, you can walk together the two lists, taking whichever value is not '-' from either.

LSerni
  • 55,617
  • 10
  • 65
  • 107
0

This is my first Python progam (I work exclusively in a Windows environment, so PowerShell is my scripting language of choice), but I think this script meets your needs.

It uses a hash table to manage the input csv file data, i.e., it is used to dedup name values and consolidate non-hyphen data values

c:\temp\input.csv:

Name, Age, Height, Place
ABC, 18, -, XYZ
DEF, -, 170, LMN
ABC, -, 165, -

Program output to console

enter image description here

The hash table key is the "Name" values. The hash table values are ClsPerson objects each of which hold name, age, height, and place. The ClsPerson class update method ignores input values that equal a hyphen (this is the placeholder value you used in your sample csv file) unless that is the one-and only value provided.

In the end, the hash table values (ClsPerson objects) include the original csv file headings and the consolidated non-hyphen values for each of the 4 fields (if a non-hyphen value was provided). This code assumes that a Name value will always be available (and not be a hyphen). I'll leave it up to you to code a csv file writer to output the collection of ClsPerson objects to a new output csv file

csvNameIndex = 0
csvAgeIndex = 1
csvHeightIndex = 2
csvPlaceIndex = 3

myHashTable = {}

class ClsPerson:

    def __init__(self, name, age, height, place):
        self.name = name.strip()
        self.age = age.strip()
        self.height = height.strip()
        self.place = place.strip()

    def update(self, age, height, place):

        csvEmptyFieldChar = '-'

        if age.strip() != csvEmptyFieldChar:
            self.age = age.strip()

        if height.strip() != csvEmptyFieldChar:
            self.height = height.strip()

        if place.strip() != csvEmptyFieldChar:
            self.place = place.strip()

import csv  
with open('c:\\temp\\input.csv', newline='') as csvfile:

    spamreader = csv.reader(csvfile, delimiter=',')

    for row in spamreader:

        if row[csvNameIndex] in myHashTable:
            htRowPerson = (myHashTable.get(row[csvNameIndex]))['Person']
            htRowPerson.update(row[csvAgeIndex], row[csvHeightIndex], row[csvPlaceIndex])
        else:
            objPerson = ClsPerson(row[csvNameIndex], row[csvAgeIndex], row[csvHeightIndex], row[csvPlaceIndex])
            dictRow = {'Name': row[0], 'Person': objPerson}
            myHashTable[row[csvNameIndex]] = dictRow

for name in myHashTable:
    htRowPerson = (myHashTable.get(name))['Person']
    print (htRowPerson.name, ' ', htRowPerson.age, ' ', htRowPerson.height, ' ', htRowPerson.place)
VA systems engineer
  • 2,856
  • 2
  • 14
  • 38
  • I modified the snippet accordingly for a much larger csv file containing many matching rows. Thanks. Marking as solved. EDIT: I should've mentioned the version of python I'm using. The newline argument for open returns an error in Python 2.6. I removed it and it does not affect the working of the code. – Karthik Nayak Jan 27 '20 at 10:14
  • Oh, yeah - sorry - I think I was using Python 3. Glad it worked out for you – VA systems engineer Jan 27 '20 at 13:10