
I hope somebody can help me with this issue.

I have about 20 CSV files (each with its own header row), and each file has hundreds of columns. My problem is merging those files, because a couple of them have extra columns. Is there a way to merge them all into one file, adding the new columns with their related data, without corrupting the rows from the other files?

So far I used the awk terminal command:

awk '(NR == 1) || (FNR > 1)' *.csv > file.csv

to merge the files while removing the headers from all of them except the first one. I got this from my previous question, Merge multiple csv files into one.

But this does not solve the issue with the extra column.
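To illustrate with tiny made-up files (`demo1.csv` and `demo2.csv` are just for the demo, not my real data):

```shell
# Two tiny demo files; demo2.csv has an extra third column
printf 'h1,h2\na,b\n' > demo1.csv
printf 'h1,h2,h3\nc,d,e\n' > demo2.csv

# Keep the header only from the first file (NR==1), drop repeated headers (FNR>1)
awk '(NR == 1) || (FNR > 1)' demo1.csv demo2.csv
# prints:
# h1,h2
# a,b
# c,d,e   <- three fields under a two-column header
```

So the concatenation itself works, but the merged file ends up with a two-column header over a three-column row, which is exactly the corruption I want to avoid.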

EDIT:

Here are some of the CSV files in plain text, with their headers.

file1

"@timestamp","@version","_id","_index","_type","ad.(fydibohf23spdlt)/cn","ad.</o","ad.EventRecordID","ad.InitiatorID","ad.InitiatorType","ad.Opcode","ad.ProcessID","ad.TargetSid","ad.ThreadID","ad.Version","ad.agentZoneName","ad.analyzedBy","ad.command","ad.completed","ad.customerName","ad.databaseTable","ad.description","ad.destinationHosts","ad.destinationZoneName","ad.deviceZoneName","ad.expired","ad.failed","ad.loginName","ad.maxMatches","ad.policyObject","ad.productVersion","ad.requestUrlFileName","ad.severityType","ad.sourceHost","ad.sourceIp","ad.sourceZoneName","ad.systemDeleted","ad.timeStamp","ad.totalComputers","agentAddress","agentHostName","agentId","agentMacAddress","agentReceiptTime","agentTimeZone","agentType","agentVersion","agentZoneURI","applicationProtocol","baseEventCount","bytesIn","bytesOut","categoryBehavior","categoryDeviceGroup","categoryDeviceType","categoryObject","categoryOutcome","categorySignificance","cefVersion","customerURI","destinationAddress","destinationDnsDomain","destinationHostName","destinationNtDomain","destinationProcessName","destinationServiceName","destinationTimeZone","destinationUserId","destinationUserName","destinationUserPrivileges","destinationZoneURI","deviceAction","deviceAddress","deviceCustomDate1","deviceCustomDate1Label","deviceCustomIPv6Address3","deviceCustomIPv6Address3Label","deviceCustomNumber1","deviceCustomNumber1Label","deviceCustomNumber2","deviceCustomNumber2Label","deviceCustomNumber3","deviceCustomNumber3Label","deviceCustomString1","deviceCustomString1Label","deviceCustomString2","deviceCustomString2Label","deviceCustomString3","deviceCustomString3Label","deviceCustomString4","deviceCustomString4Label","deviceCustomString5","deviceCustomString5Label","deviceCustomString6","deviceCustomString6Label","deviceEventCategory","deviceEventClassId","deviceHostName","deviceNtDomain","deviceProcessName","deviceProduct","deviceReceiptTime","deviceSeverity","deviceVendor","deviceVersion","deviceZoneURI","en
dTime","eventId","eventOutcome","externalId","facility","facility_label","fileName","fileType","flexString1Label","flexString2","geid","highlight","host","message","name","oldFileHash","priority","reason","requestClientApplication","requestMethod","requestUrl","severity","severity_label","sort","sourceAddress","sourceHostName","sourceNtDomain","sourceProcessName","sourceServiceName","sourceUserId","sourceUserName","sourceZoneURI","startTime","tags","type"
2021-07-27 14:11:39,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,

file2

"@timestamp","@version","_id","_index","_type","ad.EventRecordID","ad.InitiatorID","ad.InitiatorType","ad.Opcode","ad.ProcessID","ad.TargetSid","ad.ThreadID","ad.Version","ad.agentZoneName","ad.analyzedBy","ad.command","ad.completed","ad.customerName","ad.databaseTable","ad.description","ad.destinationHosts","ad.destinationZoneName","ad.deviceZoneName","ad.expired","ad.failed","ad.loginName","ad.maxMatches","ad.policyObject","ad.productVersion","ad.requestUrlFileName","ad.severityType","ad.sourceHost","ad.sourceIp","ad.sourceZoneName","ad.systemDeleted","ad.timeStamp","agentAddress","agentHostName","agentId","agentMacAddress","agentReceiptTime","agentTimeZone","agentType","agentVersion","agentZoneURI","applicationProtocol","baseEventCount","bytesIn","bytesOut","categoryBehavior","categoryDeviceGroup","categoryDeviceType","categoryObject","categoryOutcome","categorySignificance","cefVersion","customerURI","destinationAddress","destinationDnsDomain","destinationHostName","destinationNtDomain","destinationProcessName","destinationServiceName","destinationTimeZone","destinationUserId","destinationUserName","destinationZoneURI","deviceAction","deviceAddress","deviceCustomDate1","deviceCustomDate1Label","deviceCustomIPv6Address3","deviceCustomIPv6Address3Label","deviceCustomNumber1","deviceCustomNumber1Label","deviceCustomNumber2","deviceCustomNumber2Label","deviceCustomNumber3","deviceCustomNumber3Label","deviceCustomString1","deviceCustomString1Label","deviceCustomString2","deviceCustomString2Label","deviceCustomString3","deviceCustomString3Label","deviceCustomString4","deviceCustomString4Label","deviceCustomString5","deviceCustomString5Label","deviceCustomString6","deviceCustomString6Label","deviceEventCategory","deviceEventClassId","deviceHostName","deviceNtDomain","deviceProcessName","deviceProduct","deviceReceiptTime","deviceSeverity","deviceVendor","deviceVersion","deviceZoneURI","endTime","eventId","eventOutcome","externalId","facility","facility_label","fileName"
,"fileType","flexString1Label","flexString2","geid","highlight","host","message","name","oldFileHash","priority","reason","requestClientApplication","requestMethod","requestUrl","severity","severity_label","sort","sourceAddress","sourceHostName","sourceNtDomain","sourceProcessName","sourceServiceName","sourceUserId","sourceUserName","sourceZoneURI","startTime","tags","type"
2021-07-28 14:11:39,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,

file3

"@timestamp","@version","_id","_index","_type","ad.EventRecordID","ad.InitiatorID","ad.InitiatorType","ad.Opcode","ad.ProcessID","ad.TargetSid","ad.ThreadID","ad.Version","ad.agentZoneName","ad.analyzedBy","ad.command","ad.completed","ad.customerName","ad.databaseTable","ad.description","ad.destinationHosts","ad.destinationZoneName","ad.deviceZoneName","ad.expired","ad.failed","ad.loginName","ad.maxMatches","ad.policyObject","ad.productVersion","ad.requestUrlFileName","ad.severityType","ad.sourceHost","ad.sourceIp","ad.sourceZoneName","ad.systemDeleted","ad.timeStamp","agentAddress","agentHostName","agentId","agentMacAddress","agentReceiptTime","agentTimeZone","agentType","agentVersion","agentZoneURI","applicationProtocol","baseEventCount","bytesIn","bytesOut","categoryBehavior","categoryDeviceGroup","categoryDeviceType","categoryObject","categoryOutcome","categorySignificance","cefVersion","customerURI","destinationAddress","destinationDnsDomain","destinationHostName","destinationNtDomain","destinationProcessName","destinationServiceName","destinationTimeZone","destinationUserId","destinationUserName","destinationZoneURI","deviceAction","deviceAddress","deviceCustomDate1","deviceCustomDate1Label","deviceCustomIPv6Address3","deviceCustomIPv6Address3Label","deviceCustomNumber1","deviceCustomNumber1Label","deviceCustomNumber2","deviceCustomNumber2Label","deviceCustomNumber3","deviceCustomNumber3Label","deviceCustomString1","deviceCustomString1Label","deviceCustomString2","deviceCustomString2Label","deviceCustomString3","deviceCustomString3Label","deviceCustomString4","deviceCustomString4Label","deviceCustomString5","deviceCustomString5Label","deviceCustomString6","deviceCustomString6Label","deviceEventCategory","deviceEventClassId","deviceHostName","deviceNtDomain","deviceProcessName","deviceProduct","deviceReceiptTime","deviceSeverity","deviceVendor","deviceVersion","deviceZoneURI","endTime","eventId","eventOutcome","externalId","facility","facility_label","fileName"
,"fileType","flexString1Label","flexString2","geid","highlight","host","message","name","oldFileHash","priority","reason","requestClientApplication","requestMethod","requestUrl","severity","severity_label","sort","sourceAddress","sourceHostName","sourceNtDomain","sourceProcessName","sourceServiceName","sourceUserId","sourceUserName","sourceZoneURI","startTime","tags","type"
2021-08-28 14:11:39,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,

file4

"@timestamp","@version","_id","_index","_type","ad.EventRecordID","ad.InitiatorID","ad.InitiatorType","ad.Opcode","ad.ProcessID","ad.TargetSid","ad.ThreadID","ad.Version","ad.agentZoneName","ad.analyzedBy","ad.command","ad.completed","ad.customerName","ad.databaseTable","ad.description","ad.destinationHosts","ad.destinationZoneName","ad.deviceZoneName","ad.expired","ad.failed","ad.loginName","ad.maxMatches","ad.policyObject","ad.productVersion","ad.requestUrlFileName","ad.severityType","ad.sourceHost","ad.sourceIp","ad.sourceZoneName","ad.systemDeleted","ad.timeStamp","agentAddress","agentHostName","agentId","agentMacAddress","agentReceiptTime","agentTimeZone","agentType","agentVersion","agentZoneURI","applicationProtocol","baseEventCount","bytesIn","bytesOut","categoryBehavior","categoryDeviceGroup","categoryDeviceType","categoryObject","categoryOutcome","categorySignificance","cefVersion","customerURI","destinationAddress","destinationDnsDomain","destinationHostName","destinationNtDomain","destinationProcessName","destinationServiceName","destinationTimeZone","destinationUserId","destinationUserName","destinationZoneURI","deviceAction","deviceAddress","deviceCustomDate1","deviceCustomDate1Label","deviceCustomIPv6Address3","deviceCustomIPv6Address3Label","deviceCustomNumber1","deviceCustomNumber1Label","deviceCustomNumber2","deviceCustomNumber2Label","deviceCustomNumber3","deviceCustomNumber3Label","deviceCustomString1","deviceCustomString1Label","deviceCustomString2","deviceCustomString2Label","deviceCustomString3","deviceCustomString3Label","deviceCustomString4","deviceCustomString4Label","deviceCustomString5","deviceCustomString5Label","deviceCustomString6","deviceCustomString6Label","deviceEventCategory","deviceEventClassId","deviceHostName","deviceNtDomain","deviceProcessName","deviceProduct","deviceReceiptTime","deviceSeverity","deviceVendor","deviceVersion","deviceZoneURI","endTime","eventId","eventOutcome","externalId","facility","facility_label","fileName"
,"fileType","flexString1Label","flexString2","geid","highlight","host","message","name","oldFileHash","priority","reason","requestClientApplication","requestMethod","requestUrl","severity","severity_label","sort","sourceAddress","sourceHostName","sourceNtDomain","sourceProcessName","sourceServiceName","sourceUserId","sourceUserName","sourceZoneURI","startTime","tags","type"
2021-08-28 14:11:39,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,

Those are 4 of the 20 files; I included all the headers but no real rows because they contain sensitive data.

When I run the script on those sample files, I can see that it writes the timestamp values. But when I run it against the original files (with a lot of data), all it does is write the header, and that's it. Please let me know if you need more info.

Once I run the script on the original files, this is what I get back:

[screenshot of the merged CSV opened in Excel]

There are 20 rows (one for each file) but it doesn't write the content of each file. Could this be related to the sniffing of the first line? I think the script only checks the first line of each file and then moves on. So how does it manage to copy and merge the content as well when the files are small?

Nayden Van
  • It's still not clear whether you know the number of columns in advance, or whether you are looking for a solution which sniffs out the number of columns from all the files. Is it enough to examine the first line of each file? Do you need to cope with CSV quoting etc.? – tripleee Jul 28 '21 at 18:55
  • Sorry, I don't know the exact number of columns. Roughly I can tell they are between 190 and 200, but eventually I can print that out. Regarding the quoting, as far as I know I shouldn't have any problem with that – Nayden Van Jul 28 '21 at 19:03
  • To be clear - you want a python solution despite showing an awk script in your question and tagging it with awk, right? Please [edit] your question to clarify that and add a [mcve] with concise, testable sample input and expected output that demonstrates your problem and we can copy/paste to test with. – Ed Morton Jul 28 '21 at 20:26
  • Thanks for the edit. However, the examples seem to be invalid; they contain header lines with a different number of fields than the actual data. – tripleee Jul 29 '21 at 09:25
  • Adding a screen shot from Excel is dubious and useless at best. What does the actual resulting file look like? A hex dump maybe. But again, debugging an unrelated problem might be better addressed in a separate question. – tripleee Jul 29 '21 at 09:57

2 Answers


Your question isn't clear; I don't know if you really want a solution in awk or Python (or either), and it doesn't have any sample input/output we can test with, so this is a guess, but is this what you're trying to do (using any awk in any shell on every Unix box)?

$ head file{1..2}.csv
==> file1.csv <==
1,2
a,b
c,d

==> file2.csv <==
1,2,3
x,y,z

$ cat tst.awk
BEGIN {
    FS = OFS = ","
    for (i=1; i<ARGC; i++) {
        if ( (getline < ARGV[i]) > 0 ) {
            if ( NF > maxNF ) {
                maxNF = NF
                hdr   = $0
            }
        }
    }
}
NR == 1 { print hdr }
FNR > 1 { NF=maxNF; print }

$ awk -f tst.awk file{1..2}.csv
1,2,3
a,b,
c,d,
x,y,z

See http://awk.freeshell.org/AllAboutGetline for details on when/how to use getline and its associated caveats.

Alternatively with an assist from GNU head for -q:

$ cat tst.awk
BEGIN { FS=OFS="," }
NR == FNR {
    if ( NF > maxNF ) {
        maxNF = NF
        hdr   = $0
    }
    next
}
!doneHdr++ { print hdr }
FNR > 1 { NF=maxNF; print }

$ head -q -n 1 file{1..2}.csv | awk -f tst.awk - file{1..2}.csv
1,2,3
a,b,
c,d,
x,y,z
Ed Morton
  • @tripleee thanks for the heads up (sorry, couldn't resist). I've updated my answer to say that requires GNU head. – Ed Morton Jul 29 '21 at 12:00

As already explained in your original question, you can easily extend the columns in Awk if you know how many to expect.

awk -F ',' -v cols=5 'BEGIN { OFS=FS }
FNR == 1 && NR > 1 { next }
NF<cols { for (i=NF+1; i<=cols; ++i) $i = "" }
1' *.csv >file.csv

I slightly refactored this to skip the unwanted lines with next rather than vice versa; this simplifies the rest of the script slightly. I also added the missing comma separator.

You can easily print the number of columns in each file, and just note the maximum:

awk -F , 'FNR==1 { print NF, FILENAME }' *.csv

If you don't know how many fields there are going to be in files you do not yet have, or if you need to cope with complex CSV with quoted fields, maybe switch to Python for this. It's not too hard to do the field number sniffing in Awk, but coping with quoting is tricky.

import csv
import sys

# Sniff just the first line from every file
fields = 0
for filename in sys.argv[1:]:
    with open(filename) as raw:
        for row in csv.reader(raw):
            # If the line is longer than current max, update
            if len(row) > fields:
                fields = len(row)
                titles = row
            # Break after first line, skip to next file
            break

# Now do the proper reading
writer = csv.writer(sys.stdout)
writer.writerow(titles)

for filename in sys.argv[1:]:
    with open(filename) as raw:
        for idx, row in enumerate(csv.reader(raw)):
            if idx == 0:
                continue
            row.extend([''] * (fields - len(row)))
            writer.writerow(row)

This simply assumes that the additional fields go at the end. If the files could have extra columns between other columns, or columns in different order, you need a more complex solution (though not by much; the Python CSV DictReader subclass could do most of the heavy lifting).
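A sketch of that name-based approach, for the curious (the `merge_by_name` helper and in-memory inputs are hypothetical, just to illustrate the idea):

```python
import csv
import io

def merge_by_name(csv_texts):
    """Merge CSV data by column *name* rather than position, so reordered
    or interleaved extra columns still line up. Illustrative sketch only."""
    # First pass: union all header names, preserving first-seen order
    all_fields = []
    readers = []
    for text in csv_texts:
        reader = csv.DictReader(io.StringIO(text))
        for name in reader.fieldnames:
            if name not in all_fields:
                all_fields.append(name)
        readers.append(reader)

    # Second pass: write every row under the combined header;
    # restval="" fills in the columns a given file doesn't have
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=all_fields, restval="")
    writer.writeheader()
    for reader in readers:
        for row in reader:
            writer.writerow(row)
    return out.getvalue()

# A file with columns a,b and a file with columns b,c merge into a,b,c
print(merge_by_name(["a,b\n1,2\n", "b,c\n3,4\n"]))
```

Because `DictReader` keys each value by its header name and `DictWriter` fills missing columns via `restval`, this handles both the extra-column and the different-order cases in one pass per file.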

Demo: https://ideone.com/S998l4

If you wanted to do the same type of sniffing in Awk, you basically have to specify the names of the input files twice, or do some nontrivial processing in the BEGIN block to read all the files before starting the main script.

tripleee
  • Thank you so much for your help. I ran a script to output the total number of columns I should expect, and the result is 141. After that I ran your script; the output has 141 columns in total and they seem to be in order as well. I just have one question: when I open the generated `final.csv` I see only the headers, and the rows are empty. The script has a writerow that doesn't seem to be writing them – Nayden Van Jul 28 '21 at 20:35
  • As you can see from the demo link, it works with a toy example. Perhaps you can provide a [mcve] with example files where it doesn't work? – tripleee Jul 29 '21 at 04:24
  • I did add some files. Thank you very much for your help – Nayden Van Jul 29 '21 at 09:05
  • Seems to work fine for me, https://ideone.com/5uZJQ7 -- or can you point out what's wrong here? Notice also the warning messages (I added a simple Awk snippet which attempts to validate the number of fields). – tripleee Jul 29 '21 at 09:21
  • That is really weird. In my case it also works and it does write the rows, but when I run the code on my original files, which have roughly 250k rows each, the output file contains only the headers. I will share a screenshot – Nayden Van Jul 29 '21 at 09:48
  • Screenshots don't help at all. Try to investigate whether the files have funky line endings etc. See also [Are shell scripts sensitive to encoding and line endings?](https://stackoverflow.com/questions/39527571/are-shell-scripts-sensitive-to-encoding-and-line-endings) – tripleee Jul 29 '21 at 09:52
  • Thank you so much tripleee. I am investigating now. Sorry for bothering you – Nayden Van Jul 29 '21 at 09:54
  • No worries. Perhaps (accept this solution and) post a new question with more debugging details about your specific scenario. – tripleee Jul 29 '21 at 09:56
  • If I only could I would not only accept the answer but make a statue in your honour. Thank you so much – Nayden Van Jul 29 '21 at 09:59