Merging two files with columns of different lengths and possible comments into a CSV-like file

Question

file1 looks like

# dsd
# dsd
1,2,5
2,3,5
1,2,5
2,3,5
3,4,5
3,4,5

file2 looks like

# s
1,2
1,2

I want to merge them to get

# dsd
# dsd
1,2,5,1,2
2,3,5,1,2
1,2,5,,
2,3,5,,
3,4,5,,
3,4,5,,

That is, I want to keep the comment lines (`#`) from the first file. After those comment lines, I want to paste the columns from the second file, padded to the column length of the first file. Any comment lines in the second file should be ignored.

I started with:

 paste $(grep -v '^#' file1) file2

but I get bash: /usr/bin/paste: Argument list too long

I guess this would be a job for awk, but I am only familiar with single-file processing, and I have only found examples that deal with files of the same length. Is there an easy way, or does one need a longer bash script, Python, or similar?

To avoid the error you got, you need to use process substitution `<()` instead of command substitution `$()`. You'll still need to modify the command a lot to get your expected output, though. – Sundeep, Apr 19 '23 at 15:13
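A minimal sketch of the point made in this comment, assuming bash (process substitution is not available in plain POSIX sh). The scratch directory and the `pasted.csv` name are illustrative, not part of the question. With `$(...)`, every word of the grep output ends up in paste's argument list, which is what overflows on a large file; with `<(...)`, paste receives one readable path per stream.

```shell
#!/usr/bin/env bash
# Recreate the sample files from the question in a scratch directory.
cd "$(mktemp -d)" || exit 1
printf '%s\n' '# dsd' '# dsd' 1,2,5 2,3,5 1,2,5 2,3,5 3,4,5 3,4,5 > file1
printf '%s\n' '# s' 1,2 1,2 > file2

# Each <(...) is passed to paste as a single file-like argument,
# so the size of the grep output no longer matters.
paste -d, <(grep -v '^#' file1) <(grep -v '^#' file2) > pasted.csv
cat pasted.csv
```

This runs without the error, but the result is still not the desired output: the comment lines of file1 are gone, and the rows that have no partner in file2 end in a single comma (`1,2,5,`) instead of two.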

score 2 · Accepted Answer · answered Apr 19 '23 at 15:00

You may use this awk solution:

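awk -v OFS=, '
# First file on the command line (file2): store its non-comment lines
NR == FNR {
   if (!/^#/)
      a[++i] = $0
   next
}
# Second file (file1): print comments unchanged, pad or extend the data rows
{
   if (/^#/)
      print
   else {
      ++NR2                  # number of non-comment lines seen so far in file1
      if (NR2 in a)
         print $0, a[NR2]
      else
         print $0,"",""
   }
}' file2 file1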

# dsd
# dsd
1,2,5,1,2
2,3,5,1,2
1,2,5,,
2,3,5,,
3,4,5,,
3,4,5,,

answered Apr 19 '23 at 15:00

anubhava

It works with this simple example, but interestingly, when I use it on my data, the line `print $0,"",""` does not append new columns; instead, it starts overwriting `$0` from the first character. E.g. if `$0="abc,cde"`, `print $0,"",""` generates `,,c,cde`. Even more interesting, if I use `$1,$2,"",""` instead of `$0`, it correctly generates `abc,cde,,`. Do you know why? – atapaka Apr 19 '23 at 17:52
You have DOS line endings, see [why-does-my-tool-output-overwrite-itself-and-how-do-i-fix-it](https://stackoverflow.com/questions/45772525/why-does-my-tool-output-overwrite-itself-and-how-do-i-fix-it). – Ed Morton Apr 19 '23 at 17:53
@atapaka: Trust your problem got resolved with Ed's comment that you have DOS line endings. – anubhava Apr 19 '23 at 19:03
@anubhava Yes, that was the issue. – atapaka Apr 24 '23 at 23:29
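As a hedged illustration of the DOS line-ending issue discussed in these comments: a trailing carriage return stays at the end of `$0`, so anything printed after it is drawn from column 0 on a terminal. One common remedy, sketched below, is to strip the CR inside awk before printing. The file name `dos.csv` and the variables `broken` and `fixed` are made up for this example.

```shell
cd "$(mktemp -d)" || exit 1
# One row with a CRLF ending, as a Windows-edited file would contain.
printf 'abc,cde\r\n' > dos.csv

# Unchanged input: the CR sits between "cde" and the appended commas.
broken=$(awk -v OFS=, '{ print $0, "", "" }' dos.csv)

# Strip a trailing CR first, then append the two empty fields.
fixed=$(awk -v OFS=, '{ sub(/\r$/, ""); print $0, "", "" }' dos.csv)
printf '%s\n' "$fixed"    # prints abc,cde,,
```

Converting the input files once (for example with `tr -d '\r'`, or `dos2unix` where it is installed) would have the same effect for every later command.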

score 2 · Answer 2 · answered Apr 19 '23 at 17:50

Using any awk:

$ cat tst.awk
BEGIN { FS=OFS="," }
FNR == 1 {
    lineNr = 0
    dflt = a[1]
    gsub("[^"FS"]+","",dflt)
}
/^#/ {
    if ( NR != FNR ) {
        print
    }
    next
}
{ ++lineNr }
NR == FNR {
    a[lineNr] = $0
    next
}
{ print $0, (lineNr in a ? a[lineNr] : dflt) }

$ awk -f tst.awk file2 file1
# dsd
# dsd
1,2,5,1,2
2,3,5,1,2
1,2,5,,
2,3,5,,
3,4,5,,
3,4,5,,

answered Apr 19 '23 at 17:50

Ed Morton

What does the part with `FNR == 1` do? And why is `lineNr` incremented outside of the `NR==FNR` block? That way it counts lines in both files; is that necessary? – atapaka Apr 19 '23 at 17:59
`FNR==1` is true for the first line in each file so at the start of each file it sets the `lineNr` to `0` for that file. `lineNr` is the non-# line number in each file so yes, it's necessary to increment it for each file. Note that it's not the count of lines across both files, it's the count of lines within each file. – Ed Morton Apr 19 '23 at 18:16
Oh, and the other thing being done in the `FNR==1` block is to create a default string of consecutive `,`s to print for lines present in file1 that weren't present in file2. That way you don't have to hard-code how many `,`s to print for the last 4 lines of file1, you just get as many as there were in the first non-# line of file2. – Ed Morton Apr 19 '23 at 18:23
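The default-string trick described in this comment can be tried in isolation. The sketch below wraps the same `gsub` call in a small shell function; `commas_only` is a hypothetical helper name introduced here, not something from the answer.

```shell
# commas_only ROW: print ROW with everything except the separators removed,
# using the same gsub("[^"FS"]+", "", ...) idea as the FNR == 1 block.
commas_only() {
    printf '%s\n' "$1" | awk 'BEGIN { FS = "," } { gsub("[^" FS "]+", ""); print }'
}

dflt=$(commas_only '1,2')       # a two-field row leaves a single comma
printf '%s\n' "3,4,5,$dflt"     # prints 3,4,5,,
commas_only '7,8,9'             # prints ,,
```

A row with n fields leaves n-1 commas, and the `OFS` that `print $0, dflt` places in front of them supplies the last one, so the padding always matches the width of file2.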

aborruso · Answer 3 · 2023-04-20T16:22:17.377

Using the great Miller, paste, cat and grep, you could run

paste -d ',' <(grep -v '^#' file1.txt) <(grep -v '^#' file2.txt) | mlr --csv -N --ragged cat >output
<file1.txt grep -P '^#' | cat - output > tmp.txt && mv tmp.txt output

to get

# dsd
# dsd
1,2,5,1,2
2,3,5,1,2
1,2,5,,
2,3,5,,
3,4,5,,
3,4,5,,

The steps:

merge the two input files horizontally, removing the comment lines (via paste and grep);
add the missing commas (via mlr);
add the comment lines of the first file to the merged one (via grep and cat).

edited Apr 20 '23 at 16:22

answered Apr 20 '23 at 06:22

aborruso

You don't need the non-portable `-P` in `grep -P '^#'` as `^#` is just a BRE as grep uses by default. You also don't need a temp file if you do `{ grep '^#' file; paste ... | mlr...; } > output` – Ed Morton Apr 20 '23 at 12:01
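A sketch of the temp-file-free layout suggested in this comment, assuming bash. To keep the example runnable without Miller, the `mlr` padding step is swapped for a small awk filter that pads every row to the width of the first pasted row; that substitution is an assumption of this sketch, and it only holds if file2 has at least one data row. The file names follow the question (`file1`, `file2`) rather than the answer.

```shell
#!/usr/bin/env bash
cd "$(mktemp -d)" || exit 1
printf '%s\n' '# dsd' '# dsd' 1,2,5 2,3,5 1,2,5 2,3,5 3,4,5 3,4,5 > file1
printf '%s\n' '# s' 1,2 1,2 > file2

# Everything inside { ...; } shares one redirection, so the comment lines
# and the merged rows land in the same file without an intermediate copy.
{
    grep '^#' file1
    paste -d, <(grep -v '^#' file1) <(grep -v '^#' file2) |
        # Stand-in for the mlr step: assigning to field n extends short rows.
        awk -F, -v OFS=, 'NR == 1 { n = NF } NF < n { $n = "" } { print }'
} > output
cat output
```

With Miller installed, the awk line can be replaced by the `mlr --csv -N --ragged cat` call from the answer.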

score 0 · Answer 4 · answered Apr 20 '23 at 14:02

Here is a Ruby solution using the CSV module:

ruby -r csv -e '
f1=CSV.read(ARGV[0])
f2=CSV.read(ARGV[1]).select{|row| !row.join("")[/^\s*#/] }
f2=[""]*f1.slice_when{|a,b| b.to_s[/\d/]}.first.length+f2
f2c=f2.max_by{|row| row.length}.length
puts CSV.generate{|csv| 
    f1.zip(f2).each{|row| 
        if row.flatten.join("")[/^\s*#/] 
            csv<<row[0] 
        elsif row[-1].nil?
            csv<<row[0]+[nil]*f2c
        else 
            csv<<row.flatten
        end
    }
}
' file1 file2

This is not limited to the assumption that file2 is only 2 columns.

It DOES assume that file1 is the longer of the two files. Easily changed if that is not true.

Prints:

# dsd
# dsd
1,2,5,1,2
2,3,5,1,2
1,2,5,,
2,3,5,,
3,4,5,,
3,4,5,,
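For the case where file1 is not the longer file, one possible variant is sketched below in awk rather than Ruby. It is not the change the answer has in mind, only an illustration under these assumptions: the comment lines of file1 come before its data rows, both files contain at least one data row, and the sample inputs, the `blanks` function and `merged.csv` are invented for this example.

```shell
cd "$(mktemp -d)" || exit 1
# Illustrative inputs in which file1 is the SHORTER file.
printf '%s\n' '# c' 1,2,5 > file1
printf '%s\n' '# s' 1,2 3,4 > file2

awk '
# blanks(row): the row with everything except its commas removed
function blanks(row) { gsub(/[^,]+/, "", row); return row }
NR == FNR { if (/^#/) print; else a[++n1] = $0; next }   # file1: emit comments, keep rows
!/^#/     { b[++n2] = $0 }                               # file2: keep rows, drop comments
END {
    max = (n1 > n2 ? n1 : n2)
    for (i = 1; i <= max; i++) {
        left  = (i in a) ? a[i] : blanks(a[1])   # pad with the width of file1
        right = (i in b) ? b[i] : blanks(b[1])   # pad with the width of file2
        print left "," right
    }
}' file1 file2 > merged.csv
cat merged.csv
```

Because both files are held in memory until the `END` block, this suits files of moderate size.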
