1

I have two CSV files

input.csv:

id,scenario,data1,data2,result
1,s1,300,400,"{s1,not added}"
2,s2,500,101,"{s2 added}"
3,s3,600,202,

output.csv

id,result
1,"{s1,added}"
3,"{s3,added}"

I want to combine this two CSVs using Shell/Python scripting such that the output is as follows:

final_output.csv

id,scenario,data1,data2,result
1,s1,300,400,"{s1,added}"
2,s2,500,101,"{s2 added}"
3,s3,600,202,"{s3,added}"

Conditions: 1. column to join both csv is "id" column

  1. result column data if present in output.csv then override the value. If it is not present then keep as it is

Can you please help?

  • Why did you tag `python` and `csv` then? Remove it. If you can use python, then it's easy to do it pandas. – Mohammad Yusuf Jan 14 '17 at 04:42
  • 1
    Can you show some code that you have tried? We can help you debug it. – Sid Jan 14 '17 at 04:43
  • @nu11p01n73R Agree - while the question is not an _exact_ duplicate (dupe target has same number of rows in both files), the top answer covers everything required. – Benjamin W. Jan 14 '17 at 04:49
  • You can also do this using awk, http://stackoverflow.com/a/5467806/3150943 – nu11p01n73R Jan 14 '17 at 05:01
  • Thank you all for replying. @Sid I tried using join -t , -1 1 -2 1 test.csv output2.csv > final_output.csv . It doesn't work. Also I am changing my question a little bit.. Hey , MYGz Python would work too.. but my aim was getting through shell first. – Madhura Mhatre Jan 15 '17 at 20:36
  • The linked question is not a duplicate, because this question has an added requirement: to conditionally retain the LHS value if there's no RHS match, which `join` alone cannot do. `awk` is not a good supplement / alternative, because it doesn't understand double-quoted field values with embedded field separators. – mklement0 Jan 17 '17 at 13:07

2 Answers2

0

Here is a solution that uses bash builtins only. Put the following in a script file, make it executable, and run it in a directory where your two .csv files are located.

#!/bin/bash -ue

declare -A output_map

# Pattern representing 0 or more spacing characters
space="[[:space:]]*"

# Pattern for fields
field="$space([^,]*)$space"
last_field="$space(.*)$space"

# Build map of key/value in output.csv
while IFS= read -r line
do
  [[ "$line" =~ ^$field,$last_field$ ]] || continue
  key="${BASH_REMATCH[1]}"
  value="${BASH_REMATCH[2]}"
  output_map[$key]="$value"
done <"output.csv"

# Perform merge of the two files
while IFS= read -r line
do
  [[ "$line" =~ ^$field,$field,$field,$field,$last_field$ ]] || continue
  f1="${BASH_REMATCH[1]}"
  f2="${BASH_REMATCH[2]}"
  f3="${BASH_REMATCH[3]}"
  f4="${BASH_REMATCH[4]}"
  f5="${BASH_REMATCH[5]}"
  value="${output_map[$f1]-}"
  [[ -z "$value" ]] || f5="$value"
  echo "$f1,$f2,$f3,$f4,$f5"
done <"input.csv"

It is not especially compact, but should be relatively to figure out if you understand pattern matching in bash conditionals (the =~ operator).

Please note that lines that do not match the right format are ignored and that the line with the headers does not require any special handling.

If you have any question, let me know.

Fred
  • 6,590
  • 9
  • 20
  • 1
    It is better, and still habits die hard after spending years writing scripts with all uppercase variable names before stumbling on this recommendation. I have updated the code. – Fred Jan 17 '17 at 03:14
  • Just a heads-up for users stuck with older Bash versions, such as on macOS: use of associative arrays (`declare -A`) requires Bash v4+. – mklement0 Jan 17 '17 at 04:54
0

join will get you most of the way there - assuming the input files are already sorted by the join field - but the requirement of conditionally keeping the left input file's value requires additional work.

That additional work is complicated by the fact that some of your field values are double-quoted with embedded separators, which standard utilities such as awk and sed handle poorly.

A python-assisted solution:

join -t, -a1 input.csv output.csv | python -c '
import csv, sys
for row in csv.reader(sys.stdin):  
  if(len(row)>5):
    row[4] = row[5]
    del row[5:]
  row[4] = "\"" + row[4] + "\""
  print(",".join(row))
'

A perl-assisted solution:

join -t, -a1 input.csv output.csv | perl -MText::Parsewords -lne '
    my @flds = Text::ParseWords::parse_line(",", 1, $_);
    if ($#flds >= 5) { $flds[4] = $flds[5]; $#flds = 4 };
    print join(",", @flds);
  '
mklement0
  • 382,024
  • 64
  • 607
  • 775