So I have a file named testingFruits.csv with the following columns:

name,value_id,size
apple,1,small
mango,2,small
banana,3,medium
watermelon,4,large

I also have an associative array that stores the following data:

fruitSizes[apple] = xsmall
fruitSizes[mango] = small
fruitSizes[banana] = medium
fruitSizes[watermelon] = xlarge

Is there any way I can update the 'size' column within the file, based on the data in the associative array, for each value in the 'name' column?

I've tried using awk but I had no luck. Here's a sample of what I tried to do:

awk -v t="${fruitSizes[*]}" 'BEGIN{n=split(t,arrayval,""); ($1 in arrayval) {$3=arrayval[$1]}' "testingFruits.csv"

I understand this command would get the bash defined array fruitSizes, do a split on all the values, then check if the first column (name) is within the fruitSizes array. If it is, then it would update the third column (size) with the value found in fruitSizes for that specific name.

Unfortunately this gives me the following error:

Argument list too long

This is the expected output I'd like in the same testingFruits.csv file:

name,value_id,size
apple,1,xsmall
mango,2,small
banana,3,medium
watermelon,4,xlarge

One edge case I'd like to handle is the presence of duplicate values in the name column with different values for the value_id and size columns.

  • FYI having a bash associative array as your starting point is probably a bad idea as they're slow and non-portable and make the rest of your script harder to implement, you should instead be using awk to read whatever input you're populating that array from. – Ed Morton Sep 09 '21 at 20:22

1 Answer

If you want to stick to an awk script, pass the array via stdin to avoid running into ARG_MAX issues.

Since your array is associative, listing only the values `${fruitSizes[@]}` is not sufficient. You also need the keys `${!fruitSizes[@]}`. `pr -2` can then pair each key with its value on one line.
This assumes that `${fruitSizes[@]}` and `${!fruitSizes[@]}` expand in the same order, and that your keys and values do not contain the field separator (`,` in this case).

printf %s\\n "${!fruitSizes[@]}" "${fruitSizes[@]}" | pr -t -2 -s, |
awk -F, -v OFS=, 'NR==FNR {a[$1]=$2; next} $1 in a {$3=a[$1]} 1' - testingFruits.csv
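
As a possible alternative to the `pr` pairing, the key/value lines can be produced by looping over the keys in bash. This is only a sketch: it recreates the sample data from the question so that it runs on its own, and the output name `updated.csv` is an assumption. It does not rely on the keys and values expanding in the same order, nor on how `pr` lays out long input in columns. Keys and values must still be free of commas.

```bash
#!/usr/bin/env bash
# Sample file from the question, recreated so the sketch is self-contained.
cat > testingFruits.csv <<'EOF'
name,value_id,size
apple,1,small
mango,2,small
banana,3,medium
watermelon,4,large
EOF

# The associative array from the question (requires bash 4 or newer).
declare -A fruitSizes=([apple]=xsmall [mango]=small [banana]=medium [watermelon]=xlarge)

# Print one "key,value" line per entry and feed those lines to awk on stdin.
# The data travels through the pipe, not the argument list of an external command.
for key in "${!fruitSizes[@]}"; do
  printf '%s,%s\n' "$key" "${fruitSizes[$key]}"
done |
awk -F, -v OFS=, '
  NR==FNR {a[$1]=$2; next}   # stdin: remember the size for each name
  $1 in a {$3=a[$1]}         # CSV: replace the size when the name is known
  1                          # print every line, changed or not
' - testingFruits.csv > updated.csv   # updated.csv is a hypothetical output name
```

The header line passes through unchanged because `name` is not a key in the array.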

However, I'm wondering where the array fruitSizes comes from. If you read it from a file or something like that, it would be easier to leave out the array altogether and do everything in awk.
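
To illustrate what doing everything in awk might look like, here is a minimal sketch. It assumes the sizes live in a hypothetical file `fruitSizes.csv` with one `name,size` pair per line; that file name and layout are not from the question, and `updated.csv` is likewise an assumed output name.

```bash
# Hypothetical lookup file: one "name,size" pair per line.
cat > fruitSizes.csv <<'EOF'
apple,xsmall
mango,small
banana,medium
watermelon,xlarge
EOF

# Sample file from the question.
cat > testingFruits.csv <<'EOF'
name,value_id,size
apple,1,small
mango,2,small
banana,3,medium
watermelon,4,large
EOF

# Same awk program as above, but the first input is a regular file
# instead of stdin, so no bash array is involved at all.
awk -F, -v OFS=, 'NR==FNR {a[$1]=$2; next} $1 in a {$3=a[$1]} 1' \
  fruitSizes.csv testingFruits.csv > updated.csv
```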

Socowi
  • What is the significance of {a[$1]=$2; next} ? – confusedcoder21 Sep 09 '21 at 19:27
  • @confusedcoder21 `a` is the awk-version of `fruitSizes`. We need the mentioned rule to populate that array. `NR==FNR` activates only at the first "file" (`-` stands for stdin). `a[$1]=$2` stores the keys (`$1`) and values (`$2`) from `fruitSizes` in the awk-array `a`. `next` skips all other awk rules and goes to the next line. Therefore, the part `$1 in a {$3=a[$1]} 1` is only executed for the 2nd file. `1` is a shorthand for `{print}`. – Socowi Sep 09 '21 at 19:36
  • So $1 refers to the keys in the fruitSizes and $2 refers to the values. And later $3 refers to the size column in the file? – confusedcoder21 Sep 09 '21 at 19:42
  • How would this change if I had more columns? Say the name column was column #7 and the size column is #13? – confusedcoder21 Sep 09 '21 at 19:45
  • Exactly. The meaning of the columns changes from "file" (stdin `-`) to file (`testingFruits.csv`). If you want to adapt this script, just **ignore the part `NR==FNR {a[$1]=$2; next}`**. That's just the "magic" initialization of the array. What comes after that can be altered however you like. If the name is `$7` and the size is `$13`, use `$7 in a {$13=a[$7]} 1`. The part before that stays the same. – Socowi Sep 09 '21 at 19:52
  • That makes a lot more sense, thank you! Unfortunately, I can't seem to get it working within the bash script. The file does not get updated for some reason. – confusedcoder21 Sep 09 '21 at 20:01
  • Would this fail if there are duplicate values in the name column with different values for the other columns? – confusedcoder21 Sep 09 '21 at 20:25
  • Of course this does not update the file; the script in your question did not either. It just *prints* the altered file to the terminal. To write the changes to *another* file, append `> updated.csv` at the very end. Regarding duplicated names: Duplicated names in `testingFruits.csv` are ok. Each of the matching lines will be updated. The array `fruitSizes` cannot contain duplicates in the first place (if you meant that, your whole approach of using an array is not viable). – Socowi Sep 09 '21 at 20:38
  • For sure, thank you! One thing I am noticing is that it's placing the key of the awk-array a instead of the value for my size column $13. – confusedcoder21 Sep 10 '21 at 12:04
  • I cannot reproduce your problem. Please share a link to a self-contained, minimal example in an online shell like [this one](https://www.onlinegdb.com/fork/6F4YKy_JW). – Socowi Sep 10 '21 at 12:08
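
The points raised in the comments can be combined into one sketch: writing the result back to the same file, handling rows with duplicate names, and using other column numbers. It assumes `mktemp` is available and uses a hypothetical lookup file `pairs.csv`; the awk variables `nameCol` and `sizeCol` are names made up for this example. For the layout mentioned in the comments, only the two `-v` values would change to 7 and 13.

```bash
# Sample data with a duplicated name (apple) and differing other columns.
cat > testingFruits.csv <<'EOF'
name,value_id,size
apple,1,small
apple,5,large
mango,2,small
EOF

# Hypothetical lookup file with "name,size" pairs.
cat > pairs.csv <<'EOF'
apple,xsmall
mango,small
EOF

# awk cannot safely read and overwrite the same file in one step,
# so write to a temporary file first and replace the original on success.
tmp=$(mktemp) || exit 1
if awk -F, -v OFS=, -v nameCol=1 -v sizeCol=3 '
     NR==FNR {a[$1]=$2; next}               # first file: build the lookup
     $nameCol in a {$sizeCol=a[$nameCol]}   # second file: update matching rows
     1                                      # print every line
   ' pairs.csv testingFruits.csv > "$tmp"
then
  mv "$tmp" testingFruits.csv
else
  rm -f "$tmp"
fi
```

Both `apple` rows get the new size, since the lookup is applied to each line independently.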