Transforming a field in a csv file and resaving to another file with bash

Question

I apologize in advance if this seems like a simple question. However, I am a beginner in bash commands and scripting, so I hope you all understand why I am not able to solve this on my own.

What I want to achieve is to change the values in one field of a csv file to uppercase, and then resave the csv file with the transformed field and all the other fields included, each retaining their index.

For instance, I have this csv:

1,Jun 4 2021,car,4856
2,Jul 31 2021,car,4154
3,Aug 14 2021,bus,4070
4,Aug 2 2021,car,4095

I want to transform the third field that holds the vehicle type into uppercase - CAR, BUS, etc. and then resave the csv file with the transformed field.

I have tried using the 'tr' command thus:

cut -d"," -f4 data.csv | tr '[:lower:]' '[:upper:]'

This takes the field and does the transformation. But how do I paste and replace the column in the csv file? It did not work because the field argument cannot be passed into the tr command.

You say you want to transform the 3rd field but you also say `cut -d"," -f4 data.csv` (note - selecting the **4th** field) works, please explain or fix that. Also, clarify what `each retaining their index` means. — Ed Morton, Jan 16 '22 at 10:06
@Ed Morton, That was a typo from my end. `-f4` should be `-f3`. Apologies. What I am trying to do is to transform the third field in the input csv file named data.csv into uppercase, then resave the csv file with another name. I hope that gives more clarity — Quddus A., Jan 16 '22 at 11:12
No problem, please just [edit] your question to fix that and explain what `each retaining their index` means or remove that statement if it's meaningless. — Ed Morton, Jan 16 '22 at 11:13

anubhava · Answer 1 · 2022-01-16T08:26:18.563

1

A gnu sed solution:

sed -i -E 's/^(([^,]+,){2})([^,]+)/\1\U\3/' file.csv

cat file

1,Jun 4 2021,CAR,4856
2,Jul 31 2021,CAR,4154
3,Aug 14 2021,BUS,4070
4,Aug 2 2021,CAR,4095

Explanation:

^: Start
(([^,]+,){2}): Match first 2 fields and capture them in group #1
([^,]+): Match 3rd field and capture it in group #3
\1: Put capture value of group #1 back in replacement
\U\3: Put uppercase capture value of group #3 back in replacement

Or a gnu-awk solution:

awk -i inplace 'BEGIN {FS=OFS=","} {$3 = toupper($3)} 1' file.csv

edited Jan 16 '22 at 08:26

answered Jan 16 '22 at 08:19

anubhava

761,203
64
569
643

Thank you. The awk solution does not seem to work on my system. It says -i is not an option. The sed solution works normally, however, it failed when I added it to my Airflow DAG file. Submission failed. It reported some syntax (Unicode) error. I think the issue was with the fields selection {2}. It was trying to read it as bytes even when the whole bash command has been correctly quoted into a string. What do you think I can do? Is there any way to bypass this hurdle? Perhaps, a modification to the sed command? – Quddus A. Jan 16 '22 at 10:56
1

That's why it says that you need GNU Awk specifically. With a regular Awk, you can remove the `-i inplace` and save the output to a temporary file, then move it back on top of the original file. – tripleee Jan 16 '22 at 11:06
As @tripleee has rightly suggested you can use: `awk 'BEGIN {FS=OFS=","} {$3 = toupper($3)} 1' file.csv > _out && mv _out file.csv` to get it working in any non-gnu `awk`. I have no idea about the `Airflow DAG file`. `gnu sed` command works fine on any terminal or in any shell script. – anubhava Jan 16 '22 at 11:18

score 1 · Answer 2 · answered Jan 16 '22 at 08:25

With GNU awk:

awk -i inplace 'BEGIN{FS=","; OFS=","} {$3=toupper($3)} {print}' file

Output to file:

1,Jun 4 2021,CAR,4856
2,Jul 31 2021,CAR,4154
3,Aug 14 2021,BUS,4070
4,Aug 2 2021,CAR,4095

See: How can I change a certain field of a file into upper-case using awk?, Save modifications in place with awk and 8 Powerful Awk Built-in Variables – FS, OFS, RS, ORS, NR, NF, FILENAME, FNR

score 1 · Answer 3 · answered Jan 16 '22 at 10:20

Using the cut and tr, you need to add paste to the mix.

SEP=","
IN="data.csv"

paste -d$SEP \
  <( <$IN cut -d$SEP -f1,2 ) \
  <( <$IN cut -d$SEP -f3 | tr '[:lower:]' '[:upper:]' ) \
  <( <$IN cut -d$SEP -f4 )

I did factor out the repeating things - separator and input file - into variables SEP and IN respectively.

How it all works:

get the untransformed columns before #3
get col #3 and transform it with tr
get the remaining columns
paste it all together, line by line
the need for intermediate files is avoided by using shell substitution

Downsides:

the data seems to be read 3 times, but disk cache will help a lot
the data is parsed 3 times, for sure (by cut)
but unless your input is a few gigabytes, this does not matter

Transforming a field in a csv file and resaving to another file with bash

3 Answers3