1

I apologize in advance if this seems like a simple question. However, I am a beginner in bash commands and scripting, so I hope you all understand why I am not able to solve this on my own.

What I want to achieve is to change the values in one field of a csv file to uppercase, and then resave the csv file with the transformed field and all the other fields included, each retaining their index.

For instance, I have this csv:

1,Jun 4 2021,car,4856
2,Jul 31 2021,car,4154
3,Aug 14 2021,bus,4070
4,Aug 2 2021,car,4095

I want to transform the third field that holds the vehicle type into uppercase - CAR, BUS, etc. and then resave the csv file with the transformed field.

I have tried using the 'tr' command thus:

cut -d"," -f4 data.csv | tr '[:lower:]' '[:upper:]'

This takes the field and does the transformation. But how do I paste and replace the column in the csv file? It did not work because the field argument cannot be passed into the tr command.

Quddus A.
  • 33
  • 1
  • 6
  • 1
    You say you want to transform the 3rd field but you also say `cut -d"," -f4 data.csv` (note - selecting the **4th** field) works, please explain or fix that. Also, clarify what `each retaining their index` means. – Ed Morton Jan 16 '22 at 10:06
  • @Ed Morton, That was a typo from my end. `-f4` should be `-f3`. Apologies. What I am trying to do is to transform the third field in the input csv file named data.csv into uppercase, then resave the csv file with another name. I hope that gives more clarity – Quddus A. Jan 16 '22 at 11:12
  • No problem, please just [edit] your question to fix that and explain what `each retaining their index` means or remove that statement if it's meaningless. – Ed Morton Jan 16 '22 at 11:13

3 Answers3

1

A gnu sed solution:

sed -i -E 's/^(([^,]+,){2})([^,]+)/\1\U\3/' file.csv

cat file

1,Jun 4 2021,CAR,4856
2,Jul 31 2021,CAR,4154
3,Aug 14 2021,BUS,4070
4,Aug 2 2021,CAR,4095

Explanation:

  • ^: Start
  • (([^,]+,){2}): Match first 2 fields and capture them in group #1
  • ([^,]+): Match 3rd field and capture it in group #3
  • \1: Put capture value of group #1 back in replacement
  • \U\3: Put uppercase capture value of group #3 back in replacement

Or a gnu-awk solution:

awk -i inplace 'BEGIN {FS=OFS=","} {$3 = toupper($3)} 1' file.csv
anubhava
  • 761,203
  • 64
  • 569
  • 643
  • Thank you. The awk solution does not seem to work on my system. It says -i is not an option. The sed solution works normally, however, it failed when I added it to my Airflow DAG file. Submission failed. It reported some syntax (Unicode) error. I think the issue was with the fields selection {2}. It was trying to read it as bytes even when the whole bash command has been correctly quoted into a string. What do you think I can do? Is there any way to bypass this hurdle? Perhaps, a modification to the sed command? – Quddus A. Jan 16 '22 at 10:56
  • 1
    That's why it says that you need GNU Awk specifically. With a regular Awk, you can remove the `-i inplace` and save the output to a temporary file, then move it back on top of the original file. – tripleee Jan 16 '22 at 11:06
  • As @tripleee has rightly suggested you can use: `awk 'BEGIN {FS=OFS=","} {$3 = toupper($3)} 1' file.csv > _out && mv _out file.csv` to get it working in any non-gnu `awk`. I have no idea about the `Airflow DAG file`. `gnu sed` command works fine on any terminal or in any shell script. – anubhava Jan 16 '22 at 11:18
1

With GNU awk:

awk -i inplace 'BEGIN{FS=","; OFS=","} {$3=toupper($3)} {print}' file

Output to file:

1,Jun 4 2021,CAR,4856
2,Jul 31 2021,CAR,4154
3,Aug 14 2021,BUS,4070
4,Aug 2 2021,CAR,4095

See: How can I change a certain field of a file into upper-case using awk?, Save modifications in place with awk and 8 Powerful Awk Built-in Variables – FS, OFS, RS, ORS, NR, NF, FILENAME, FNR

Cyrus
  • 84,225
  • 14
  • 89
  • 153
1

Using the cut and tr, you need to add paste to the mix.

SEP=","
IN="data.csv"

paste -d$SEP \
  <( <$IN cut -d$SEP -f1,2 ) \
  <( <$IN cut -d$SEP -f3 | tr '[:lower:]' '[:upper:]' ) \
  <( <$IN cut -d$SEP -f4 )

I did factor out the repeating things - separator and input file - into variables SEP and IN respectively.

How it all works:

  • get the untransformed columns before #3
  • get col #3 and transform it with tr
  • get the remaining columns
  • paste it all together, line by line
  • the need for intermediate files is avoided by using shell substitution

Downsides:

  • the data seems to be read 3 times, but disk cache will help a lot
  • the data is parsed 3 times, for sure (by cut)
  • but unless your input is a few gigabytes, this does not matter
liborm
  • 2,634
  • 20
  • 32