Edit CSV rows in two different ways

Question

I have a bash script that outputs two CSV columns. In the rows whose second column contains a three-digit number, I need to prepend "f. " to that number, and keep the rest of the rows intact. I have tried different ways so far, but each has failed in one way or another.

My main approach has been to use regular expressions on either the first or the second column to separate the desired rows from the rest, but I can't separate and prepend at the same time without cancelling out or messing up the process somehow. The commands I've used so far include sed and cut, as well as (nested) for loops, while-read loops, if/else and if/elif/else statements, etc. What follows is one such (failed) solution:

for var1 in "^.*_[^f]_.*"
do
    sed -i "" "s:$MSname::" $pathToCSV"_final.csv"
    for var2 in "^.*_f_.*"
    do
        sed -i "" "s:$MSname:f.:" $pathToCSV"_final.csv"
    done
done

And these are some sample rows:

abc_deg0014_0001_a_1.tif,British Library 1 Front Board Outside
abc_deg0014_0002_b_000.tif,British Library 1 Front Board Inside
abc_deg0014_0003_f_001r.tif,British Library 1 001r
abc_deg0014_0004_f_001v.tif,British Library 1 001v
…
abc_deg0014_0267_f_132r.tif,British Library 1 132r
abc_deg0014_0268_f_132v.tif,British Library 1 132v
abc_deg0014_0269_y_999.tif,British Library 1 Back Board Inside
abc_deg0014_0270_z_1.tif,British Library 1 Back Board Outside

Here $MSname = British Library 1 (since with different CSVs the "British Library 1" part can change to other words that I need to remove/replace and that's why I use parameter expansion).

The desired result:

abc_deg0014_0002_b_000.tif,Front Board Inside
abc_deg0014_0003_f_001r.tif,f. 001r
…
abc_deg0014_0268_f_132v.tif,f. 132v
abc_deg0014_0269_y_999.tif,Back Board Inside

If you look closely, you'll notice these rows are also differentiated from the rest by "f" in their first column (the rows that shouldn't get the "f. " in front of their second column are differentiated by "a", "b", "y", and "z", respectively, in the first column).

You set your variable `var1` to the string `^.*_[^f]_.*`, and similarly `var2`, but you never use these variables. What's the point in having them? — user1934428, Jun 18 '19 at 06:20
A `for` loop over a single string is equivalent to just assigning the string to the loop variable, with the minor difference that you can `break` out of the loop to go directly to the line after `done` (which you are not using here anyway). — tripleee, Jun 18 '19 at 07:44

tripleee · Accepted Answer · 2019-06-18T08:04:28.623

You are not using var1 or var2 for anything, and even if you did, looping over variables and repeatedly running sed -i on the same output file is extremely wasteful. Ideally, you would like to write all the modifications into a single sed script, and process the file only once.

Without being able to guess what other strings than "British Library 1" you have and whether those require different kinds of actions, I would suggest something along the lines of

sed -i '/^[^,]*_f_[^,_]*,/s/,British Library 1 /,f. /
    s/,British Library 1 /,/' "${pathToCSV}_final.csv"

Notice how the sed script in single quotes can be wrapped over multiple physical lines. The first line finds any line in which the text between the last pair of underscores in the first comma-separated column is f, and replaces ",British Library 1 " with ",f. ". (I made some adjustments to the spacing here -- I hope they make sense for you.) On the following line, we simply replace any (remaining) occurrences of ",British Library 1 " with just a comma; the idea is that only the lines which didn't match the regex on the previous line will still contain this string, and so we don't have to do another regex match.
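To check the script before letting it rewrite a file, it can be run without -i on a few rows. This is only a sketch: the rows are copied from the question, and the variable names sample and result are made up for the illustration.

```shell
# Four sample rows copied from the question.
sample='abc_deg0014_0002_b_000.tif,British Library 1 Front Board Inside
abc_deg0014_0003_f_001r.tif,British Library 1 001r
abc_deg0014_0268_f_132v.tif,British Library 1 132v
abc_deg0014_0269_y_999.tif,British Library 1 Back Board Inside'

# Same two-line script as above, but reading stdin and writing stdout,
# so nothing on disk is modified.
result=$(printf '%s\n' "$sample" |
    sed '/^[^,]*_f_[^,_]*,/s/,British Library 1 /,f. /
    s/,British Library 1 /,/')

printf '%s\n' "$result"
```

The two f rows should come out as ",f. 001r" and ",f. 132v", and the other two should simply lose the "British Library 1 " prefix.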

This can easily be extended to cover more patterns in the same sed script, rather than repeatedly looping over the file and rewriting one pattern at a time. For example, if your next task is to replace Windsor Palace A with either a. or nothing depending on whether the penultimate underscore-separated subfield in the first field contains a, that should be obvious enough:

sed -i '/^[^,]*_f_[^,_]*,/s/,British Library 1 /,f. /
    s/,British Library 1 /,/
    /^[^,]*_a_[^,_]*,/s/,Windsor Palace A /,a. /
    s/,Windsor Palace A /,/' "${pathToCSV}_final.csv"

In some more detail, the regex says

^       beginning of line
[^,]*   any sequence of characters which are not a comma
_f_     literal characters underscore, f, underscore
[^,_]*  any sequence of characters which are not a comma or an underscore 
,       literal comma

You should be able to see that this will target the last pair of underscores in the first column. It's important to never skip across the first comma, and near the end, not allow any underscores after the ones we specifically target before we finally allow the comma column delimiter.
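One way to convince yourself of this is to feed the regex to grep. In the sketch below only the first two rows come from the question; the last two are invented edge cases (an _f_ earlier in the file name, and an _f_ after the first comma) which should not match.

```shell
regex='^[^,]*_f_[^,_]*,'

# Rows 1-2 are from the question; rows 3-4 are hypothetical edge cases.
rows='abc_deg0014_0003_f_001r.tif,British Library 1 001r
abc_deg0014_0001_a_1.tif,British Library 1 Front Board Outside
abc_f_0001_a_1.tif,British Library 1 Front Board Outside
abc_deg0014_0001_a_1.tif,note_f_x,British Library 1'

# Keep only the rows that the sed address would select.
matched=$(printf '%s\n' "$rows" | grep -e "$regex")

printf '%s\n' "$matched"
```

Only the abc_deg0014_0003_f_001r.tif row is printed: in the third row more underscores follow the _f_ before the comma, and in the fourth the _f_ sits beyond the first comma.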

Finally, also notice how we always use double quotes around variables which contain file names. There are scenarios where you can avoid this but you have to know what you are doing; the easy and straightforward rule of thumb is to always put double quotes around variables. For the full scoop, see When to wrap quotes around a shell variable?
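A small sketch of what the quotes and braces buy you. The path value and the count_args helper are made up for the illustration, and the last assignment assumes the shell is not running with set -u.

```shell
pathToCSV='/tmp/my scans/batch'   # hypothetical path containing a space

# Hypothetical helper: prints how many arguments it received.
count_args() { echo "$#"; }

# Quoted: the whole file name is passed as one argument.
quoted=$(count_args "${pathToCSV}_final.csv")

# Unquoted: the shell splits the value on the space, giving two arguments.
unquoted=$(count_args ${pathToCSV}_final.csv)

# Without braces the shell looks for a variable named pathToCSV_final,
# which is unset here, so only ".csv" is left.
nobraces="$pathToCSV_final.csv"

printf '%s %s %s\n' "$quoted" "$unquoted" "$nobraces"
```

With the value above this prints "1 2 .csv"; only the quoted form with braces hands sed the file name you meant.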

Fantastic! This worked perfectly. I only had to change the single quotes to double quotes (and move the other opening double quote in the CSV's file name to after its own `$pathToCSV`), because I'm using parameter expansion inside (i.e. the `$MSname` I had in my original post), which takes care of the variable string "British Library 1":
```
echo MS name as recorded in TIFF header to be removed:
read MSname
sed -i "" "/^[^,]*_f_[^,_]*,/s/,$MSname /,f. /
    s/,$MSname /,/" $pathToCSV"_final.csv"
```
PS: I also had to add `""` after the `-i` flag because of my shell's syntax (in)compatibility. — Kay Gee, Jun 18 '19 at 17:54
The `_final.csv` doesn't have to be in quotes but the variable just before it very much does. I put all of it inside quotes for convenience; but there is no sane way you could have needed to move *that* part outside the quotes. Switching the single quotes to double makes sense if you want to use a variable inside the quotes; but the proper solution would be to remove all shell loops and build a single `sed` script which contains all the substitutions you want to perform. — tripleee, Jun 18 '19 at 18:09
It works either way (I tried both, i.e. with only the variable inside quotes as you suggested in your comment above and then with the fixed part `_final.csv` inside quotes as I had it originally). It gives me the `undefined label` error when this separation (between the variable and the fixed part) is removed, which makes sense. It seems it's the separation that counts and not where the quotes are inserted so as long as I keep the variable distinct from the fixed part of the CSV title, one way or the other, it's fine. — Kay Gee, Jun 18 '19 at 19:43
That's why I put in the braces around the variable name; without them, `$pathToCSV_final` looks for a variable whose name is `pathToCSV_final`. As long as the value doesn't contain whitespace or shell metacharacters, you can get away without quoting it, but (as explained in more depth in the quoting question I linked to) it is prone to produce errors when you try it on real-world filenames which might contain both, and is usually hard to debug, especially if you are unfamiliar with shell script. — tripleee, Jun 19 '19 at 04:53
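One caveat with interpolating `$MSname` into a double-quoted sed script is that a name containing `/`, `.` or other regex characters changes the meaning of the script. The following is a sketch of one possible way to guard against that, not something from the thread: the name "Bodl. MS 1/2" and the variable escaped are invented for the example, and newlines in the name are not handled.

```shell
MSname='Bodl. MS 1/2'   # hypothetical name; the real script gets it from `read MSname`

# Put a backslash in front of the characters that are special in a basic
# regex, and in front of / because it is the delimiter of the s command.
escaped=$(printf '%s\n' "$MSname" | sed 's|[][\\/.*^$]|\\&|g')

# Double quotes let the shell expand $escaped inside the sed script.
result=$(printf '%s\n' \
    'abc_deg0014_0002_b_000.tif,Bodl. MS 1/2 Front Board Inside' \
    'abc_deg0014_0003_f_001r.tif,Bodl. MS 1/2 001r' |
    sed "/^[^,]*_f_[^,_]*,/s/,$escaped /,f. /
    s/,$escaped /,/")

printf '%s\n' "$result"
```

Here escaped becomes `Bodl\. MS 1\/2`, so the dot and the slash are matched literally instead of being interpreted by sed.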

score 0 · Answer 2 · edited Jun 20 '20 at 09:12

With awk, you can look at the fifth field to see whether it matches "3 digits + 1 letter", then print it with f. in that case, and just remove fields 2, 3 and 4 in the other case. For example:

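awk -F'[, ]' '{
   # Splitting on commas and spaces makes $5 the first word after "British Library 1".
   if ($5 ~ /.?[[:digit:]]{3}[a-z]$/) {
      # Folio rows such as 001r: keep the file name and prefix the folio with "f. ".
      printf("%s,f. %s\n", $1, $5)
   } else {
      # Other rows: keep the file name and the three words of the description.
      printf("%s,%s %s %s\n", $1, $5, $6, $7)
   }
}' test.txt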

On the example you provide, it gives:

abc_deg0014_0001_a_1.tif,Front Board Outside
abc_deg0014_0002_b_000.tif,Front Board Inside
abc_deg0014_0003_f_001r.tif,f. 001r
abc_deg0014_0004_f_001v.tif,f. 001v
abc_deg0014_0267_f_132r.tif,f. 132r
abc_deg0014_0268_f_132v.tif,f. 132v
abc_deg0014_0269_y_999.tif,Back Board Inside
abc_deg0014_0270_z_1.tif,Back Board Outside

answered Jun 18 '19 at 07:56 by xiawi · edited Jun 20 '20 at 09:12 by Community

This has some potentially unfortunate assumptions about a fixed number of space-separated fields. – tripleee Jun 18 '19 at 08:05
Indeed. The alternative was to assume that the string `British Library 1` is fixed and to replace it, as you propose. Your solution can nevertheless be seen as more constrained, since in addition to the number of fields it requires the content to be fixed. It really depends on the actual data and needs. – xiawi Jun 18 '19 at 09:31
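For what it is worth, the two assumptions can be relaxed a little by splitting on the comma only and treating the name as a plain string rather than a regex. This is only a sketch under the assumption that the name is known in advance (here passed in as the awk variable name, which is made up for the example), and that it contains no backslashes, which `awk -v` would interpret.

```shell
MSname='British Library 1'

result=$(printf '%s\n' \
    'abc_deg0014_0001_a_1.tif,British Library 1 Front Board Outside' \
    'abc_deg0014_0003_f_001r.tif,British Library 1 001r' |
    awk -F, -v OFS=, -v name="$MSname " '
        # Strip the name only when field 2 really starts with it
        # (index does a literal comparison, not a regex match).
        index($2, name) == 1 { $2 = substr($2, length(name) + 1) }
        # Folio rows: the last pair of underscores in field 1 encloses "f".
        $1 ~ /_f_[^_]*$/ { $2 = "f. " $2 }
        { print }')

printf '%s\n' "$result"
```

The description can then have any number of words, because nothing depends on how many spaces follow the name.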
