Using Bash, create user logins from the CSV file below. If a name is duplicated, add the location to the login; the location should be added to the first occurrence of the name as well as to all of its duplicates.

    id,location,name,login
    1,KP,Lacie,
    2,US,Pamella,
    3,CY,Korrie,
    4,NI,Korrie,
    5,BT,Queenie,
    6,AW,Donnie,
    7,GP,Pamella,
    8,KP,Pamella,
    9,LC,Pamella,
    10,GM,Ericka,

The result should look like this:

    id,location,name,login
    1,KP,Lacie,lacie@mail.com
    2,US,Pamella,uspamella@mail.com
    3,CY,Korrie,cykorrie@mail.com
    4,NI,Korrie,nikorrie@mail.com
    5,BT,Queenie,queenie@mail.com
    6,AW,Donnie,donnie@mail.com
    7,GP,Pamella,gppamella@mail.com
    8,KP,Pamella,kppamella@mail.com
    9,LC,Pamella,lcpamella@mail.com
    10,GM,Ericka,ericka@mail.com

I used awk to process the CSV file:

    cat data.csv | awk 'BEGIN {FS=OFS=","};
    NR > 1 {
        split($3, name)
        $4 = tolower($3)
        split($4, login)
        for (k in login) {
            # first sighting keeps a plain login, repeats get the location prefix
            !a[login[k]]++ ? sub(login[k], login[k]"@mail.com", $4) : sub(login[k], tolower($2)login[k]"@mail.com", $4)
        }
    }; 1' > data_new.csv

The script adds the location prefix only to the second and later occurrences of a duplicated name; the first occurrence keeps a plain login.

    id,location,name,login
    1,KP,Lacie,lacie@mail.com
    2,US,Pamella,pamella@mail.com
    3,CY,Korrie,korrie@mail.com
    4,NI,Korrie,nikorrie@mail.com
    5,BT,Queenie,queenie@mail.com
    6,AW,Donnie,donnie@mail.com
    7,GP,Pamella,gppamella@mail.com
    8,KP,Pamella,kppamella@mail.com
    9,LC,Pamella,lcpamella@mail.com
    10,GM,Ericka,ericka@mail.com

How do I add the location to the first occurrence as well?

  • The output which you get according to your post looks exactly like the desired result, or am I missing something? – user1934428 Nov 03 '22 at 10:59
  • Look at the 2nd and 3rd rows. They should also contain the location. – Dmitry Strunewsky Nov 03 '22 at 11:08
  • Ah, got it! The point is that you can generate any output only after **all** lines have been processed, because only then do you know which names are duplicates and which aren't. Therefore in your `NR > 1` block, you can only put the current line into an array and keep, for each name, a count of the occurrences. Then you need an `END` block which traverses the array and outputs the mail information based on that count. By and large, I don't see much advantage in doing this in awk, since most of the logic would be in the `END` block. Don't you want to do this in a language like Ruby or Perl? – user1934428 Nov 03 '22 at 11:19
  • That's the problem: I need to use Bash. I already solved it with PowerShell. – Dmitry Strunewsky Nov 03 '22 at 11:27
  • I don't quite get it: You are **not** trying to solve it in bash in your post (bash would be possible, since bash has associative arrays), but in awk. Why is awk as a language accepted, but Perl (for instance) not? Both are available on virtually any installation. Or otherwise, why don't you then drop awk as well, and write it completely in bash? – user1934428 Nov 03 '22 at 11:51
  • I'm sorry, now I didn't get you. The AWK tool is optional. I used AWK because I couldn't find any other solution. – Dmitry Strunewsky Nov 03 '22 at 12:04
  • The awk language and the bash language are OK, but Perl is not? Well, then implement the algorithm I outlined above in pure bash, if you prefer it over awk. You just need two arrays: an indexed one holding the lines of the input file, and an associative one holding the counts of the names. The algorithm should be pretty straightforward. If your problem is **finding** a suitable algorithm rather than **implementing** it in a certain language, please say so explicitly and tag your question with _algorithm_, to make it clear. – user1934428 Nov 03 '22 at 12:15
  • Thanks for your idea. Will try to solve this with two arrays; see the sketch after these comments. – Dmitry Strunewsky Nov 03 '22 at 12:23
  • I can't believe you have so many users whose parents could not spell "Pamela". – tripleee Nov 03 '22 at 13:12
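
For reference, here is a minimal pure-bash sketch of the two-array algorithm user1934428 outlines above. It assumes bash 4+ (for associative arrays); the variable and file names are illustrative, not prescribed:

    #!/usr/bin/env bash
    # Sketch only: two arrays, as suggested in the comments. Requires bash 4+
    # for associative arrays; variable and file names are illustrative.

    declare -a rows       # indexed array: the data rows, in input order
    declare -A count      # associative array: how often each name occurs

    # First pass: remember every row and count each name.
    {
        IFS= read -r header
        while IFS=, read -r id location name _; do
            rows+=("$id,$location,$name")
            count[$name]=$(( ${count[$name]:-0} + 1 ))
        done
    } < data.csv

    # Second pass: emit the logins; names seen more than once get the
    # location prefix, including their first occurrence.
    {
        printf '%s\n' "$header"
        for row in "${rows[@]}"; do
            IFS=, read -r id location name <<< "$row"
            login="${name,,}@mail.com"
            (( count[$name] > 1 )) && login="${location,,}$login"
            printf '%s,%s,%s,%s\n' "$id" "$location" "$name" "$login"
        done
    } > data_new.csv

Because all counting finishes before any output is written, the first occurrence of a duplicated name is prefixed just like the later ones.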

1 Answer

A common solution is to have Awk process the same file twice if you need to know whether there are duplicates down the line.

Notice also that this requires you to avoid the useless use of `cat`: with `cat data.csv | awk ...` the program sees only a single anonymous stream, so there is no way to hand Awk the file name twice.

    awk 'BEGIN {FS=OFS=","};
      NR == FNR { ++seen[$3]; next }   # first pass: count every name
      FNR > 1 { $4 = (seen[$3] > 1 ? tolower($2) : "") tolower($3) "@mail.com" }
      1' data.csv data.csv >data_new.csv

`NR == FNR` is true only while Awk reads the file for the first time; during that pass we simply count the occurrences of each name (`$3`) in the `seen` array.

Then in the second pass, we can just look at the current entry in `seen` to decide whether or not we need to add the location prefix.
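
If reading the input twice is not an option (for example, when the data arrives on a pipe), the single-pass variant suggested in the comments buffers the rows and prints them from an `END` block. A rough sketch, assuming the same file layout:

    awk 'BEGIN {FS=OFS=","}
    NR == 1 { print; next }    # pass the header straight through
    {
        row[NR] = $0           # buffer every data row
        ++count[$3]            # count occurrences of each name
    }
    END {
        for (i = 2; i <= NR; i++) {
            split(row[i], f, FS)
            f[4] = (count[f[3]] > 1 ? tolower(f[2]) : "") tolower(f[3]) "@mail.com"
            print f[1], f[2], f[3], f[4]
        }
    }' data.csv > data_new.csv

The trade-off is memory: the whole file is held in `row[]` until the `END` block runs.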

tripleee