Bash script Find Duplicate Values and modify them

Question

Have a script that parses csv file but it does not work as needed Here is some content from csv file

id,location_id,name,title,email,department
1,1,Susan houston,Director of Services,,
2,1,Christina Gonzalez,Director,,
3,2,Brenda brown,"Director, Second Career Services",,
4,3,Howard Lader,"Manager, Senior Counseling",,
5,4,Kimberly Pesavento,Commercial director,,
6,5,Joe Bloom,Financial Empowerment Programs Program Director,,
7,6,peter Olson,Director,,
8,6,Bart charlow,Executive Director,,
9,7,Bart Charlow,Executive Director,,

#!/bin/bash

#create path to redirect accounts.csv to same directory as accounts_new.csv
path=$(dirname $1)


# Substituted commmas with vertical lines, so sed command works 
awk -F'"' -v OFS='"' '{ for (i=2; i<=NF; i+=2) gsub(",", "|", $i) } 1' accounts.csv |

# Changed first letters of names to uppercase
awk -F, -v col=3 ' 
NR > 1{
  n=split(tolower($col),a," ")
  $col=toupper(substr(a[1],1,1)) substr(a[1],2)
  for(i=2;i<=n;i++) {
    $col=$col " " toupper(substr(a[i],1,1)) substr(a[i],2)
  }
}1' OFS="," |

# Generated email addresses
sed -E 's/([^,]*,([^,]*),) ?(([[:alpha:]])[^ ]* +)(([^,]*),[^,]*,)[^,]*/\1\u\3\u\5\L\4\6\@abc.com/' |

awk -F, '{for (i=5;i<=5;i++){if (v[i,$i]++){b[$i]=$i; 
  $i=split($3,arr," ")
  val=(substr($3,1,1) arr[2]$2"@abc.com")
  $5=tolower(val)
  }};print $0}' OFS="," |

# Added missing commas and sent output to new file
sed -E 's/\|/\,/g' > $path"/accounts_new.csv"

here is the output of the script

id,location_id,name,title,email,department
1,1,Susan Houston,Director of Services,shouston@abc.com,
2,1,Christina Gonzalez,Director,cgonzalez@abc.com,
3,2,Brenda Brown,Director, Second Career Services,bbrown@abc.com,
4,3,Howard Lader,Manager, Senior Counseling,hlader@abc.com,
5,4,Kimberly Pesavento,Commercial director,kpesavento@abc.com,
6,5,Joe Bloom,Financial Empowerment Programs Program Director,jbloom@abc.com,
7,6,Peter Olson,Director,polson@abc.com,
8,6,Bart Charlow,Executive Director,bcharlow@abc.com,
9,7,Bart Charlow,Executive Director,bcharlow7@abc.com,

but desired output is this

id,location_id,name,title,email,department
1,1,Susan Houston,Director of Services,shouston@abc.com,
2,1,Christina Gonzalez,Director,cgonzalez@abc.com,
3,2,Brenda Brown,Director, Second Career Services,bbrown@abc.com,
4,3,Howard Lader,Manager, Senior Counseling,hlader@abc.com,
5,4,Kimberly Pesavento,Commercial director,kpesavento@abc.com,
6,5,Joe Bloom,Financial Empowerment Programs Program Director,jbloom@abc.com,
7,6,Peter Olson,Director,polson@abc.com,
8,6,Bart Charlow,Executive Director,bcharlow6@abc.com,
9,7,Bart Charlow,Executive Director,bcharlow7@abc.com,

As you can see from last 2 rows location_id is appended only the second equal email but the first one remains untouched, how can I add location_id to the first one also?

Can you help me? Thanks!

I tried to use while instead of if statement but it didn't help me

is the `id` column unique within the file? is `location_id` *guaranteed* to be unique if `name` is duplicated? — markp-fuso, Feb 22 '23 at 21:34
Id column is unique within file and yes location_id is unique as well, the problem is there are names that contain the same first letter of first name but the same last name and together they compose the equal emails. — Narek Arakelyan, Feb 22 '23 at 21:39
Don't use `abc.com` as your example domain as that's a real domain so posting those email addresses, which may be fake but may accidentally be real, online here could cause problems at that domain when spammers find those addresses online and start sending to them. Use `example.com` instead. — Ed Morton, Feb 23 '23 at 00:29
What should the email address be if the `name` has more than 2 parts, e.g. `Billy Jean King`? — Ed Morton, Feb 23 '23 at 01:01

markp-fuso · Accepted Answer · 2023-02-23T14:36:13.247

Assumptions/understandings:

some input fields may be wrapped in double quotes; these double quotes are not to be removed (unlike OP's desired output)
since commas show up as both delimiter and data we'll temporarily convert the data instances to another character that does not show up in the data; for this exercise we'll use the pipe character (|), otherwise we'll need to pick another substitution character
all name fields consist of exactly two space-delimited strings (ie, don't have to worry about names like Bob B Burns III nor Lastname, Firstname)
both of the space-delimited strings in the name field are to have just the initial character uppercased (eg, we don't have to worry about names with mixed case like dePalma or McDonald)
if we end up with duplicate email accounts the location_id is guaranteed to be unique
input file can fit into memory (as awk array)

Once awk is pulled into the mix there's rarely a need for sed nor additional awk scripts, ie, we should be able to perform the entire operation with a single awk script.

One awk idea:

awk '
BEGIN { FS="\""; OFS="," }                              # input is delimited by double qutoes
NR==1 { print; next }
      { line=""
        for (i=1;i<NF;i+=2) {                           # loop through odd numbered fields
            gsub(/,/,"|",$(i+1))                        # in even numbered double-quote-delimited fields replace commas with pipes
            line=line $i FS $(i+1) FS                   # rebuild the current line
        }
        line=line $NF                                   # add last field to new line

        split(line,a,",")                               # split new line on commas
        split(tolower(a[3]),b,/[[:space:]]+/)           # split tolower(name field) on white space

        # rebuild name with first characters of first/last names uppercased

        name=toupper(substr(b[1],1,1)) substr(b[1],2) " " toupper(substr(b[2],1,1)) substr(b[2],2)

        acct=substr(b[1],1,1) b[2]                      # build email acct name

        lines[NR]=a[1] OFS a[2] OFS name OFS a[4]       # rebuild current line based on first 4 fields
        locid[NR]=a[2]                                  # make note of location_id for current line
        email[NR]=acct                                  # make note of email acct for current line
        count[acct]++                                   # keep count of number of times we see this email acct
      }

END   { for (i=2;i<=NR;i++) {                           # loop through our lines of output
            gsub(/\|/,",",lines[i])                     # replace pipes with original commas

            # print final line of output; if email acct has been seen more than once then append the location_id to the email acct; add the "@abc.com" domain and the trailing comma

            print lines[i] OFS email[i] (count[email[i]] > 1 ? locid[i] : "") "@abc.com" OFS
        }
      }
' accounts.csv

This generates:

id,location_id,name,title,email,department
1,1,Susan Houston,Director of Services,shouston@abc.com,
2,1,Christina Gonzalez,Director,cgonzalez@abc.com,
3,2,Brenda Brown,"Director, Second Career Services",bbrown@abc.com,
4,3,Howard Lader,"Manager, Senior Counseling",hlader@abc.com,
5,4,Kimberly Pesavento,Commercial director,kpesavento@abc.com,
6,5,Joe Bloom,Financial Empowerment Programs Program Director,jbloom@abc.com,
7,6,Peter Olson,Director,polson@abc.com,
8,6,Bart Charlow,Executive Director,bcharlow6@abc.com,
9,7,Bart Charlow,Executive Director,bcharlow7@abc.com,

Thank you , this worked but with one issue: there are some values in department column that are missing after running the script, can they be retained? — Narek Arakelyan, Feb 23 '23 at 06:52
after `locid[NR]` you could add `dept[NR]=a[6]`, then include `dept[i]` in the final `print`; if you're unable to get this to work I suggest you ask a new question ... making sure to provide youre attempted code change along with sample data that includes `department` data — markp-fuso, Feb 23 '23 at 14:35
I asked new question accepting this as answer, can you help me with that? thank you — Narek Arakelyan, Feb 23 '23 at 15:26

Ed Morton · Answer 2 · 2023-02-23T01:06:33.020

2

A 2-pass approach using any awk:

$ cat tst.awk
BEGIN { FS=OFS="," }
NR == 1 { print }
FNR == 1 { next }
{ lc_name = tolower($3) }
NR == FNR {
    cnt[lc_name]++
    next
}
{
    n = split(lc_name,parts," ")
    for ( i=1; i<=n; i++ ) {
        name = ( i>1 ? name " " : "" ) toupper(substr(parts[i],1,1)) substr(parts[i],2)
        email = ( i>1 ? email parts[i] : substr(parts[i],1,1) )
    }

    $3 = name
    $(NF-1) = email ( cnt[lc_name] > 1 ? $2 : "" ) "@example.com"
    print
}

$ awk -f tst.awk accounts.csv accounts.csv
id,location_id,name,title,email,department
1,1,Susan Houston,Director of Services,shouston@example.com,
2,1,Christina Gonzalez,Director,cgonzalez@example.com,
3,2,Brenda Brown,"Director, Second Career Services",bbrown@example.com,
4,3,Howard Lader,"Manager, Senior Counseling",hlader@example.com,
5,4,Kimberly Pesavento,Commercial director,kpesavento@example.com,
6,5,Joe Bloom,Financial Empowerment Programs Program Director,jbloom@example.com,
7,6,Peter Olson,Director,polson@example.com,
8,6,Bart Charlow,Executive Director,bcharlow6@example.com,
9,7,Bart Charlow,Executive Director,bcharlow7@example.com,

In the above I assumed that a name with more than 2 parts like Billy Jean King should result in an email address like bjeanking@example.com. It's a trivial tweak if you wanted anything else.

I also didn't remove the quotes from your fields as that would corrupt your data but if you really want that then, again, it's a trivial tweak.

edited Feb 23 '23 at 01:06

answered Feb 23 '23 at 00:59

Ed Morton

188,023
17
78
185

Thank you, I tried to integrate the script with bash script of mine so the accounts.csv is an argument to the script but the result is only the first line printed. – Narek Arakelyan Feb 23 '23 at 08:17
Here is what I did #!/bin/bash #create path path=$(dirname $1) awk 'BEGIN { FS=OFS="," } NR == 1 { print } FNR == 1 { next } { lc_name = tolower($3) } NR == FNR { cnt[lc_name]++ next } { n = split(lc_name,parts," ") for ( i=1; i<=n; i++ ) { name = ( i>1 ? name " " : "" ) toupper(substr(parts[i],1,1)) substr(parts[i],2) email = ( i>1 ? email parts[i] : substr(parts[i],1,1) ) } $3 = name $(NF-1) = email ( cnt[lc_name] > 1 ? $2 : "" ) "@example.com" print }' $1 > $path"/accounts_new.csv" – Narek Arakelyan Feb 23 '23 at 08:19
and result is this id,location_id,name,title,email,department – Narek Arakelyan Feb 23 '23 at 08:20
As I said in my answer it's a 2-pass solution so you have to list the input file name twice as I show, i.e. not `awk 'script' "$1"` in your code but `awk 'script' "$1" "$1"`. – Ed Morton Feb 23 '23 at 11:26
Oh sorry, now I got it.. but now problem is that I have names that have the same first letter of first name but different name and same last name for instance Elizabeth Meeks, Erika Meeks that generate the same email address. How can location address added to those addresses too? Thank you! – Narek Arakelyan Feb 23 '23 at 12:04
[Chameleon questions](https://meta.stackexchange.com/questions/43478/exit-strategies-for-chameleon-questions) are strongly discouraged on this forum. Accept an answer to the question you asked and then ask a new question that covers any requirements like that that you forgot to include in your question. – Ed Morton Feb 23 '23 at 12:06
sorry for that, I'm new to here and tried to formulate my question as short as possible... – Narek Arakelyan Feb 23 '23 at 12:55

Bash script Find Duplicate Values and modify them

2 Answers2

Linked

Related