Search duplicates in a column, add value

Question

Convert file input.csv.

id,location_id,organization_id,service_id,name,title,email,department
36,,,22,Joe Smith,third-party,john.smith@example.org,third-party Applications
18,11,,,Dave Genesy,Head of office,,
14,9,,,David Genesy,Library Director,,
22,14,,,Andres Espinoza, Manager Commanding Officer,,

(Done!) Need to update column name. Name format: first letter of name/surname uppercase and all other letters lowercase.
(Done!) Need to update column email with domain @abc.Email format: first letter from name and full surname, lowercase
(Not done) Emails with the same ID should contain numbers. Example: Name Max Houston, email mhouston1@examples.com etc.

#!/bin/bash
inputfile="accounts.csv"
echo "id,location_id,organization_id,service_id,name,title,email,department" > accounts_new.csv
while IFS="," read -r rec_column1 rec_column2 rec_column3 rec_column4 rec_column5 rec_column6 rec_column7 rec_column8
do
   surnametemp="${rec_column5:0:1}$(echo $rec_column5 | awk '{print $2}')"
   namesurname=$(echo $rec_column5 | sed 's! .!\U&!g')
   echo $rec_column1","$rec_column2","$rec_column3","$rec_column4","$namesurname","$rec_column6",""${surnametemp,,}@abc.com"","$rec_column8 >>accounts_new.csv
done < <(tail -n +2 $inputfile)

How can do that?

Outputfile

id,location_id,organization_id,service_id,name,title,email,department
14,9,,,Dave Genesy,Library Director,dgenesy@abc.com,
14,9,,,David Genesy,Library Director,dgenesy2@abc.com,
15,9,,,maria Kramer,Library Divisions Manager,mkramer@abc.com,
26,18,,,Sharon Petersen,Administrator,spetersen@abc.com,
27,19,,,Shen Petersen,Administrator,spetersen2@abc.com,

Welcome to SO and special thanks for adding your efforts in your question, keep it up. Could you please also post sample of expected output in your question(not in comments please) for more clarity of question. — RavinderSingh13, Oct 29 '20 at 05:06
Related: [What's the most robust way to efficiently parse CSV using awk?](https://stackoverflow.com/q/45420535/8344060) — kvantour, Oct 29 '20 at 09:14

Raman Sailopal · Answer 1 · 2020-10-30T08:57:43.780

0

With the input (mailcsv) file as:

id,location_id,organization_id,service_id,name,title,email,department
14,9,,,Dave Genesy,Library Director,dgenesy@abc.com,
14,9,,,David Genesy,Library Director,dgenesy@abc.com,
15,9,,,maria Kramer,Library Divisions Manager,mkramer@abc.com,
26,18,,,Sharon Petersen,Administrator,spetersen@abc.com,
27,19,,,Shen Petersen,Administrator,spetersen2@abc.com,

You can use awk and so:

awk -F, ' NR>1 { mails[$7]+=1;if ( mails[$7] > 1 ) { OFS=",";split($7,mail1,"@");$7=mail1[1]mails[$7]"@"mail1[2] } else { $0=$0 } }1' mailscsv

Set the field delimiter to , and then create an array keyed by email address. Increment the index every time the email address is encountered. If there is more than one occurrence of the address, split the address into another array mail1 based on "@". Set $7 to the first index of the array mail1 (email address before @) followed by the value of mails index for the email address, then "@" and the second index of mail1 (the section after @) If there is only one occurrence of the email address simple set the whole line as is. Use 1 to print the line.

edited Oct 30 '20 at 08:57

answered Oct 29 '20 at 09:01

Raman Sailopal

12,320
2
11
18

There are not commas in the string where add index. – meatball Oct 29 '20 at 16:48
There are not commas in the array where add index. 15,9,,,maria Kramer,Library Divisions Manager,mkramer@abc.com, 16 10 Dave Genesy tester dgenesy2@abc.com 17 10 Maria Kramer Library Division Manager mkramer2@abc.com 18 11 Dave Genesy Head of office dgenesy3@abc.com 19,11,,,Elizabeth Meeks,Branch Manager,emeeks@abc.com, – meatball Oct 29 '20 at 16:50
`15,9,,,maria Kramer,Library Divisions Manager,mkramer@abc.com, 16 10 Dave Genesy tester dgenesy2@abc.com 17 10 Maria Kramer Library Division Manager mkramer2@abc.com 18 11 Dave Genesy Head of office dgenesy3@abc.com 19,11,,,Elizabeth Meeks,Branch Manager,emeeks@abc.com, ` – meatball Oct 29 '20 at 16:51
I had to add the output field separator (OFS) as ; I have amended the answer – Raman Sailopal Oct 30 '20 at 08:58

Marcin Piechota · Answer 2 · 2020-10-29T10:32:17.633

Task specification

This task would be much easier if specified otherwise:

add email iterator to every email or
add email iterator to second,third... occurrence

But it was specified: add email iterator to every email if email is used multiple times.

This specification requires double iteration through lines, thus making this task more difficult.

The right tool

My rule of thumb is: use pure bash tools (grep, sed, etc) for simple tasks, use awk for moderate tasks and python for complicated tasks. In this case (double iteration over lines) I would use python. However, there was not python tag in problem specification, so I used awk.

Solution

<accounts.csv \
gawk -vFPAT='[^,]*|"[^"]*"' \
'
BEGIN { 
  OFS = "," 
}; 
{ 
  if ($7 == "") { 
    split($5,name," "); 
    firstname = substr(tolower(name[1]),1,1); 
    lastname = tolower(name[2]); 
    domain="@abc.com"; 
    $7=firstname "." lastname domain;
  }; 
  emailcounts[$7]++; 
  immutables[++iter]=$1","$2","$3","$4","$5","$6","$8; 
  emails[iter]=$7;
} 
END { 
  for (iter in immutables) {
    if (emailcounts[emails[iter]] > 1) {
      emailiter[emails[iter]]++; 
      email=gensub(/@/, emailiter[emails[iter]]"@", "g", emails[iter]);
    } else {
      email=emails[iter]
    }; 
    print immutables[iter], email
  }
}'

Results

id,location_id,organization_id,service_id,name,title,department,email
36,,,22,Joe Smith,third-party,third-party Applications,john.smith@example.org
18,11,,,Dave Genesy,Head of office,,d.genesy1@abc.com
14,9,,,David Genesy,Library Director,,d.genesy2@abc.com
22,14,,,Andres Espinoza,"Manager, Commanding Officer",,a.espinoza@abc.com

Explanation

-vFPAT='[^,]*|"[^"]*"' read csv
$7=firstname "." lastname domain;} substitute email field
emailcounts[$7]++ count email occurences
iter iterator to preserve order
immutables[++iter]=$1","$2","$3","$4","$5","$6","$8 save non email fields for second loop
emails[iter]=$7 save email for second loop
for (iter in immutables) iterate over keys in immutables dictionary
{if (emailcounts[emails[iter]] > 1) change email if more than 1 occurence
emailiter[emails[iter]]++ increase email iterator
email=gensub(/@/, emailiter[emails[iter]]"@", "g", emails[iter]) add iterator to email
print immutables[iter], email print

There is a useless use of cat, in the above solution. Otherwise .. very well explained. — kvantour, Oct 29 '20 at 09:11
@kvantour agreed in terms of efficiency. It is even worse than useless. However, it is maybe my personal choice to see what is the file I am parsing at the beginning. — Marcin Piechota, Oct 29 '20 at 09:18