
I have two files.

file1.txt contains:

META GAIN CORP
GG$
ABG$
PEPRA_UAT
12GHR
CC$
USDP_MAIN
XQ$
PR$
MIX_DEV

and file2.csv contains:

\\fr.usdp.org\SOLE\Home\RD,Mailbox.FRmeshare@usdp.org
\\fr.usdp.org\SOLE\SHARES\FR\USDP WATER\99 FLOOR,Jay.Pau@usdpwater.org
\\fr.usdp.org\SOLE\SHARES\FR\USDP WATER\44 FLOOR,Jay.Pau@usdpwater.org
\\fr.usdp.org\SOLE\SHARES\FR\USDP WATER\META GAIN CORP,Mary.White@usdpwater.org
\\fr.usdp.org\SOLE\SHARES\FR\USDP WATER\META GAIN CORP,Sed.Rasonn@usdpwater.org
\\fr.usdp.org\SOLE\SHARES\FR\USDP WATER\META GAIN CORP,Farah.Karlus@usdpwater.org
\\fr.usdp.org\SOLE\SHARES\FR\USDP WATER\META GAIN CORP,Mer.Sus@usdpwater.org
\\fr.usdp.org\SOLE\Shares\FR\USDP WATER\ABG$,Geboi.torm@usdpwater.org
\\fr.usdp.org\SOLE\Shares\FR\USDP WATER\ABG$,Geboi.torm@usdpwater.org
\\fr.usdp.org\SOLE\Shares\FR\USDP WATER\ABG$,Josua.Durant@usdpwater.org
\\fr.usdp.org\SOLE\SHARES\FR\USDP WATER\HHR DATABASE,Geboi.torm@usdpwater.org
\\fr.usdp.org\SOLE\SHARES\FR\USDP WATER\HHR DB2 EDU,Geboi.torm@usdpwater.org
\\fr.usdp.org\SOLE\SHARES\FR\USDP WATER\HHR DB2 EDU,Alex.Gold@usdp.org
\\fr.usdp.org\SOLE\SHARES\FR\USDP WATER\NICE SHORT,Leni.Braft@usdpwater.org
\\fr.usdp.org\SOLE\SHARES\FR\USDP WATER\PRO DEV,Kath.wetfield@usdpwater.org
\\fr.usdp.org\SOLE\SHARES\FR\USDP WATER\DUK 20154 USER,
\\fr.usdp.org\SOLE\SHARES\FR\USDP WATER\DUK 20154 USER,Carlo.Gomez@usdpwater.org
\\fr.usdp.org\SOLE\SHARES\FR\USDP WATER\FARE GRUST,Jason.Desanre@usdpwater.org
\\fr.usdp.org\SOLE\SHARES\FR\USDP WATER\XYZ GROUP,Aaron.Lee@usdpwater.org
\\fr.usdp.org\SOLE\SHARES\FR\USDP WATER\XYZ TEAM TOOLKIT,Aaron.Lee@usdpwater.org
\\fr.usdp.org\SOLE\SHARES\FR\USDP WATER\BILLING ELEMENT,Matheo.Logan@usdpwater.org
\\fr.usdp.org\SOLE\SHARES\FR\USDP WATER\RRT_SEC,John.Tian@usdpwater.org

I had this in my script, but I can't correctly get the last column when the share name contains spaces.

for sr in `cat file1.txt`; do
  sname=`echo ${sr} | awk -F: '{ print $1 }'`
  emdrs=`grep -Fw "${sname}" file2.csv | awk -F',' '{print $2}' | sed 's/[[:space:]]//' | xargs | sed -e 's/ /,/g'`
  echo "$sname || To: $emdrs" >> details.txt
done

details.txt output:

META || Mary.White@usdpwater.org,Sed.Rasonn@usdpwater.org,Farah.Karlus@usdpwater.org,Mer.Sus@usdpwater.org
GAIN || Mary.White@usdpwater.org,Sed.Rasonn@usdpwater.org,Farah.Karlus@usdpwater.org,Mer.Sus@usdpwater.org
CORP || Mary.White@usdpwater.org,Sed.Rasonn@usdpwater.org,Farah.Karlus@usdpwater.org,Mer.Sus@usdpwater.org
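The three-way split happens because the unquoted command substitution in `for sr in `cat file1.txt`` is word-split on whitespace by the shell; a minimal reproduction (throwaway file name assumed):

```shell
# A name file where one entry contains spaces:
printf 'META GAIN CORP\nABG$\n' > /tmp/names.txt

# The unquoted `cat` output undergoes word splitting, so each
# whitespace-separated word becomes its own loop iteration:
for sr in `cat /tmp/names.txt`; do echo "got: $sr"; done
# got: META
# got: GAIN
# got: CORP
# got: ABG$
```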

but what I wanted is this:

META GAIN CORP || To: Mary.White@usdpwater.org,Sed.Rasonn@usdpwater.org,Farah.Karlus@usdpwater.org,Mer.Sus@usdpwater.org

I should also be able to search for strings containing $ (like ABG$), without including duplicate emails:

ABG$ || To: Geboi.torm@usdpwater.org,Josua.Durant@usdpwater.org

Any help will be greatly appreciated.

gafm

3 Answers


Something like this?

while read -r sr; do
  emails="$(grep -F "\\${sr}," file2.csv | cut -d',' -f2 | sort -u | tr -d '\r' | paste -sd',')"
  if [ -n "$emails" ]; then
    echo "$sr || To: $emails"
  fi
done < file1.txt

Some explanations:

  • grep -F - treat the pattern ($sr) as a fixed string rather than a regular expression, so the $ in names like ABG$ isn't interpreted as an end-of-line anchor
  • cut -d',' -f2 - Cut the result at the comma and only output the 2nd part
  • sort -u - remove duplicates
  • tr -d '\r' - remove carriage returns
  • paste -sd',' - join lines with comma
  • if [ -n "$emails" ] - only output if $emails is not empty
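The fixed-strings point is easy to see in isolation; a throwaway demo (file name and addresses are made up):

```shell
# One record containing the literal share name "ABG$":
printf '%s\n' '\\srv\share\ABG$,a@x.org' > /tmp/f2demo.csv

# As a regex, "$" anchors to end-of-line, so the literal "ABG$" is never found:
grep -c 'ABG$' /tmp/f2demo.csv     # prints 0

# With -F the pattern is a fixed string and matches literally:
grep -cF 'ABG$' /tmp/f2demo.csv    # prints 1
```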
mhutter
  • thanks @mhutter. it's not working on the actual .csv file i'm using BUT when i copy it and create a new file (vi newfile.csv) it is working. any idea why? – gafm Feb 15 '22 at 16:51
  • I'm gonna take a wild guess and assume the original file contains window line endings (CRLF instead of just LF). I amended the answer to account for that. – mhutter Feb 15 '22 at 23:03
  • Thank you @mhutter. Works like a charm. Much appreciated :) – gafm Feb 16 '22 at 12:48

One awk idea (replaces OP's current for loop):

awk -F',|\\\\' '                                        # field delimiter of "," or "\"
FNR==NR { srlist[$1]
          next
        }
        { email=$NF
          if (email == "") next
          sr=$(NF-1)

          if (sr in srlist && emlist[sr] !~ email) {    # skip duplicate email addresses
                delim=(emlist[sr]) ? "," : ""
                emlist[sr]=emlist[sr] delim email
             }
        }
END     { for (sr in emlist)
              print sr " || To: " emlist[sr]
        }
' file1.txt file2.csv

This generates:

ABG$ || To: Geboi.torm@usdpwater.org,Josua.Durant@usdpwater.org
META GAIN CORP || To: Mary.White@usdpwater.org,Sed.Rasonn@usdpwater.org,Farah.Karlus@usdpwater.org,Mer.Sus@usdpwater.org

NOTES:

  • while a bit more typing than OP's current for loop, this approach requires a single scan of file2.csv and eliminates the 7 subprocess calls made on each pass through OP's for loop
  • for any appreciable volume of data an awk solution should be noticeably faster
  • for the sample data provided:
    • 0.65 secs: awk
    • 1.80 secs: bash/for-loop
markp-fuso
  • `emlist[sr] !~ email` would fail for various email addresses due to partial matches and regexp metachars like `.` matching. Instead of a regexp comparison with no anchors you need a string comparison with anchors (or a regexp one with anchors if you escape all possible regexp metachars). – Ed Morton Feb 15 '22 at 22:40
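A quick illustration of both pitfalls from that comment, with hypothetical addresses:

```shell
# Partial match: "ant@x.org" is a substring of the stored address, so an
# unanchored regexp comparison wrongly reports it as already present:
awk 'BEGIN { list = "Josua.Durant@x.org"
             print ((list ~ "ant@x.org") ? "already present (wrong)" : "new") }'
# prints: already present (wrong)

# Metacharacter: the "." in the candidate matches any character, so it
# "matches" an address that is actually different:
awk 'BEGIN { print (("aXb@x.org" ~ "a.b@x.org") ? "match (wrong)" : "no match") }'
# prints: match (wrong)
```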

A shell loop is never the right approach for manipulating text, see why-is-using-a-shell-loop-to-process-text-considered-bad-practice.

Using GNU awk for arrays of arrays:

$ cat tst.awk
BEGIN { FS="[\\\\,]" }
NR == FNR {
    tgts[$0]
    next
}
{
    sr = $(NF-1)
    email = $NF
}
(sr in tgts) && (email != "") {
    emails[sr][email]
}
END {
    for ( sr in emails ) {
        printf "%s || To:", sr
        sep = " "
        for ( email in emails[sr] ) {
            printf "%s%s", sep, email
            sep = ","
        }
        print ""
    }
}

$ awk -f tst.awk file1.txt file2.csv
ABG$ || To: Geboi.torm@usdpwater.org,Josua.Durant@usdpwater.org
META GAIN CORP || To: Mary.White@usdpwater.org,Farah.Karlus@usdpwater.org,Mer.Sus@usdpwater.org,Sed.Rasonn@usdpwater.org
Ed Morton
  • It's not working as well if i'm using the original file. I guess it's because the original file contains window line endings (CRLF instead of just LF) like what @mhutter mentioned. – gafm Feb 17 '22 at 18:21
  • @gafm then handle them in any of the usual ways, e.g. add `{sub(/\r$/,"")}` as the first line of the awk script or see https://stackoverflow.com/questions/45772525/why-does-my-tool-output-overwrite-itself-and-how-do-i-fix-it for other suggestions. – Ed Morton Feb 17 '22 at 18:42
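A minimal sketch of that suggestion (throwaway /tmp file names and an assumed address; input written with CRLF endings):

```shell
# Input files written with Windows-style CRLF line endings:
printf 'ABG$\r\n' > /tmp/f1.txt
printf '\\\\srv\\share\\ABG$,a@x.org\r\n' > /tmp/f2.csv

# {sub(/\r$/,"")} strips the trailing \r from every record (and re-splits
# the fields) before any other rule runs:
awk 'BEGIN { FS = "[\\\\,]" }
     { sub(/\r$/, "") }
     NR == FNR { tgts[$0]; next }
     ($(NF-1) in tgts) && ($NF != "") { print $(NF-1) " || To: " $NF }
' /tmp/f1.txt /tmp/f2.csv
# prints: ABG$ || To: a@x.org
```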