1

I have two lists, one of which contains wildcards (in this case represented by *). I would like to compare the two lists and create an output of those that match, with each wildcard * representing a single character.

For example:

File 1

123456|Jane|Johnson|Pharmacist|janejohnson@gmail.com
09876579|Frank|Roberts|Butcher|frankie1@hotmail.com
092362936|Joe|Jordan|Joiner|joe@joesjoinery.com
928|Bob|Horton|Farmer|bhorton@farmernews.co.uk

File 2

1***6|Jane|Johnson|Pharmacist|janejohnson@gmail.com
09876579|Frank|Roberts|Butcher|f**1@hotmail.com
092362936|Joe|Jordan|J*****|joe@joesjoinery.com
928|Bob|Horton|Farmer|b*****n@f*********.co.uk

Output

092362936|Joe|Jordan|Joiner|joe@joesjoinery.com
928|Bob|Horton|Farmer|bhorton@farmernews.co.uk

Explanation

The first two lines are not matches because the number of *s differs from the number of characters they stand for in the first file (three *s for 2345, two *s for rankie). The last two lines do match, so they are added to the output.

I have tried to reason out ways to do this with awk and with join, but I don't know how to even begin. Any help would be greatly appreciated.

janey1
  • In file 2, can `*` occur in any column or only in the last two? – Inian May 31 '19 at 09:00
  • This is just an example really, but in the file I'm working with it would only occur in the last column. – janey1 May 31 '19 at 09:09
  • Can you modify the question to reflect that it only occurs in the last column? Also add whatever attempts you have made. – Inian May 31 '19 at 09:09
  • Is it necessary to pursue this based on columns? The requirement would be for the entire line to match, not a single column. – janey1 May 31 '19 at 09:15

2 Answers

2
$ cat tst.awk
NR==FNR {
    file1[$0]
    next
}
{
    # Make every non-* char literal (see https://stackoverflow.com/a/29613573/1745001):
    gsub(/[^^*]/,"[&]")  # Convert every char X to [X] except ^ and *
    gsub(/\^/,"\\^")     # Convert every ^ to \^

    # Convert every * to .:
    gsub(/\*/,".")

    # Add line start/end anchors
    $0 = "^" $0 "$"

    # See if the current file2 line matches any line from file1
    # and if so print that line from file1:
    for ( line in file1 ) {
        if ( line ~ $0 ) {
            print line
        }
    }
}

$ awk -f tst.awk file1 file2
092362936|Joe|Jordan|Joiner|joe@joesjoinery.com
928|Bob|Horton|Farmer|bhorton@farmernews.co.uk
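
To see what the three gsub() calls build, it can help to print the generated regex instead of matching with it. The following is only a sketch: the input line a*.c is made up for illustration and is not part of the question's data.

```shell
# Hypothetical one-line input. The awk body repeats the three gsub() steps
# from tst.awk and prints the resulting regex rather than using it.
regex=$(printf '%s\n' 'a*.c' | awk '{
    gsub(/[^^*]/, "[&]")   # a -> [a], . -> [.], c -> [c]; the * is left alone
    gsub(/\^/, "\\^")      # no ^ in this input, so nothing changes here
    gsub(/\*/, ".")        # the wildcard becomes "any single character"
    print "^" $0 "$"       # anchor so the whole line has to match
}')
printf '%s\n' "$regex"     # prints ^[a].[.][c]$
```

Wrapping each character in a bracket expression makes it literal without needing to know which characters are regex metacharacters, so the . in an email address can no longer match an arbitrary character.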
Ed Morton
0
sed 's/\./\\./g; s/\*/./g' file2 | xargs -I{} grep {} file1

Explanation:

I'd take advantage of regular expression matching. To do that, we need to turn every asterisk * into a dot ., which matches any single character in a regular expression. As a side effect of using regular expressions, every character that is special in them has to be escaped so that it is taken literally. The dot in particular must be written as \. to match a literal dot rather than any character.

The first step is to perform these substitutions with sed. The second is to pass every resulting line to grep as a search pattern and search file1 for it. The glue that makes this possible is xargs, where {} is a placeholder for a single line of the sed output.
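
The sed stage can be run on its own to inspect the pattern that is produced. A small sketch, using one line of the question's File 2 as input:

```shell
# Run only the sed stage: escape literal dots first, then turn each * into .
pattern=$(printf '%s\n' '092362936|Joe|Jordan|J*****|joe@joesjoinery.com' |
    sed 's/\./\\./g; s/\*/./g')
printf '%s\n' "$pattern"
# prints 092362936|Joe|Jordan|J.....|joe@joesjoinery\.com
```

The order of the two substitutions matters: if the asterisks were converted first, the dots they produce would then be escaped as well. Also keep in mind that xargs applies its own quote and backslash processing to each input line by default, so the pattern grep finally receives may differ from what sed printed.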

Note:

This is not a general, safe solution that you can simply copy and paste: watch out for any characters in the file containing the asterisks (file2) that grep treats as special in regular expressions.


Update:

jhnc extends the escaping to the characters ., \, ^, $ and [, which accounts for almost all sorts of email addresses. They also avoid xargs by using -f - to pass the sed output to grep as the list of search patterns:

sed 's/[.\\^$[]/\\&/g; s/[*]/./g' file2 | grep -f - file1

This solution is both more general and more efficient (see the comment below).
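
Neither pipeline anchors its patterns, so grep reports a line from file1 whenever a pattern matches anywhere inside it. The question asks for whole-line matches, which grep's standard -x option provides. A minimal sketch, assuming two throwaway files in a temporary directory; the leading 1 in the first line of file1 is invented for illustration, and the other lines come from the question:

```shell
tmp=$(mktemp -d)
# file1: the first line has an extra leading "1" (invented), the second is from the question.
printf '%s\n' '1928|Bob|Horton|Farmer|bhorton@farmernews.co.uk' \
              '092362936|Joe|Jordan|Joiner|joe@joesjoinery.com' > "$tmp/file1"
printf '%s\n' '928|Bob|Horton|Farmer|b*****n@f*********.co.uk' \
              '092362936|Joe|Jordan|J*****|joe@joesjoinery.com' > "$tmp/file2"

# Same sed command as in the update above, wrapped in a helper for reuse.
to_patterns() { sed 's/[.\\^$[]/\\&/g; s/[*]/./g' "$1"; }

loose=$(to_patterns "$tmp/file2" | grep -f - "$tmp/file1")      # substring matches allowed
strict=$(to_patterns "$tmp/file2" | grep -x -f - "$tmp/file1")  # whole line must match
printf 'without -x:\n%s\nwith -x:\n%s\n' "$loose" "$strict"
rm -rf "$tmp"
```

Without -x both lines of file1 are printed, because the first pattern matches from the second character of the 1928 line onward. With -x only the Joe Jordan line is printed.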

simlev
  • `sed 's/[.\\^$[]/\\&/g; s/[*]/./g' file2 | grep -f - file1` (or `-f /dev/stdin` or write sed output to a temporary file if `-f -` is not recognised) – jhnc Jun 02 '19 at 02:35