Match a column and return another column, like an SQL join

Question

How do I match each word in the corpus file against the second column of the stem file and return the first column?

corpus.txt
this
is
broken
testing
as
told

Only the first two columns are important in this file:

stem.csv

"test";"tested";"test";"Suffix";"A";"7673";"321: 0 xxx"
"test";"testing";"test";"Suffix";"A";"7673";"322: 0 xxx"
"test";"tests";"test";"Suffix";"b";"5942";"001: 0 xxx"
"break";"broke";"break";"Suffix";"b";"5942";"002: 0 xxx"
"break";"broken";"break";"Suffix";"b";"5942";"003: 0 xxx"
"break";"breaks";"break";"Suffix";"c";"5778";"001: 0 xxx"
"tell";"told";"tell";"Suffix";"c";"5778";"002: 0 xx"

If a word is missing from the stem file, it should be replaced with XXX.

expected.txt

XXX
XXX
break
test
XXX
tell

It can be done with SQL queries like this:

CREATE TABLE `stem` (
  `column1` varchar(100) DEFAULT NULL,
  `column2` varchar(100) DEFAULT NULL
) ;

INSERT INTO `stem` VALUES ('break','broken'),('break','breaks'),('test','tests');

CREATE TABLE `corpus` (
  `column1` varchar(100) DEFAULT NULL
) ;

INSERT INTO `corpus` VALUES ('tests'),('xyz');
_____

    mysql> select ifnull(b.column1, 'XXX') as result from corpus as a left join stem as b on a.column1 = b.column2;
    +--------+
    | result |
    +--------+
    | test   |
    | XXX    |
    +--------+

But I am looking for a way to process the text files directly, so that I do not need to import them into MySQL.

The word "broken" from corpus file is matched against the second column in stem file. If the match is found return the first column i.e. "break" or else "XXX". Apply this to each word in corpus file. — shantanuo, Apr 11 '21 at 08:41
See [whats-the-most-robust-way-to-efficiently-parse-csv-using-awk](https://stackoverflow.com/questions/45420535/whats-the-most-robust-way-to-efficiently-parse-csv-using-awk) — Ed Morton, Apr 11 '21 at 13:00

James Brown · Accepted Answer · 2021-04-11T10:02:15.480


Using awk:

$ awk -F';' '          # delimiter
NR==FNR {             # process the stem file
    gsub(/"/,"")      # off with the double quotes
    a[$2]=$1          # hash: second column (word) -> first column (stem)
    next
}
{
    if($1 in a)       # if corpus entry found in stem
        print a[$1]   # output
    else
        print "XXX"
}' stem.csv corpus.txt

Output:

XXX
XXX
break
test
XXX
tell

edited Apr 11 '21 at 10:02

answered Apr 11 '21 at 09:32


Hadn't had my morning coffee first, hence the output question. I was thinking it the other way around. – James Brown Apr 11 '21 at 09:33

Thanks. This is correct. Is it possible to display the original value as well? I would like to see the word behind XXX. – shantanuo Apr 11 '21 at 10:05

Yes, just replace `print "XXX"` with `print "XXX", $1`. – James Brown Apr 11 '21 at 10:08
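
For reference, the variant from the comment above can be tried end to end with a small script. This is only a sketch: the temporary directory and the `lookup` function name are assumptions made for illustration, and the sample data is a shortened copy of the files in the question.

```shell
#!/bin/sh
# Sketch: same lookup as the answer, but unmatched words are printed after XXX.
# The temp directory and the lookup function name are illustrative assumptions.
tmp=$(mktemp -d)

# Shortened copy of stem.csv from the question
cat > "$tmp/stem.csv" <<'EOF'
"test";"testing";"test";"Suffix";"A";"7673";"322: 0 xxx"
"break";"broken";"break";"Suffix";"b";"5942";"003: 0 xxx"
"tell";"told";"tell";"Suffix";"c";"5778";"002: 0 xx"
EOF

# corpus.txt from the question, one word per line
printf '%s\n' this is broken testing as told > "$tmp/corpus.txt"

lookup() {
    # $1 = stem file, $2 = corpus file
    awk -F';' '
    NR==FNR {                 # first file: build the word -> stem map
        gsub(/"/, "")         # strip the double quotes
        a[$2] = $1
        next
    }
    {
        if ($1 in a)
            print a[$1]       # stem found
        else
            print "XXX", $1   # no stem: show the original word too
    }' "$1" "$2"
}

result=$(lookup "$tmp/stem.csv" "$tmp/corpus.txt")
printf '%s\n' "$result"
rm -rf "$tmp"
```

With this sample data the script prints `XXX this`, `XXX is`, `break`, `test`, `XXX as` and `tell`, one per line. The comma in `print "XXX", $1` joins the two values with awk's output field separator, which is a single space by default.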
