0

How do I match the corpus file with second column in stem and return the first column?

corpus.txt
this
is
broken
testing
as
told

Only the fist 2 columns are important in this file:

stem.csv

"test";"tested";"test";"Suffix";"A";"7673";"321: 0 xxx"
"test";"testing";"test";"Suffix";"A";"7673";"322: 0 xxx"
"test";"tests";"test";"Suffix";"b";"5942";"001: 0 xxx"
"break";"broke";"break";"Suffix";"b";"5942";"002: 0 xxx"
"break";"broken";"break";"Suffix";"b";"5942";"003: 0 xxx"
"break";"breaks";"break";"Suffix";"c";"5778";"001: 0 xxx"
"tell";"told";"tell";"Suffix";"c";"5778";"002: 0 xx"

If the word is missing in the stem file, it should be replaced with XXX

expected.txt

XXX
XXX
break
test
XXX
tell

It can be done using SQL queries like this...

CREATE TABLE `stem` (
  `column1` varchar(100) DEFAULT NULL,
  `column2` varchar(100) DEFAULT NULL
) ;

INSERT INTO `stem` VALUES ('break','broken'),('break','breaks'),('test','tests');

CREATE TABLE `corpus` (
  `column1` varchar(100) DEFAULT NULL
) 

INSERT INTO `corpus` VALUES ('tests'),('xyz');
_____

    mysql> select ifnull(b.column1, 'XXX') as result from corpus as a left join stem as b on a.column1 = b.column2;
    +--------+
    | result |
    +--------+
    | test   |
    | XXX    |
    +--------+

But I am looking for a way to process text files directly so that I do not need to import them in mysql.

shantanuo
  • 31,689
  • 78
  • 245
  • 403
  • Are you sure that expected output is valid? – James Brown Apr 11 '21 at 08:34
  • Yes. The words are stemmed using the sql like query. – shantanuo Apr 11 '21 at 08:36
  • The word "broken" from corpus file is matched against the second column in stem file. If the match is found return the first column i.e. "break" or else "XXX". Apply this to each word in corpus file. – shantanuo Apr 11 '21 at 08:41
  • See [whats-the-most-robust-way-to-efficiently-parse-csv-using-awk](https://stackoverflow.com/questions/45420535/whats-the-most-robust-way-to-efficiently-parse-csv-using-awk) – Ed Morton Apr 11 '21 at 13:00

1 Answers1

3

Using awk:

$ awk -F';' '          # delimiter
NR==FNR {             # process the stem file
    gsub(/"/,"")      # off with the double quotes
    a[$2]=$1          # hash
    next
}
{
    if($1 in a)       # if corpus entry found in stem
        print a[$1]   # output
    else 
        print "XXX"
}' stem corpus 

Output:

XXX
XXX
break
test
XXX
tell
James Brown
  • 36,089
  • 7
  • 43
  • 59