
I am having trouble writing a search engine that treats all inflections of a word as the same basic word.

  1. So for verbs, these are all the same root word, be:
    • number/person (e.g. am; is; are)
    • tense/mood like past or future tense (e.g. was; were; will be)
    • past participles (e.g. has been; had been)
    • present participles and gerunds (e.g. is being; wasn't being funny; being early is less important than being correct)
    • subjunctives (e.g. might be; critical that something be finished; I wish it were)

  2. Then for nouns, both the singular form and the plural form should count as the same basic word [ᴇᴅɪᴛᴏʀ's ɴᴏᴛᴇ: this is frequently referred to as the citation form of the word.]

For example, with “enable”, I don’t want “enables” and “enabled” printed as separate entries. All three of those should count as the same basic word, the verb enable.

I can prevent printing of duplicates using a hash like:

unless ($seenmatches{ $headmatches[$l] }++)
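# The post-increment returns the hash value *before* adding 1, so the
# test is false (0) the first time a word is seen and true afterwards;
# the unless-block therefore runs only for first occurrences.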
  1. Could someone explain this? Explained in comments below.

  2. This doesn’t stop the plural/past forms from getting through. Is there a way to do this, or some wholly different approach, perhaps one involving a regex and/or substitution that is undone later?

I can't modify the word with a substitution, because then it wouldn't print out correctly. Although I'm not at that stage yet, eventually I'd like to include irregular past tenses [ᴇᴅɪᴛᴏʀ's ɴᴏᴛᴇ: and irregular nouns, too?] as well.
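[ᴇᴅɪᴛᴏʀ's ɴᴏᴛᴇ: one way to square this is to key the dedup hash on a normalized form while still printing the original word. A minimal sketch, with a toy stem_of() standing in for the real stemmers suggested in the answers below:]

use strict;
use warnings;

# Toy normalizer for illustration only; a real stemmer (see the answers) replaces this.
sub stem_of {
    my ($w) = @_;
    ( my $s = lc $w ) =~ s/(?:ing|ed|es|e|s)\z//;
    return $s;
}

my @headmatches = qw(enable enables enabled water watered);
my %seenmatches;
for my $word (@headmatches) {
    my $key = stem_of($word);        # key on the normalized form...
    next if $seenmatches{$key}++;    # ...so inflections count as duplicates
    print "$word\n";                 # ...but print the original untouched
}
# prints: enable water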

I'm not sure what else you need to answer my question, so please let me know anything I've unintentionally left out, and I'll fill in the missing bits to make it clearer.

Jon
  • UPDATE: Answer to Q#1: http://stackoverflow.com/questions/7651/how-do-i-remove-duplicate-items-from-an-array-in-perl – Jon May 31 '11 at 17:37
  • **ɴᴏᴛᴀ ʙᴇɴᴇ:** There’s also a Snowball stemmer available for Perl, not just a Porter stemmer. I’ve used both pretty extensively. One place they're both weak is deriving citation forms for irregular noun plurals of classical (*viz.* Latin and Greek) origin. The Latin-derived words are hit and miss, but the Greek ones are almost always miss. The [Lingua::EN::Inflect](http://search.cpan.org/perldoc?Lingua::EN::Inflect) module does a good bit better on these, especially from SG→PL. I’ve a set of ~5k irregular SG→PL nouns from the ᴏᴇᴅ, plus 400+ SG↔PL noun rules. Send me mail and I’ll send these all to you. – tchrist Jun 01 '11 at 16:28
  • @tchrist: That sounds amazing! The current stemmer wasn't accurate enough to use. How would I send you mail? – Jon Jun 01 '11 at 16:41
  • You would send me mail in precisely the same fashion that the unhyperbolically 40,000–75,000 other senders who send me mail every single day use. I don’t believe in cowardly hiding behind fake email addresses, because **as soon as you do that, the terrorists (*nés* spammers) have won.** If I were obfuscatory I might perhaps use an obfuscated address, I suppose, but I just don’t believe in obfusclutterhetoricizationalisms. So just use my usual address and maybe mention it in the subject line so that I notice it. – tchrist Jun 01 '11 at 19:03
  • Also, the Stanford POS parser and tagger have morphology support... just found out – Jon Jun 02 '11 at 17:08

4 Answers


The way a typical search engine works is as follows:

  • The input string is tokenized, chopped up at word boundaries - a character offset start/end is associated with each token
  • Each token is then stemmed - I'd use Lingua::Stem (or, better, Lingua::Stem::Snowball) which are slightly updated versions of the Porter stemmer
  • Each token, with its original character offset start/end is retained and indexed, usually along with a copy of the original text, before it was tokenized. This is basically a table which associates the term text with its originating document (usually as an identifier)

Now, when a query arrives, it too is tokenized and each token stemmed, but we don't care about positions this time. We look up each token against those we have indexed, to locate the postings (matching document identifiers). We can now retrieve the stored start/end offsets to determine where the terms were in the original text.

So, you do lose the suffixes for the index (which is what it used to locate matching documents) but you preserve the original text and offsets for those documents, so you can do query highlighting and nice display stuff should you need to.

Stemming is definitely the right tool for this job. The main trick is to ensure you treat the query and the documents in the same way. You can modify the original document, but really, you want to transform it into something like a back-of-book index, not into a string you use regular expressions on -- if you really are doing search engine stuff, that is. Check out the excellent KinoSearch module on CPAN if you like, or look at the Apache Lucene project it was originally derived from.
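[ᴇᴅɪᴛᴏʀ's ɴᴏᴛᴇ: a compressed sketch of the pipeline described above, using Lingua::Stem::Snowball; the document text, identifiers, and offsets are illustrative:]

use strict;
use warnings;
use Lingua::Stem::Snowball;

my $stemmer = Lingua::Stem::Snowball->new( lang => 'en' );

# Index side: tokenize at word boundaries, remember offsets, stem, record postings.
my %docs  = ( 1 => 'Enabling the feature enables faster searches.' );
my %index;                                  # stem => [ [doc_id, start, end], ... ]
for my $doc_id ( keys %docs ) {
    while ( $docs{$doc_id} =~ /(\w+)/g ) {
        my ( $start, $end ) = ( $-[1], $+[1] );
        my $stem = $stemmer->stem( lc $1 );
        push @{ $index{$stem} }, [ $doc_id, $start, $end ];
    }
}

# Query side: stem the query the same way, then look up its postings.
my $stem = $stemmer->stem( lc 'enabled' );
for my $posting ( @{ $index{$stem} || [] } ) {
    my ( $doc_id, $start, $end ) = @$posting;
    print "doc $doc_id: '", substr( $docs{$doc_id}, $start, $end - $start ),
          "' at $start..$end\n";
}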

Stuart Watt
  • Wow, I wish I knew about KinoSearch and other English modules earlier, I'm almost finished now... I'll see if I can implement Lingua::Stem. Thanks for the great answer. – Jon May 31 '11 at 19:25
  • Okay... I couldn't do it. I'm not able to use the scalar that Lingua::Stem gives back. I'll see if I can modify the script – Jon May 31 '11 at 20:35
  • @morungos Have you modified either Porter or Snowball for all the Latin and especially the Greek words which both are unacceptably weak at? I’ve been using them on the biomedical text in the *PubMed Open Access* corpus. I’ve been quite disappointed: it seems that those two stemmers are only good for newswire text, not scientific text. What we need is a stronger lemmatizer with a richer set of bound morphemes and derivational rules. Are you acquainted with any such work? It needn’t be in Perl, either. But I’ve found nothing I consider adequate. This is so important I feel I must’ve missed it. – tchrist Jun 01 '11 at 16:39
  • 2
    @tchrist No, haven't touched Latin or Greek. My work wasn't in a domain that used stemming much anyway. The Snowball page points to some work on Latin -- http://snowball.tartarus.org/otherapps/schinke/intro.html, but it appears not to have made it into any distributions. For Greek, there are a few references to some Masters work -- http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.99.623&rep=rep1&type=pdf -- even a JavaScript implementation, but you're right, support for these appears universally weak. They would be very useful. I'll ask around some information retrieval folks – Stuart Watt Jun 01 '11 at 21:06
  • @Jon is the scalar result just the returned array reference? If you need a list, you probably need to dereference it e.g., foreach $word (@$stemmed) ... – Stuart Watt Jun 01 '11 at 21:10
  • @morungos ya that was the problem. Thanks a lot, had a bit more learning to do haha – Jon Jun 01 '11 at 21:24
  • Lingua::Stem and Lingua::Stem::Snowball are both useless if you need common words, as neither of them has a list of common English words and their tenses. Neither of them can handle 'is' (Snowball changes it to 'i'), 'be', 'had', 'said', or 'was' (Lingua::Stem changes it to 'wa'). – Phil Goetz May 31 '15 at 13:45
  • @Phil Goetz Yes, definitely -- if you need these words. Even as you suggest, to identify tenses like passives. It usually comes back to what your application is intended to do. Sometimes you just drop common words before you get to stemming. Sometimes you handle exceptions by lookup table (and Lingua::Stem allows this). It'd be reasonable to write a wrapper/proxy around a stemmer to handle this if needed. – Stuart Watt Jun 01 '15 at 18:50
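[ᴇᴅɪᴛᴏʀ's ɴᴏᴛᴇ: two snags from this comment thread in one sketch -- Lingua::Stem's stem() hands back an array reference that must be dereferenced, and add_exceptions() can patch the common words the plain Porter rules mangle. The exception entries are illustrative:]

use strict;
use warnings;
use Lingua::Stem qw(stem add_exceptions);

# Patch a few irregulars the plain rules get wrong ('is' -> 'i', 'was' -> 'wa').
add_exceptions( { 'is' => 'be', 'was' => 'be', 'been' => 'be' } );

# stem() returns an ARRAY REFERENCE, not a list -- dereference it.
my $stemmed = stem(qw(enables enabled is was));
for my $stem (@$stemmed) {
    print "$stem\n";                 # enabl, enabl, be, be
}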

The Text::English module includes a Porter stemmer, which is the usual method of treating different forms of the same word as identical for matching purposes.
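[ᴇᴅɪᴛᴏʀ's ɴᴏᴛᴇ: a minimal sketch of its use, assuming the module's stem() routine:]

use strict;
use warnings;
use Text::English;

# All three inflections collapse to the same Porter stem ('enabl'),
# so they can share a single key in a seen-hash.
my @stems = Text::English::stem(qw(enable enables enabled));
print "@stems\n";                    # enabl enabl enabl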

Wooble
  • Is that the only way? Because it eliminates the suffix, how would I get the suffix back or only use it in the seen hash? – Jon May 31 '11 at 18:24
  • Well, I tried using Text::English; however, there seems to be a small problem. It produces some stems which aren't even close, like able->abl, but when I use the regex matching the stem, water and sediments come up as matching? Probably a problem on my end, since I was relying on 2 separate for loops, so I'll keep trying... – Jon May 31 '11 at 20:38

Check out verbTenseChanger.pl (http://cogcomp.cs.illinois.edu/page/tools_view/1). Here's the readme:

##codes for the various tenses are:
#0 - Base Form
#1 - Past Simple
#2 - Past Participle
#3 - 3rd Person Singular
#4 - Present Participle

##Example use:
##my $newTense = changeVerbForm("see",0,4);
##changes tense from base form to the present participle

I used this (which I guess includes a stemmer) by creating the different forms:

my @changeverbforms = map changeVerbForm( $search_key, 0, $_ ), 1..4;
my @verbforms = grep { $_ ne "" } @changeverbforms;   # keep only the forms the tool could generate

and then looping over @verbforms (wrapped around the entire search-engine Perl code) so that everywhere I had matched $search_key, I also matched $verbform. There were a few extra things to fix, but that's the general implementation (albeit adapted to my specific circumstances).
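[ᴇᴅɪᴛᴏʀ's ɴᴏᴛᴇ: rather than repeating "or $verbform" at every match site, the generated forms can be folded into a single alternation; the values below are illustrative stand-ins for changeVerbForm() output:]

use strict;
use warnings;

my $search_key = 'see';
my @verbforms  = qw(saw seen sees seeing);

# One compiled alternation replaces a chain of "or" matches.
my $pattern = join '|', map quotemeta, $search_key, @verbforms;
while ( my $line = <DATA> ) {
    print $line if $line =~ /\b(?:$pattern)\b/i;
}

__DATA__
She sees the harbour.
Nothing to match here.
They had seen it before.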

For some debugging of the faulty online code, see: https://stackoverflow.com/questions/6459085/need-help-understanding-this-verb-tense-changing-code-please

Jon

I tried Lingua::Stem, Lingua::Stem::Snowball, and WordNet::stem, and they all fail to stem most common words. To get those simple words, you can run this simple stemmer afterwards, which uses WordNet's .exc (exception?) files:

1. Download and install WordNet.
2. export WNHOME='/usr/lib/wnres' (if that is the directory containing the dict directory; that's where Cygwin puts it. You'll need that to install WordNet::QueryData.)
3. cat $WNHOME/dict/*.exc > wordnet.exc  (combine all the .exc files)
4. Make this Perl file:

$ cat > stem.pl
use strict;
use warnings;

# Read in WordNet exception files
my $ExcFile = "wordnet.exc";
my %Stems;
open(my $FILE, '<', $ExcFile) or die "Could not read $ExcFile: $!";
while (my $line = <$FILE>) {
        chomp($line);
        my ($word, $stem) = split(/\s+/, $line);
        $Stems{$word} = $stem;
}
close($FILE);

while (defined(my $in = <>)) {
        chomp($in); $in =~ s/\r$//;
        $in =~ s/^\s+//;
        $in =~ s/\s+$//;
        next if $in eq '';
        my @words = split(/\s+/, $in);
        foreach my $w (@words) {
                $w = $Stems{$w} if $Stems{$w};
        }
        print "@words\n";
}
<ctrl-D>

Then you can stem foo.txt with

perl stem.pl < foo.txt
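[ᴇᴅɪᴛᴏʀ's ɴᴏᴛᴇ: for example, since the WordNet .exc files map irregular forms such as geese→goose and were→be:]

$ echo "the geese were seen" | perl stem.pl
the goose be see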

You may want to run the other stemmers before rather than after this step, because if they're smart and use word context to stem (though I doubt they do), they'll need the full unstemmed line to work with, whereas stem.pl works word-by-word.

Phil Goetz