
I'm having trouble writing a Perl program that matches the words in two documents. Let's say there are documents A and B.

I want to delete the words in document A that are not in document B.

Example 1:

A: I eat pizza

B: She go to the market and eat pizza

result: eat pizza

Example 2:

A: eat pizza

B: pizza eat

result: pizza (word order is relevant, so "eat" is deleted)

I use Perl for the system, and each document contains only a small number of sentences, so I don't think I need SQL.

The program is a subprogram of an automatic essay grading system for the Indonesian language (Bahasa).

Thanks, and sorry if my question is a bit confusing. I'm really new to 'this world' :)

DVK
Randy
  • Do you care about word order at all? E.g. do you care if the result in example 2 will be 2 lines, with words "eat" and "pizza" on separate lines? – DVK May 24 '10 at 01:33
  • No,I've made it so everything is on the same line. – Randy May 24 '10 at 01:41
  • How far have you gotten so far? What part are you having difficulty with? – Ether May 24 '10 at 02:31
  • Correct me if I'm wrong, but it seems that you're using this to determine the extent to which essays match with one another. – Zaid May 24 '10 at 08:15
  • @Zaid - that'd be my guesstimate as well :) – DVK May 24 '10 at 13:40
  • @Ether I think, I've solved this, but if you have any comment, feel free..^^ – Randy May 24 '10 at 13:41
  • @Zaid: Yes and no, ha3. I'll compare a system that uses LSA for document similarity with a system that uses LSA + word order, because as far as I know LSA doesn't care about word order (NLP, or word order like this). So the extent to which the essays match one another is still determined by LSA. – Randy May 24 '10 at 13:42

1 Answer


OK, I can't test code at the moment, so this is not guaranteed to be 100% correct or even to compile, but it should provide enough guidance:

Solution 1: (word order does not matter)

#!/usr/bin/perl

use strict;
use warnings;
use File::Slurp;

# read_file croaks on failure by default, so no "|| die" is needed
# (and "||" would force scalar context, returning the file as one string).
my @B_lines = read_file("B");
my %B_words;
foreach my $line (@B_lines) {
    # split ' ' skips leading whitespace, so no empty "word" is recorded
    $B_words{$_} = 1 for split ' ', $line;
}
my @A_lines = read_file("A");
my @new_lines;
foreach my $line (@A_lines) {
    # keep only the words of this line that occur somewhere in B
    my @B_words_only = grep { $B_words{$_} } split ' ', $line;
    push @new_lines, join(" ", @B_words_only) . "\n";
}
write_file("A_new", @new_lines);

This should create a new file "A_new" that contains only the words from A that are also in B.

This has a slight bug - it will replace any multiple-whitespace in file A with a single space, so

    word1        word2              word3

will become

    word1 word2 word3

It can be fixed, but that would be really annoying to do, so I haven't bothered; I will only if you absolutely require that whitespace be preserved 100% correctly.
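If whitespace does need to survive, one possible approach is to substitute words in place instead of splitting and re-joining. The following is only a rough sketch under that assumption; the helper name `filter_keep_whitespace` is made up for illustration and is not part of the solution above.

```perl
use strict;
use warnings;

# Hypothetical helper: blank out every word of $line that is not a key of
# %$b_words, leaving all whitespace characters of the original line in place.
sub filter_keep_whitespace {
    my ($line, $b_words) = @_;
    (my $out = $line) =~ s/(\S+)/$b_words->{$1} ? $1 : ''/ge;
    return $out;
}

# Usage: word2 is not in B, so it is removed and its surrounding spaces remain
my %b_words = map { $_ => 1 } qw(word1 word3);
print filter_keep_whitespace("word1   word2   word3\n", \%b_words);
```

Note that the whitespace on both sides of a removed word is kept, so the gap where the word used to be gets wider rather than disappearing.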

Solution 2: (word order matters, BUT you can print the words from file A with no regard for preserving whitespace at all)

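#!/usr/bin/perl

use strict;
use warnings;
use File::Slurp;

# Scalar context makes read_file return the whole file as one string;
# it croaks on failure by default. split ' ' ignores leading whitespace.
my @A_words = split ' ', scalar read_file("A");
my @B_words = split ' ', scalar read_file("B");
my $B_counter = 0;
my @kept;
foreach my $word (@A_words) {
    # look for $word in B at or after the current position
    my $i = $B_counter;
    ++$i while $i < @B_words && $B_words[$i] ne $word;
    next if $i == @B_words;    # not found: drop this word, keep our place in B
    push @kept, $word;
    $B_counter = $i + 1;       # later words must match further along in B
}
# Greedy matching: for example 2 this prints "eat" rather than "pizza"
print join(" ", @kept), "\n";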
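Solution 2 matches greedily from left to right, which does not always keep the longest possible run of words in order. If that matters, one option is a longest-common-subsequence (LCS) pass over the two word lists. The sketch below is my own illustration of that standard technique, not something the question asked for by name; `lcs_words` is a hypothetical helper, and the tie-breaking rule (prefer moving left in B) is an arbitrary choice.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: returns the words of one longest common subsequence
# of @$a_words and @$b_words, in their original order.
sub lcs_words {
    my ($a_words, $b_words) = @_;
    my ($n, $m) = (scalar @$a_words, scalar @$b_words);

    # $len[$i][$j] = LCS length of the first $i words of A and first $j of B
    my @len = map { [ (0) x ($m + 1) ] } 0 .. $n;
    for my $i (1 .. $n) {
        for my $j (1 .. $m) {
            if ($a_words->[$i - 1] eq $b_words->[$j - 1]) {
                $len[$i][$j] = $len[$i - 1][$j - 1] + 1;
            }
            else {
                my ($up, $left) = ($len[$i - 1][$j], $len[$i][$j - 1]);
                $len[$i][$j] = $up > $left ? $up : $left;
            }
        }
    }

    # Walk back from the bottom-right corner to recover the matched words
    my @kept;
    my ($i, $j) = ($n, $m);
    while ($i > 0 && $j > 0) {
        if ($a_words->[$i - 1] eq $b_words->[$j - 1]) {
            unshift @kept, $a_words->[$i - 1];
            --$i;
            --$j;
        }
        elsif ($len[$i - 1][$j] > $len[$i][$j - 1]) {
            --$i;
        }
        else {
            --$j;    # on ties, move left in B
        }
    }
    return @kept;
}

# Usage with example 1 from the question
my @a = split ' ', "I eat pizza";
my @b = split ' ', "She go to the market and eat pizza";
print join(" ", lcs_words(\@a, \@b)), "\n";    # prints "eat pizza"
```

With this tie-breaking, example 2 (A: "eat pizza", B: "pizza eat") yields "pizza". The table needs memory proportional to the product of the two word counts, which should be fine for short essays.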

Solution 3 (why do we need Perl again? :) )

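You can do this trivially in the shell without Perl (or via a system() call or backticks in a parent Perl script), provided each file holds one word per line and is sorted, since comm compares sorted lines: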

comm -12 A B | tr "\012" " " 

To call this from Perl:

my $new_text = `comm -12 A B | tr "\012" " " `;

But see my last comment for why this may be considered "bad Perl", at least if you do it in a loop over very many files and care about performance.

DVK
  • OK, I just saw your second example and will try to fix for that... it's a bit more complicated this way if the word order matters – DVK May 24 '10 at 01:32
  • Ha3..sorry for the edit..It's a bit confusing since my first time using Perl but a big thanks for the reply.. :) – Randy May 24 '10 at 01:37
  • @Randy - please see my question in the comment. Do you really care about how the common words are output? – DVK May 24 '10 at 01:38
  • No, if the question is about the line, I've made it so the document just have one line. – Randy May 24 '10 at 01:42
  • emm..the common words or stopwords in the sentence have been removed, so only the important words are left – Randy May 24 '10 at 01:44
  • @Randy - I mean, if the answer will be 1 word per line, is that OK? – DVK May 24 '10 at 01:46
  • @Randy - OK, see my solution #2 for the Perl version and #3 for the shell command... the latter's a lot more concise but is not Good Perl Practice, as it will spawn off 2 separate child processes, which is bad for performance if it happens many times in a loop. – DVK May 24 '10 at 01:50
  • @DVK..you code really fast..it's really a long way to go for me.. :) anw, I'll implement it to my program and give the news soon. Thank you Randy – Randy May 24 '10 at 01:51
  • @Randy - artifact of having competed in programming contests... but in this case I'd say hold the compliments till you verify that the code actually works, since I couldn't test it :) – DVK May 24 '10 at 01:53
  • @DVK it works..thanks!I modify it a little so I can use it with Matlab.. :) thanks again – Randy May 24 '10 at 13:37
  • @Randy - you're welcome. Feel free to indicate whether the answer was helpful by the standard StackOverflow methods: (1) up-voting the answer (up-arrow next to it) and (2) "accepting" it (checkmark next to the answer). Cheers, and welcome to the wonderful world of Perl, where possible things are easy and impossible things are doable :) – DVK May 24 '10 at 13:42