Comparing text blocks for similar content

Question

I have two text blocks containing company names. Each contains the names of hundreds of companies; the newer list has more. How do I remove the company names that appear in both lists, so that I am left with only the new names? Sample text block one:

Company Name One, Address line 1, line 2, phone, email
..
Random text 
..

Company Name Two 
Address, Phone,email
..
Random text 
..
Company Name 3 
Address, Phone,email

Sample text block two:

..Random Text..
M/s Company Name One Extra Random Text, Address line 1, line 2, phone, 
..random text...
M/s Company Name Two 
Address, Phone
...

The company name, address, etc. are similar, not identical. The second block has the prefix M/s before every company name. I would like to do this in PHP, perhaps using regex.

I would like to output the company names that match. For example, given the sample above, the output should say that Company Name One and Company Name Two are common to both text blocks.

Update: thanks to @Wrikken, I have the text in two strings. I can explode the second block using the M/s, and get an array. How do I then check each item from this array to match the first text block which is one long string?

Although I have since done the job manually, I would still like to know how two text blocks can be compared for similarity, hence the bounty.

Update: Output for @Joyce Babu code

..Random Text.. ..Random Text.. ..Random Text.. ..Random Text.. ..Random Text.. ..Random Text.. ..Random Text.. ..Random Text.. ..Random Text.. ..Random Text.. M/s Company Name One Extra Random Text, Address line 1, line 2, phone, M/s Company Name One Extra Random Text, Address line 1, line 2, phone, M/s Company Name One Extra Random Text, Address line 1, line 2, phone, M/s Company Name One Extra Random Text, Address line 1, line 2, phone, M/s Company Name One Extra Random Text, Address line 1, line 2, phone, M/s Company Name One Extra Random Text, Address line 1, line 2, phone, M/s Company Name One Extra Random Text, Address line 1, line 2, phone, M/s Company Name One Extra Random Text, Address line 1, line 2, phone, M/s Company Name One Extra Random Text, Address line 1, line 2, phone, M/s Company Name One Extra Random Text, Address line 1, line 2, phone, M/s Company Name One Extra Random Text, Address line 1, line 2, phone, ..random text... ..random text... ..random text... ..random text... ..random text... ..random text... ..random text... ..random text... ..random text... ..random text... M/s Company Name Two M/s Company Name Two M/s Company Name Two M/s Company Name Two M/s Company Name Two M/s Company Name Two M/s Company Name Two M/s Company Name Two M/s Company Name Two M/s Company Name Two Address, Phone Address, Phone Address, Phone Address, Phone Address, Phone Address, Phone Address, Phone Address, Phone Address, Phone Address, Phone ... ... ... ... ... ... ... ... ... ... ... ...

Output for @nikic

array(2) { [0]=>  string(17) "..Random Text.. " [4]=>  string(16) "Address, Phone " }

Output for @Joyce Babu second post

andom Text..andom Text..andom Text..andom Text..andom Text..andom Text..andom Text..andom Text..andom Text..andom Text..Company Name One Extra Random Text, Address line 1, line 2, phone,Company Name One Extra Random Text, Address line 1, line 2, phone,Company Name One Extra Random Text, Address line 1, line 2, phone,Company Name One Extra Random Text, Address line 1, line 2, phone,Company Name One Extra Random Text, Address line 1, line 2, phone,Company Name One Extra Random Text, Address line 1, line 2, phone,Company Name One Extra Random Text, Address line 1, line 2, phone,Company Name One Extra Random Text, Address line 1, line 2, phone,Company Name One Extra Random Text, Address line 1, line 2, phone,Company Name One Extra Random Text, Address line 1, line 2, phone,Company Name One Extra Random Text, Address line 1, line 2, phone,andom text...andom text...andom text...andom text...andom text...andom text...andom text...andom text...andom text...andom text...Company Name TwoCompany Name TwoCompany Name TwoCompany Name TwoCompany Name TwoCompany Name TwoCompany Name TwoCompany Name TwoCompany Name TwoCompany Name Tworess, Phoneress, Phoneress, Phoneress, Phoneress, Phoneress, Phoneress, Phoneress, Phoneress, Phoneress, Phone

@Joyce Babu Final Code

<?php
set_time_limit(500);
$arOld = file('olddata.txt');
$arNew = file('newdata.txt');
$G = 0;
$c = 0;

foreach ($arNew as $line) {
    if (substr($line, 0, 4) == 'M/s ') {
        $c++;
        echo "<BR/>" . $c . ".)";
        $line = trim(substr($line, 4));
        foreach ($arOld as $old) {
            similar_text($line, $old, $percentage);
            if ($percentage > 80) {
                continue;
            }
        }
        echo $line;
    } else {
        $G++;
    }
}
echo "<br/>".$G . " DID NOT MATCH";
?>

Output from @Joyce Babu final code

1.)Company Name One Extra Random Text, Address line 1, line 2, phone,
2.)Company Name Two
4 DID NOT MATCH

Do the two text blocks look exactly as in your sample text? If that's the case, what do you want to keep? What do you want to remove? If that's NOT the case, I think you should post some sample text that looks exactly as your two text blocks. If you do, I may be able to help you out. — matsolof, Sep 23 '10 at 10:53
Is it one line per record? Can there be lines with random data between two records? — Joyce Babu, Oct 04 '10 at 11:35
Some records are on multiple lines, there is no order or uniformity, the only thing common to the text blocks are the Company names. — abel, Oct 04 '10 at 11:39
What token is used in block 1 to indicate the start of a company name? How do you know it's a company name and not random text? — bcosca, Oct 04 '10 at 19:28
@stillstanding nothing. no starting token for a company name in block 1. company name starts on a new line, although other stuff may start on a new line too. — abel, Oct 05 '10 at 09:14

score 2 · Answer 1 · answered Sep 23 '10 at 09:04


Create an array of both (possibly using the file() function, depending on the format of the text, or possibly just an explode() on content), and use array_diff().
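As a minimal sketch of this suggestion: array_diff() only matches values that are exactly equal, so the lines have to be normalised first. The helper normalize_name() and the inline sample arrays are assumptions for illustration (they stand in for the result of file() or explode() on the real data), and this only catches names that become identical after normalisation, not merely similar ones.

```php
<?php
// Hypothetical helper: normalise a line so trivially different spellings compare equal.
function normalize_name(string $line): string
{
    $line = preg_replace('~^\s*M/s\s+~i', '', $line); // drop a leading "M/s "
    $line = preg_replace('~\s+~', ' ', $line);         // collapse runs of whitespace
    return strtolower(trim($line));
}

// Inline sample data standing in for file('old.txt') and file('new.txt').
$old = ["Company Name One\n", "Company Name Two \n"];
$new = ["M/s Company Name One\n", "M/s Company Name Two\n", "M/s Company Name Four\n"];

$oldNames = array_map('normalize_name', $old);
$newNames = array_map('normalize_name', $new);

// array_diff() keeps the values of the first array that are absent from the second.
$onlyInNew = array_values(array_diff($newNames, $oldNames));

// array_intersect() keeps the values present in both arrays.
$common = array_values(array_intersect($newNames, $oldNames));

print_r($onlyInNew);
print_r($common);
```

With the sample data, $onlyInNew holds the one name missing from the old list and $common holds the two shared names. Lines that carry extra text after the company name would still need a fuzzy comparison.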


Wrikken


Joyce Babu · Accepted Answer · 2010-10-04T11:47:35.270

2

Try this

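set_time_limit(500); // the nested loops are slow on large files
$arOld = file('olddata.txt');
$arNew = file('newdata.txt');
foreach($arNew as $line){
    if(substr($line, 0, 4) === 'M/s '){ // 'M/s ' is 4 characters, not 3
        $line = trim(substr($line, 4));
        foreach($arOld as $old){
            similar_text($line, $old, $percentage);
            if ($percentage > 80){
                continue 2; // similar to an old entry: skip to the next new line
            }
        }
        echo $line; // no old entry was similar, so treat this company as new
    }
}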

edited Oct 04 '10 at 11:47

answered Oct 04 '10 at 11:29

Joyce Babu



Can you answer my comment on the original post? – Joyce Babu Oct 04 '10 at 11:41
updated the code to check only lines beginning with M/S. Also fixed an error. – Joyce Babu Oct 04 '10 at 11:48
No output from the new code! I added an echo "Yes"; after line 5 – abel Oct 04 '10 at 12:35
... to check if the 'if cond' ever matches, even though there is an M/s at the beginning of many lines – abel Oct 04 '10 at 12:42
Oops! 'M/s ' is 4 characters. You need to change substr($line, 0, 3) to substr($line, 0, 4) – Joyce Babu Oct 04 '10 at 12:50
@joyce Babu Nice work. The work is not done yet, but enjoy the bounty! – abel Oct 05 '10 at 11:51
Thanks. I can't think of a perfect solution without a unique field. – Joyce Babu Oct 05 '10 at 15:07

score 1 · Answer 3 · edited May 23 '17 at 12:10


If you need to compare these lists only once, I'd suggest converting the docs to txt; then you'll be able to compare them using regex. Otherwise you'll need third-party software to access the info in the docs, perhaps as described here: Reading/Writing a MS Word file in PHP


answered Sep 23 '10 at 08:56

Sergey Eremin


not a public app. so I would only be too happy to copy paste into a form. – abel Sep 23 '10 at 08:59

NikiC · Answer 4 · 2010-09-28T14:27:49.930

1

$oldList = file('oldList.txt');
$newList = file('newList.txt');
$list = array_udiff($newList, $oldList, 'compare');

function compare($new, $old) {
    similar_text($old, substr($new, 3), $percent);
    return $percent >= 80 ? 0 : 1;
}

This is my basic idea: find all entries that are at least 80% similar and remove them from $newList. You should adjust the percentage to suit your needs. The M/s is removed by substr($new, 3).
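One caveat: array_udiff() is documented as taking a comparison callback that returns a value less than, equal to, or greater than zero, and a similarity test does not define a consistent ordering, so the result may be unreliable. A minimal sketch of the same 80% idea using array_filter() and an explicit loop follows. The helper has_similar() and the inline sample lists are assumptions for illustration, and it strips four characters so that the space after M/s goes too.

```php
<?php
// Hypothetical helper: true when $candidate is at least $threshold percent
// similar to any entry in $known, according to similar_text().
function has_similar(string $candidate, array $known, float $threshold): bool
{
    foreach ($known as $entry) {
        similar_text($candidate, trim($entry), $percent);
        if ($percent >= $threshold) {
            return true;
        }
    }
    return false;
}

// Inline sample data standing in for file('oldList.txt') and file('newList.txt').
$oldList = ["Company Name One\n", "Company Name Two\n"];
$newList = ["M/s Company Name One\n", "M/s Company Name Two\n", "M/s Acme Widgets Ltd\n"];

// Keep only the new entries that have no sufficiently similar old entry.
$list = array_values(array_filter($newList, function (string $new) use ($oldList): bool {
    $name = trim(substr($new, 4)); // 'M/s ' is four characters including the space
    return !has_similar($name, $oldList, 80.0);
}));

print_r($list);
```

This is still quadratic in the number of lines, so the timeout on large inputs is likely to remain unless the lists are reduced first.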

edited Sep 28 '10 at 14:27

answered Sep 28 '10 at 14:22

NikiC


thanks for the code. I get a 60s timeout when comparing the two blocks(each block is around 80kb) – abel Oct 04 '10 at 11:11
I added a var_dump($list); the output is posted in the original question – abel Oct 04 '10 at 11:32

Joyce Babu · Answer 5 · 2010-10-04T11:20:54.607

1

If there are no key fields for uniquely identifying the records, I think you will have to use something like similar_text or levenshtein.

$arOld = file('olddata.txt');
$arNew = file('newdata.txt');
foreach($arNew as $line){
    $line = trim(substr($line, 3));
    foreach($arOld as $old){
        similar_text($line, $old, $percentage);
        if ($percentage < 60){
            echo $line;
        }
    }
}
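To get a feel for the two functions before picking a threshold, here is a small sketch that scores one pair of names both ways. The sample names come from the question; the formula that turns the Levenshtein distance into a percentage is just one possible choice, not something either function provides.

```php
<?php
// Two sample names that differ only in their final word.
$a = 'Company Name One';
$b = 'Company Name Two';

// similar_text() returns the number of matching characters and, through its
// third (by-reference) argument, a similarity percentage.
$matching = similar_text($a, $b, $percent);

// levenshtein() returns the minimum number of single-character insertions,
// deletions and replacements needed to turn $a into $b.
$distance = levenshtein($a, $b);

// One possible (hypothetical) way to turn the distance into a 0-100 score.
$score = (1 - $distance / max(strlen($a), strlen($b))) * 100;

printf("similar_text: %d chars, %.2f%%\n", $matching, $percent);
printf("levenshtein: distance %d, score %.2f%%\n", $distance, $score);
```

Both measures put these two different companies at roughly 81%, which is above an 80% cut-off. So a threshold alone may not separate names that share a long prefix, and it probably needs tuning against the real data.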

edited Oct 04 '10 at 11:20

answered Oct 04 '10 at 09:42

Joyce Babu


Undefined variable new on line 6 – abel Oct 04 '10 at 11:15

It is $line, not $new. Sorry. – Joyce Babu Oct 04 '10 at 11:19
I ran the script using samples from the orig post. the output is posted in the orig question – abel Oct 04 '10 at 11:26

On second thought, it is not going to work. It requires a little modification to work. Now it will print lots of lines. – Joyce Babu Oct 04 '10 at 11:27
Yes, it does print out a lot of lines. The principle would be to match every word from one text block against all the words of the second text block and then echo those which match. However, company names are multiple words.... – abel Oct 04 '10 at 11:35
