1

I have two text blocks, each containing the names of hundreds of companies; the newer block has more companies. How do I remove the company names that appear in both blocks, so that I am left with only the new names? Sample text block one:

Company Name One, Address line 1, line 2, phone, email
..
Random text 
..

Company Name Two 
Address, Phone,email
..
Random text 
..
Company Name 3 
Address, Phone,email

Sample text block two:

..Random Text..
M/s Company Name One Extra Random Text, Address line 1, line 2, phone, 
..random text...
M/s Company Name Two 
Address, Phone
...

The company names, addresses, etc. are similar but not identical. The second block has the prefix M/s before every company name. I would like to do this in PHP, perhaps using regex.

I would also like to output the company names that match. For example, for the sample above I would like to output that Company Name One and Company Name Two are common to both text blocks.

Update: thanks to @Wrikken, I have the text in two strings. I can explode the second block using the M/s, and get an array. How do I then check each item from this array to match the first text block which is one long string?

Although I have since done the job manually, I would still like to know how two text blocks can be compared for similarity, hence the bounty.

Update: Output for @Joyce Babu's code

..Random Text.. ..Random Text.. ..Random Text.. ..Random Text.. ..Random Text.. ..Random Text.. ..Random Text.. ..Random Text.. ..Random Text.. ..Random Text.. M/s Company Name One Extra Random Text, Address line 1, line 2, phone, M/s Company Name One Extra Random Text, Address line 1, line 2, phone, M/s Company Name One Extra Random Text, Address line 1, line 2, phone, M/s Company Name One Extra Random Text, Address line 1, line 2, phone, M/s Company Name One Extra Random Text, Address line 1, line 2, phone, M/s Company Name One Extra Random Text, Address line 1, line 2, phone, M/s Company Name One Extra Random Text, Address line 1, line 2, phone, M/s Company Name One Extra Random Text, Address line 1, line 2, phone, M/s Company Name One Extra Random Text, Address line 1, line 2, phone, M/s Company Name One Extra Random Text, Address line 1, line 2, phone, M/s Company Name One Extra Random Text, Address line 1, line 2, phone, ..random text... ..random text... ..random text... ..random text... ..random text... ..random text... ..random text... ..random text... ..random text... ..random text... M/s Company Name Two M/s Company Name Two M/s Company Name Two M/s Company Name Two M/s Company Name Two M/s Company Name Two M/s Company Name Two M/s Company Name Two M/s Company Name Two M/s Company Name Two Address, Phone Address, Phone Address, Phone Address, Phone Address, Phone Address, Phone Address, Phone Address, Phone Address, Phone Address, Phone ... ... ... ... ... ... ... ... ... ... ... ... 

Output for @nikic

array(2) { [0]=>  string(17) "..Random Text.. " [4]=>  string(16) "Address, Phone " } 

Output for @Joyce Babu second post

andom Text..andom Text..andom Text..andom Text..andom Text..andom Text..andom Text..andom Text..andom Text..andom Text..Company Name One Extra Random Text, Address line 1, line 2, phone,Company Name One Extra Random Text, Address line 1, line 2, phone,Company Name One Extra Random Text, Address line 1, line 2, phone,Company Name One Extra Random Text, Address line 1, line 2, phone,Company Name One Extra Random Text, Address line 1, line 2, phone,Company Name One Extra Random Text, Address line 1, line 2, phone,Company Name One Extra Random Text, Address line 1, line 2, phone,Company Name One Extra Random Text, Address line 1, line 2, phone,Company Name One Extra Random Text, Address line 1, line 2, phone,Company Name One Extra Random Text, Address line 1, line 2, phone,Company Name One Extra Random Text, Address line 1, line 2, phone,andom text...andom text...andom text...andom text...andom text...andom text...andom text...andom text...andom text...andom text...Company Name TwoCompany Name TwoCompany Name TwoCompany Name TwoCompany Name TwoCompany Name TwoCompany Name TwoCompany Name TwoCompany Name TwoCompany Name Tworess, Phoneress, Phoneress, Phoneress, Phoneress, Phoneress, Phoneress, Phoneress, Phoneress, Phoneress, Phone

@Joyce Babu Final Code

<?php
set_time_limit(500);
$arOld = file('olddata.txt');
$arNew = file('newdata.txt');
$G = 0;
$c = 0;

foreach ($arNew as $line) {
    if (substr($line, 0, 4) == 'M/s ') {
        $c++;
        echo "<BR/>" . $c . ".)";
        $line = trim(substr($line, 4));
        foreach ($arOld as $old) {
            similar_text($line, $old, $percentage);
            if ($percentage > 80) {
                continue;
            }
        }
        echo $line;
    } else {
        $G++;
    }
}
echo "<br/>" . $G . " DID NOT MATCH";
?>

Output from @Joyce Babu's final code

1.)Company Name One Extra Random Text, Address line 1, line 2, phone,
2.)Company Name Two
4 DID NOT MATCH
abel
  • Do the two text blocks look exactly as in your sample text? If that's the case, what do you want to keep? What do you want to remove? If that's NOT the case, I think you should post some sample text that looks exactly as your two text blocks. If you do, I may be able to help you out. – matsolof Sep 23 '10 at 10:53
  • Is it one line per record? Can there be lines with random data between two records? – Joyce Babu Oct 04 '10 at 11:35
  • Some records are on multiple lines, there is no order or uniformity, the only thing common to the text blocks are the Company names. – abel Oct 04 '10 at 11:39
  • What token is used in block 1 to indicate the start of a company name? How do you know it's a company name and not random text? – bcosca Oct 04 '10 at 19:28
  • @stillstanding Nothing. There is no starting token for a company name in block 1. A company name starts on a new line, although other stuff may start on a new line too. – abel Oct 05 '10 at 09:14

5 Answers

2

Create an array of both (possibly using the file() function, depending on the format of the text, or possibly just an explode() on content), and use array_diff().
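A minimal sketch of this approach, assuming the company names can first be isolated and normalized. The normalizeName() helper and the sample arrays (including "Company Name Four") are hypothetical and not taken from the question. Note that array_diff() only removes exact string matches, so any tolerance for differences has to come from the normalization step:

```php
<?php
// Hypothetical helper: strip a leading "M/s", collapse whitespace, lowercase.
function normalizeName(string $name): string
{
    $name = preg_replace('~^\s*M/s\s+~i', '', $name);
    $name = preg_replace('~\s+~', ' ', trim($name));
    return strtolower($name);
}

// Made-up sample data standing in for names extracted from the two blocks.
$oldNames = ['Company Name One', 'Company Name Two ', 'Company Name 3'];
$newNames = ['M/s Company Name One', 'M/s Company Name Two', 'M/s Company Name Four'];

$old = array_map('normalizeName', $oldNames);
$new = array_map('normalizeName', $newNames);

// Names that appear only in the new list (exact match after normalization).
$onlyNew = array_values(array_diff($new, $old));
// Names common to both lists.
$common = array_values(array_intersect($new, $old));

print_r($onlyNew);
print_r($common);
```

This only works when a name can be reduced to an identical string in both blocks; names followed by extra text on the same line would need a similarity measure instead.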

Wrikken
2

Try this

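<?php
set_time_limit(500);
// FILE_IGNORE_NEW_LINES drops the trailing newline from every line.
$arOld = file('olddata.txt', FILE_IGNORE_NEW_LINES);
$arNew = file('newdata.txt', FILE_IGNORE_NEW_LINES);
foreach ($arNew as $line) {
    // "M/s " is four characters long, so compare and strip four characters.
    if (substr($line, 0, 4) === 'M/s ') {
        $line = trim(substr($line, 4));
        foreach ($arOld as $old) {
            similar_text($line, trim($old), $percentage);
            if ($percentage > 80) {
                // Similar to an old line: move on to the next new line without echoing.
                continue 2;
            }
        }
        // No old line was similar enough, so this name is new.
        echo $line, "\n";
    }
}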
Joyce Babu
1

If you need to compare these lists only once, I'd suggest converting the docs to plain text; you'll then be able to compare them using regex. Otherwise you'll need third-party software to access the information in the docs, perhaps as described in "Reading/Writing a MS Word file in PHP".

Sergey Eremin
1
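$oldList = file('oldList.txt', FILE_IGNORE_NEW_LINES);
$newList = file('newList.txt', FILE_IGNORE_NEW_LINES);
$list = array_udiff($newList, $oldList, 'compare');

// array_udiff() may pass elements from either list in either position,
// so strip the optional "M/s " prefix from both arguments.
function compare($a, $b) {
    $a = preg_replace('~^M/s\s+~', '', trim($a));
    $b = preg_replace('~^M/s\s+~', '', trim($b));
    similar_text($a, $b, $percent);
    // Lines that are at least 80% similar count as equal; otherwise return
    // a real ordering, which array_udiff() needs for its internal sorting.
    return $percent >= 80 ? 0 : strcmp($a, $b);
}

This is my basic idea: find all lines that are at least 80% similar to a line in $oldList and remove them from $newList. You should adjust the percentage to suit your needs. The M/s prefix is stripped inside compare(). Because a similarity threshold is not a strict ordering, array_udiff() can still miss some matches.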
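The same 80% threshold can also be applied with a plain loop instead of a comparison callback, which avoids relying on sort order. A minimal sketch, assuming the company names have already been extracted into arrays; hasSimilar() and the sample data are hypothetical and not part of the answer:

```php
<?php
// Hypothetical helper: true if $name is at least $threshold percent
// similar (per similar_text) to any entry of $oldNames.
function hasSimilar(string $name, array $oldNames, float $threshold = 80.0): bool
{
    foreach ($oldNames as $old) {
        similar_text(strtolower($name), strtolower($old), $percent);
        if ($percent >= $threshold) {
            return true;
        }
    }
    return false;
}

// Made-up sample data standing in for the two lists.
$oldNames = ['Company Name One', 'Company Name Two'];
$newLines = ['M/s Company Name One', 'M/s Company Name Two', 'M/s Zebra Works'];

$onlyNew = [];
foreach ($newLines as $line) {
    // Drop the "M/s " prefix before comparing.
    $name = trim(preg_replace('~^M/s\s+~', '', $line));
    if (!hasSimilar($name, $oldNames)) {
        $onlyNew[] = $name;
    }
}

print_r($onlyNew);
```

similar_text() gets expensive on long strings, so comparing short extracted names rather than whole lines is likely to help with execution time limits as well.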

NikiC
  • Thanks for the code. I get a 60 s timeout when comparing the two blocks (each block is around 80 KB). – abel Oct 04 '10 at 11:11
  • I added a var_dump($list); the output is posted in the original question – abel Oct 04 '10 at 11:32
1

If there are no key fields for uniquely identifying the records, I think you will have to use something like similar_text or levenshtein.

$arOld = file('olddata.txt');
$arNew = file('newdata.txt');
foreach ($arNew as $line) {
    $line = trim(substr($line, 3));
    foreach ($arOld as $old) {
        similar_text($line, $old, $percentage);
        if ($percentage < 60) {
            echo $line;
        }
    }
}
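To get a feel for the two measures before choosing a threshold, it can help to score a single pair of names with both. A small sketch using the sample names from the question; the variable names are illustrative only:

```php
<?php
$a = 'Company Name One';
$b = 'Company Name Two';

// similar_text() returns the number of matching characters and
// writes the similarity percentage into its third argument.
$common = similar_text($a, $b, $percent);

// levenshtein() returns the number of single-character edits
// (insert, delete, replace) needed to turn $a into $b.
$distance = levenshtein($a, $b);

printf("similar_text: %d common chars, %.2f%%\n", $common, $percent);
printf("levenshtein: %d edits\n", $distance);
```

These two different names score above 80% with similar_text(), because they share a long common prefix. A fixed percentage threshold can therefore treat distinct companies as duplicates, so the cut-off needs tuning against real data.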
Joyce Babu
  • Undefined variable new on line 6 – abel Oct 04 '10 at 11:15
  • It is $line, not $new. Sorry. – Joyce Babu Oct 04 '10 at 11:19
  • I ran the script using samples from the orig post. the output is posted in the orig question – abel Oct 04 '10 at 11:26
  • On second thought, it is not going to work. It requires a little modification to work. Now it will print lots of lines. – Joyce Babu Oct 04 '10 at 11:27
  • Yes, it does print out a lot of lines. The principle would be to match every word from one text block against all the words of the second text block and then echo those which match. However, company names are multiple words.... – abel Oct 04 '10 at 11:35