String matches in Python

Question

There are four files, a.txt, b.txt, c.txt, d.txt.

Each file has only one column of data that consists of names of shops/malls/restaurants etc. Essentially they are just names.

I need a program that matches the names in a.txt against the names in each of the other three files (b.txt, c.txt, d.txt). A row in a.txt should be marked as matched if it contains a name that appears in any of the other three files. The matching needs to be intelligent: if one file includes a word such as "restaurant" and the other doesn't, the names should still match. So we need to come up with some heuristics to get good matches.

I want matches that are perfect, e.g. if a.txt has one of the following

Ivan Restaurant - Bukit Timah Road, Singapore
Ivan Restaurant - Bukit Timah Road, 12345 Singapore
Ivan Restaurant - Bukit Timah Road, 12345
Ivan Restaurant - 12345, Singapore 
Ivan Restaurant  Bukit Timah Road, Singapore
Ivan Restaurant  Bukit Timah Road, 12345 Singapore
Ivan Restaurant  Bukit Timah Road, 12345
Ivan Restaurant  12345, Singapore 
Ivan Restaurant ( Bukit Timah Road, Singapore)
Ivan Restaurant ( Bukit Timah Road, 12345 Singapore)
Ivan Restaurant ( Bukit Timah Road, 12345)
Ivan Restaurant ( 12345, Singapore)

or any such variation of "Ivan Restaurant" and b.txt or c.txt or d.txt has any of the following

Ivan
Ivan restaurant

Then only the complete "Ivan restaurant" should match. However, if there is no "Ivan restaurant" in b.txt, c.txt or d.txt and only "Ivan" is present there, then strip the common words like "restaurant" from the a.txt entry and try the match again.

I hope you get the idea. Similar approach for shops, buildings, malls etc. This is what I meant by heuristic.

If I understand your description correctly you can just build a `set()` with all the words of `b.txt`, `c.txt`, and `d.txt` and then loop over the words of `a.txt` and check if it is part of this set. If you need to know more info about the word, then you can use a `map`, that maps from the word to the relevant info, e.g. whether the word was in `b.txt` and from what line. — Dov Grobgeld, Dec 18 '11 at 08:02
@user1077645 - this site is for helping with problems you have with code you've written. If you want somebody to write a solution from scratch for you, try [Elance](https://www.elance.com/) or [vWorker](http://www.vworker.com/) or one of the myriad other such services. — Blair, Dec 18 '11 at 09:20
I have already tried it up by different methods, but not found any proper solution, thats why.. — Anoop, Dec 18 '11 at 11:27

score 3 · Answer 1 · edited May 23 '17 at 12:20

# Collect every line from b.txt, c.txt and d.txt into a set (duplicates collapse).
with open('b.txt') as b_fp, open('c.txt') as c_fp, open('d.txt') as d_fp:
    data = set(b_fp.readlines() +
               c_fp.readlines() +
               d_fp.readlines())

# Report each line of a.txt that appears verbatim in the set.
with open('a.txt') as fp:
    for line in fp:
        if line in data:
            print("Matched %s" % line.strip())

See: Multiple variables in Python 'with' statement for reference on opening several files in one with statement.

As a short explanation: it first reads all the lines of b.txt, c.txt and d.txt into a set, which eliminates duplicates. It then reads a.txt line by line and checks each line against the set. The strip on the print line only strips off the trailing \n for display. You might want to strip before matching instead, because the last line of a file may have no trailing \n and would then never equal the same name followed by \n.
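Stripping before matching could look like the minimal sketch below. The helper names `normalize` and `find_matches` are hypothetical, and ignoring case is an added assumption (the question wants "Ivan restaurant" to match "Ivan Restaurant"). It works on in-memory lists instead of the four files so it can be run as is.

```python
def normalize(line):
    """Drop surrounding whitespace (including the trailing newline) and ignore case."""
    return line.strip().casefold()


def find_matches(a_lines, other_lines):
    """Return the stripped lines of a_lines whose normalized form occurs in other_lines."""
    # Blank lines are skipped so that an empty string never counts as a known name.
    known = {normalize(line) for line in other_lines if line.strip()}
    return [line.strip() for line in a_lines if normalize(line) in known]


# Hypothetical sample data standing in for a.txt and for b.txt/c.txt/d.txt combined.
a_lines = ["Ivan Restaurant\n", "Some Mall\n", "Corner Shop"]
other_lines = ["ivan restaurant\n", "corner shop\n", "\n"]

for name in find_matches(a_lines, other_lines):
    print("Matched %s" % name)
```

With real files, the lists would simply be replaced by the open file objects, since both helpers only iterate over lines.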

Anyway, just tested it, and it seems to work.

score 1 · Accepted Answer · answered Dec 18 '11 at 09:32

Blubber's solution is excellent but may not satisfy the following criterion:

or any such variation of "Ivan Restaurant" and b.txt or c.txt or d.txt has any of the following

Ivan
Ivan restaurant

Then only the complete "Ivan restaurant" should match. However, if there is no "Ivan restaurant" in b.txt, c.txt or d.txt and only "Ivan" is present there, then strip the common words like "restaurant" from the a.txt entry and try the match again.

To make Blubber's solution work for you, you might prefer difflib.get_close_matches, which returns the candidates most similar to a given string, best match first. If that does not work for you, take a look at how difflib computes similarity. Please note that heuristic matching is not an easy thing. There are libraries such as Levenshtein that you might want to experiment with, but which one works for you depends entirely on your acceptance criteria and the patterns in your data. I would suggest trying these libraries and seeing which suits you best.
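To get a feel for how difflib scores candidates, a small experiment like the following may help. The candidate names and the cutoff values are made up for illustration; `n` and `cutoff` are the documented keyword arguments of get_close_matches (defaults 3 and 0.6).

```python
import difflib

# Hypothetical candidates standing in for the contents of b.txt, c.txt and d.txt.
candidates = ["Ivan", "Ivan restaurant", "Bukit Timah Plaza"]
query = "Ivan Restaurant"

# SequenceMatcher.ratio() is the similarity score that get_close_matches ranks by:
# 2 * (matched characters) / (total characters in both strings), between 0 and 1.
for candidate in candidates:
    score = difflib.SequenceMatcher(None, candidate, query).ratio()
    print("%-20s %.2f" % (candidate, score))

# n caps how many results come back; cutoff is the minimum score to be returned.
loose = difflib.get_close_matches(query, candidates, n=3, cutoff=0.3)
strict = difflib.get_close_matches(query, candidates, n=3, cutoff=0.9)
print(loose)
print(strict)
```

One thing to watch for: a short name scores low against a long address line. "Ivan" against "Ivan Restaurant - Bukit Timah Road, Singapore" shares at most 4 characters out of 49 in total, a score of about 0.16, which is well under the default cutoff. So you would probably want to cut the address part off before comparing.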

Just to expand Blubber's solution to incorporate difflib

import difflib

# Collect every line from b.txt, c.txt and d.txt into a set (duplicates collapse).
with open('b.txt') as b_fp, open('c.txt') as c_fp, open('d.txt') as d_fp:
    data = set(b_fp.readlines() +
               c_fp.readlines() +
               d_fp.readlines())

with open('a.txt') as fp:
    for line in fp:
        # Fuzzy lookup instead of the exact test `if line in data`.
        match = difflib.get_close_matches(line, data)
        if match:
            # match[0] is the closest candidate; strip the newlines for display.
            print("({0}) matches with ({1})".format(line.strip(), match[0].strip()))
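The two-stage rule from the question (complete name first, then the name without common words) can also be sketched without any fuzzy matching. Everything named here is hypothetical: `COMMON_WORDS`, `base_name`, `strip_common` and `match` are illustrative helpers, and the sketch assumes the name is whatever precedes the first " - ", "(", "," or run of two or more spaces, which fits the sample lines but would truncate names that themselves contain those separators.

```python
import re

# Assumption: an illustrative list; extend it with the words common in your data.
COMMON_WORDS = {"restaurant", "shop", "mall", "building"}


def base_name(line):
    """Keep the text before the first ' - ', '(', ',' or double space, lowercased."""
    head = re.split(r"\s+-\s+|\(|,|\s{2,}", line.strip(), maxsplit=1)[0]
    return " ".join(head.casefold().split())


def strip_common(name):
    """Remove the common words from an already normalized name."""
    return " ".join(word for word in name.split() if word not in COMMON_WORDS)


def match(line, known):
    """Return the entry of `known` (a set of normalized names) that matches, or None."""
    name = base_name(line)
    if name in known:             # stage 1: the complete name is present
        return name
    short = strip_common(name)    # stage 2: retry without the common words
    if short and short in known:
        return short
    return None


# `known` would be built from b.txt, c.txt and d.txt; two sample entries here.
known = {base_name(name) for name in ["Ivan", "Ivan restaurant"]}
print(match("Ivan Restaurant ( Bukit Timah Road, 12345)", known))
print(match("Ivan Restaurant - 12345, Singapore", {"ivan"}))
```

Lines that still fail both stages could then be handed to difflib.get_close_matches as a last resort.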
