There are four files: a.txt, b.txt, c.txt, and d.txt.
Each file has only one column of data that consists of names of shops/malls/restaurants etc. Essentially they are just names.
I need a program that can match the names in a.txt to the names in each of the other three files (b.txt, c.txt, d.txt). By match, I mean the program should mark a row in a.txt as matched if it contains a name that appears in any of the three other files. The matching needs to be intelligent: if one file includes a word like "restaurant" and the other doesn't, the names should still match. So we need to come up with some heuristics to get good matches.
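As a starting point, the "row contains a name" rule could be sketched as below. This is only an illustration under assumptions: the helper names (tokens, contains_name, mark_matches) are hypothetical, names are compared case-insensitively, and punctuation such as "-", "," and "(" is ignored.

```python
import re

def tokens(name):
    """Lowercase a name and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", name.lower())

def contains_name(row, candidate):
    """True if the candidate's tokens appear as a contiguous run in the row."""
    row_t, cand_t = tokens(row), tokens(candidate)
    n = len(cand_t)
    if n == 0:
        return False
    return any(row_t[i:i + n] == cand_t for i in range(len(row_t) - n + 1))

def mark_matches(a_rows, other_names):
    """Pair every row of a.txt with a flag saying whether any name matched."""
    return [(row, any(contains_name(row, c) for c in other_names))
            for row in a_rows]
```

Comparing whole tokens rather than raw substrings keeps "Ivan" from matching inside an unrelated name such as "Ivanhoe".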
I want matches that are perfect, e.g. if a.txt has one of the following:
Ivan Restaurant - Bukit Timah Road, Singapore
Ivan Restaurant - Bukit Timah Road, 12345 Singapore
Ivan Restaurant - Bukit Timah Road, 12345
Ivan Restaurant - 12345, Singapore
Ivan Restaurant Bukit Timah Road, Singapore
Ivan Restaurant Bukit Timah Road, 12345 Singapore
Ivan Restaurant Bukit Timah Road, 12345
Ivan Restaurant 12345, Singapore
Ivan Restaurant ( Bukit Timah Road, Singapore)
Ivan Restaurant ( Bukit Timah Road, 12345 Singapore)
Ivan Restaurant ( Bukit Timah Road, 12345)
Ivan Restaurant ( 12345, Singapore)
or any such variation of "Ivan Restaurant", and b.txt, c.txt, or d.txt has any of the following:
Ivan
Ivan restaurant
then only the complete "Ivan restaurant" should match. However, if there is no "Ivan restaurant" in b.txt, c.txt, or d.txt and only "Ivan" is present there, then strip out common words like "restaurant" from the a.txt row and try to match again.
I hope you get the idea. The same approach applies to shops, buildings, malls, etc. This is what I mean by heuristics.
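The two-step rule above could be sketched roughly as follows. Everything here is an assumption made for illustration: the GENERIC_WORDS list, the helper names, and the choice to treat a candidate as "complete" only when it is not immediately followed by a generic word in the a.txt row.

```python
import re

# Hypothetical list of common words; extend it for shops, buildings, malls, etc.
GENERIC_WORDS = {"restaurant", "cafe", "shop", "mall", "building"}

def tokens(name):
    """Lowercase a name and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", name.lower())

def strip_generic(toks):
    """Drop the common words from a token list."""
    return [t for t in toks if t not in GENERIC_WORDS]

def run_starts(haystack, needle):
    """Indexes where needle occurs as a contiguous run inside haystack."""
    n = len(needle)
    if n == 0:
        return []
    return [i for i in range(len(haystack) - n + 1) if haystack[i:i + n] == needle]

def best_match(row, candidates):
    """Return (candidate, pass_name) for the best match, or None."""
    row_t = tokens(row)
    # Ignore candidates made only of generic words; try the longest names first.
    ranked = sorted(
        ((c, tokens(c)) for c in candidates if strip_generic(tokens(c))),
        key=lambda pair: -len(pair[1]),
    )
    # Pass 1: the complete name, i.e. not cut off right before a generic word.
    for cand, c_t in ranked:
        for i in run_starts(row_t, c_t):
            end = i + len(c_t)
            if end == len(row_t) or row_t[end] not in GENERIC_WORDS:
                return cand, "full"
    # Pass 2: strip the common words from both sides and try again.
    row_s = strip_generic(row_t)
    for cand, c_t in ranked:
        if run_starts(row_s, strip_generic(c_t)):
            return cand, "stripped"
    return None
```

With candidates "Ivan" and "Ivan restaurant", a row such as "Ivan Restaurant - Bukit Timah Road, Singapore" is reported against "Ivan restaurant" in the first pass; if only "Ivan" is available, the match is found in the second pass after "restaurant" is stripped.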