I want good pattern-matching code that can match these two strings exactly.

x = "Apple iPhone 6(Silver, 16 GB)"
y = "Apple iPhone 6 64 GB GSM Mobile Phone (Silver)"

Approach 1:
import string

# Lowercase, replace punctuation with spaces, and collapse repeated whitespace.
tmp_body = " ".join("".join([" " if ch in string.punctuation else ch.lower() for ch in y]).split())
tmp_body_1 = " ".join("".join([" " if ch in string.punctuation else ch.lower() for ch in x]).split())
if tmp_body in tmp_body_1:
    print("true")

In my problem, x will always be the base string and y will change.

Approach 2:
Fuzzy matching, but I was not getting good results with it.

Approach 3:
Using regex, which I don't know.

I am still figuring out how to solve it with regex. So far I have worked out these steps:

  1. Remove special characters from both the base and the incoming string
  2. Match the GB (capacity) and the color
  3. Split "GB" from the number for better matching
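
The three steps above could be sketched roughly as follows. This is only an illustration, not a complete solution: the helper names normalize and key_attributes and the KNOWN_COLORS set are assumptions made up for the example, and the color list would have to be extended for real data.

```python
import re

# Hypothetical color vocabulary (an assumption); extend it for real listings.
KNOWN_COLORS = {"silver", "gold", "black", "white", "grey", "gray"}

def normalize(text):
    """Lowercase, strip special characters, and split digits from letters."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9]+", " ", text)        # step 1: drop special characters
    text = re.sub(r"(\d)([a-z])", r"\1 \2", text)  # step 3: "16gb" -> "16 gb"
    return " ".join(text.split())

def key_attributes(text):
    """Return (capacity in GB or None, set of known colors found)."""
    words = normalize(text).split()
    capacity = None
    # Step 2: look for a number immediately followed by the word "gb".
    for number, unit in zip(words, words[1:]):
        if number.isdigit() and unit == "gb":
            capacity = int(number)
            break
    colors = KNOWN_COLORS.intersection(words)
    return capacity, colors

x = "Apple iPhone 6(Silver, 16 GB)"
y = "Apple iPhone 6 64 GB GSM Mobile Phone (Silver)"

print(key_attributes(x))
print(key_attributes(y))
print(key_attributes(x) == key_attributes(y))
```

Here both strings yield the color silver, but the capacities are 16 and 64, so the final comparison is False.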

John Dene

2 Answers

How about the following approach? Split each string into words, lowercase each word, and store the words in a set. The set for x must then be a subset of the set for y. Your example fails because 16 does not match 64:

import re

x = "Apple iPhone 6(Silver, 16 GB)"
y = "Apple iPhone 6 64 GB GSM Mobile Phone (Silver)"

# Collect the lowercased alphanumeric words of each string.
set_x = set(item.lower() for item in re.findall("([a-zA-Z0-9]+)", x))
set_y = set(item.lower() for item in re.findall("([a-zA-Z0-9]+)", y))

print(set_x)
print(set_y)

print(set_x.issubset(set_y))

Giving the following results (as printed by Python 2; the order of elements within a set may vary):

set(['apple', '16', 'gb', '6', 'silver', 'iphone'])
set(['apple', 'mobile', 'phone', '64', 'gb', '6', 'gsm', 'silver', 'iphone'])
False

If 64 is changed to 16 then you get:

set(['apple', '16', 'gb', '6', 'silver', 'iphone'])
set(['apple', '16', 'mobile', 'phone', 'gb', '6', 'gsm', 'silver', 'iphone'])
True
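
Since x is a fixed base string and y varies, the same check could be wrapped in a function and applied to a list of incoming strings. A minimal sketch; the names tokens and matches_base and the sample candidates list are made up for illustration:

```python
import re

def tokens(text):
    """Lowercased alphanumeric words of text, as a set."""
    return {word.lower() for word in re.findall(r"[a-zA-Z0-9]+", text)}

def matches_base(base, candidate):
    """True if every word of base also appears in candidate."""
    return tokens(base).issubset(tokens(candidate))

base = "Apple iPhone 6(Silver, 16 GB)"
# Hypothetical incoming strings to test against the base.
candidates = [
    "Apple iPhone 6 64 GB GSM Mobile Phone (Silver)",
    "Apple iPhone 6 16 GB GSM Mobile Phone (Silver)",
]

matching = [c for c in candidates if matches_base(base, c)]
print(matching)
```

Only the 16 GB listing is kept. Note that a listing written as "16GB" with no space produces the single token 16gb and would not match, so the number may need to be split from the unit first.
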
Martin Evans
  • @JohnDene I have a question that is very similar to your problem: when you need to match your base string against a database of millions of records, is that possible with the approach you are using? If yes, can you please provide the database query for such an operation? – Rajendra Khabiya Oct 06 '15 at 05:41
  • What type of DB are you using? I can tell you for MongoDB. The result is not as effective as I would like, but it works well, and applying this approach to millions of records is no problem. @RajendraKhabiya – John Dene Oct 06 '15 at 16:49
  • I am using MySQL with Apache Solr and PHP. Your suggestions are helpful for me. Thanks! – Rajendra Khabiya Oct 07 '15 at 05:30

It looks like you're trying to find the longest common substring of two unknown strings. See "Find common substring between two strings".

Regex only works when you have a known pattern to your strings. You could use LCS to derive a pattern that you could use to test additional strings, but I don't think that's what you want.
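
For reference, one way to compute a longest common substring in Python is difflib.SequenceMatcher from the standard library. This is a minimal sketch, assuming the two strings have already been lowercased and stripped of punctuation; the function name longest_common_substring is made up for the example.

```python
from difflib import SequenceMatcher

def longest_common_substring(a, b):
    """Longest contiguous run of characters shared by a and b."""
    # autojunk=False turns off the popularity heuristic used on long inputs.
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    match = matcher.find_longest_match(0, len(a), 0, len(b))
    return a[match.a:match.a + match.size]

# The two sample strings, lowercased and with punctuation removed.
x = "apple iphone 6 silver 16 gb"
y = "apple iphone 6 64 gb gsm mobile phone silver"

print(repr(longest_common_substring(x, y)))
```

For these two strings the result is 'apple iphone 6 ' (with a trailing space). It says nothing about capacity or color, because those appear at different positions in each string.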

If you want to extract the capacity, model, and other information from these strings, you may want to use multiple patterns, one for each piece of information. Some information may not be available. Your regular expressions will need to flex in order to handle a wider range of input (it is hard for me to anticipate all variations given a sample size of 2).

import re
# useragent is the product string being parsed, e.g. x or y from the question
capacity = re.search(r'(\d+)\s*GB', useragent)
model = re.search(r'Apple iPhone ([A-Za-z0-9]+)', useragent)

These patterns won't make much sense to you unless you read the Python re module documentation. Basically, for capacity, I'm searching for 1 or more digits followed by 0 or more whitespace characters followed by GB. If there is a match, the result is a match object, and I can get the digits with match.group(1) (match.group() with no argument returns the whole matched text, e.g. "16 GB"). If there is no match, re.search returns None. It is a similar story for finding the iPhone version, though my pattern doesn't work for "6 Plus".
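
To make the match-object handling concrete, here is a small sketch using the two patterns above. The function name extract and the layout of the returned dict are assumptions for illustration. Because re.search returns None when nothing matches, each result is checked before group(1), the first parenthesized group, is read.

```python
import re

CAPACITY_RE = re.compile(r'(\d+)\s*GB')
MODEL_RE = re.compile(r'Apple iPhone ([A-Za-z0-9]+)')

def extract(text):
    """Pull capacity and model out of text; missing fields become None."""
    capacity = CAPACITY_RE.search(text)
    model = MODEL_RE.search(text)
    return {
        "capacity_gb": int(capacity.group(1)) if capacity else None,
        "model": model.group(1) if model else None,
    }

print(extract("Apple iPhone 6(Silver, 16 GB)"))
print(extract("Apple iPhone 6 64 GB GSM Mobile Phone (Silver)"))
print(extract("Samsung Galaxy S6"))
```

The first two strings give the same model ("6") but capacities of 16 and 64, and the third gives None for both fields, so the extracted fields can be compared one by one instead of comparing whole strings.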

Since you have no control over how these strings are generated, if this is a script that you plan on using 3 years from now, expect to be a slave to it, updating the regular expression patterns as new string formats appear. Hopefully this is a one-off number-crunching exercise that can be scrapped as soon as you have answered your question.

IceArdor
  • This doesn't seem to solve my problem, because the output I am getting is "apple iphone 6", whereas I want to match the 16 GB and the silver color as well. – John Dene Aug 27 '15 at 06:32