
I have a string format with the following rules:

  • the string must be 15 characters long
  • the first 8 characters are a date

Example: '2009060712ab56c'

Let's say I want to compare this with another string and get a percentage of format similarity, like:

result = format_similarity('2009060712ab56c', '20070908njndla56gjhk')

In this case, result would be, say, 80%.

Is there a way of doing this?

jonrsharpe
s900n
  • What do you mean by "format similarity"? Is [Levenshtein distance](https://en.wikipedia.org/wiki/Levenshtein_distance) enough? –  Nov 15 '18 at 12:39
  • Have you tried this https://stackoverflow.com/a/17388505/8835357 – specbug Nov 15 '18 at 12:41
  • Even easier, since - if I understand correctly - both strings are 15 characters long, simply iterate over the chars of both strings and count how many of them are equal. – quant Nov 15 '18 at 12:57
  • They aren't both 15 characters long. – Neil Nov 15 '18 at 13:01

2 Answers


Your format consists of two different attributes, which would be measured differently. How you combine them into an overall percentage of format similarity is a business logic question. For example, if a digit is missing at the start, is the string totally different because it no longer begins with a date, or is it still similar? But here is how you can get the measurements:

import re 

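def determine_similarity(string, other):
    length_string = len(string)  # use len to get the number of characters in the string
    length_other = len(other)
    number_of_numbers_string = _determine_number_of_numbers(string)
    number_of_numbers_other = _determine_number_of_numbers(other)

    # One possible metric (an assumption; the real weighting is a business decision):
    # score each attribute as smaller / larger, then average the two scores.
    def ratio(a, b):
        return 1.0 if a == b else min(a, b) / max(a, b)

    length_score = ratio(length_string, length_other)
    numbers_score = ratio(number_of_numbers_string, number_of_numbers_other)
    return 100 * (length_score + numbers_score) / 2  # percentage between 0 and 100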


LEADING_NUMBERS = re.compile(
    r"^"     # anchor at start of string
    r"[0-9]" # Must be a number
    r"+"     # One or more matches
)

def _determine_number_of_numbers(string):
    """
    Determine how many LEADING numbers are in a string
    """
    match = LEADING_NUMBERS.search(string)
    if match is not None:
        length = len(match.group()) # Number of numbers is length of number match group
    else:
        length = 0  # No match means no numbers

    # You might want to check whether the numbers constitute a date within a certain range.
    # For example, take the first four digits and check whether the year is between 1980 and 2018.
    return length
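
The date check mentioned in the comments above could look roughly like the sketch below. It assumes the first 8 characters are laid out as YYYYMMDD (the question does not state the layout, but the example '20090607' fits it). The name leading_date_is_valid and the 1980 to 2018 defaults are only illustrative.

```python
import re
from datetime import datetime


def leading_date_is_valid(string, min_year=1980, max_year=2018):
    """Return True if the first 8 characters form a real YYYYMMDD date in range.

    The YYYYMMDD layout and the year bounds are assumptions, not requirements
    taken from the question.
    """
    prefix = string[:8]
    # Require exactly 8 ASCII digits before handing the prefix to strptime.
    if not re.fullmatch(r"[0-9]{8}", prefix):
        return False
    try:
        parsed = datetime.strptime(prefix, "%Y%m%d")
    except ValueError:
        # Digits that do not form a calendar date, e.g. month 13.
        return False
    return min_year <= parsed.year <= max_year
```

You could then decide in determine_similarity whether a string with an invalid date should count as having no leading date at all, or just lose part of its score.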
Neil

As JETM pointed out in the comments, https://pypi.org/project/python-Levenshtein/ might be a good resource for measuring the "closeness" of two strings, i.e. their edit distance (how many changes have to be made to one string to make it match the other).
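
To make the idea concrete without depending on that package, here is a minimal pure-Python sketch of Levenshtein distance, plus one possible way to turn it into a percentage. The names edit_distance and similarity_percent are made up for this example, and dividing by the longer length is just one common normalisation.

```python
def edit_distance(a, b):
    """Levenshtein distance, computed row by row with dynamic programming."""
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep if equal)
            ))
        previous = current
    return previous[-1]


def similarity_percent(a, b):
    """Scale the distance to 0-100 by the longer length (an assumed normalisation)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 100.0  # two empty strings are identical
    return 100.0 * (1 - edit_distance(a, b) / longest)
```

Note that this compares the actual characters, not the format: two strings that both follow your rules but carry different dates and suffixes would still score low.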

You could create your own implementation of "edit distance" that matches your custom rules, such as:

  • the first 8 characters are numeric and form a valid date
  • the total string length is 15 characters
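
A rough sketch of such a rule-based score is shown below. It scores a single string against the two rules rather than comparing two strings, and several things in it are assumptions: the YYYYMMDD date layout, the equal weighting of the two rules, the partial credit for lengths near 15, and the name format_score.

```python
import re
from datetime import datetime

EXPECTED_LENGTH = 15  # from the question: the string is 15 characters long
DATE_LENGTH = 8       # from the question: the first 8 characters are a date


def format_score(string):
    """Return a 0-100 score for how well `string` follows the two rules.

    The 50/50 weighting and the YYYYMMDD layout are illustrative assumptions.
    """
    # Rule 1: the first 8 characters are digits that form a valid date.
    prefix = string[:DATE_LENGTH]
    date_score = 0.0
    if re.fullmatch(r"[0-9]{8}", prefix):
        try:
            datetime.strptime(prefix, "%Y%m%d")
            date_score = 1.0
        except ValueError:
            date_score = 0.0  # digits, but not a calendar date

    # Rule 2: total length of 15, with partial credit that shrinks linearly
    # as the length moves away from 15.
    length_gap = abs(len(string) - EXPECTED_LENGTH)
    length_score = max(0.0, 1 - length_gap / EXPECTED_LENGTH)

    return 100 * (0.5 * date_score + 0.5 * length_score)
```

If you need a similarity between two strings, you could compare their individual scores, or compare them rule by rule; which of those makes sense depends on what "format similarity" should mean for you.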
jrsh