13

Let's say I have 2 strings

AAABBBCCCCC

and

AAAABBBBCCCC

to make these strings as similar as possible, given that I can only remove characters I should

  • delete the last C from the first string
  • delete the last A and the last B from the second string,

so that they become

AAABBBCCCC

What would be an efficient algorithm to find out which characters to remove from each string?

I'm currently crushing my brain cells thinking about a sollution involving substrings of the strings, looking for them i,n the other string.

svick
  • 236,525
  • 50
  • 385
  • 514
bigblind
  • 12,539
  • 14
  • 68
  • 123
  • Does the order of the characters to be removed matter? For example, do you have to know that it's the 4th A and last C that are to be removed, or do you just need know that there's one A and one C to be removed? – Nadh May 06 '12 at 10:59
  • If the order of characters to be removed don't matter, wouldn't sorting both strings, and subtracting the smaller one from the bigger work? – Nadh May 06 '12 at 11:00
  • the order doesn't matter within groups of the same group of the same characters, for example in the string `ÀABBAA` removing the first character would be the same as removing the second, but removing the first character is not the same as removing the last one. – bigblind May 06 '12 at 11:02
  • What is the expected end result, positions of characters to be removed(0th, 3rd etc.), or the extra characters themselves without order (A,A,C,A,C..)? – Nadh May 06 '12 at 11:05
  • rel: http://stackoverflow.com/questions/682367/good-python-modules-for-fuzzy-string-comparison – georg May 06 '12 at 11:22
  • 3
    For a general solution, you will have to find the [longest common subsequence](http://en.wikipedia.org/wiki/Longest_common_subsequence_problem), which is a nice problem to be solved with dynamic programming :) – Niklas B. May 06 '12 at 12:37
  • @sys.stderr, I earlier gave you +1 for an interesting question and have followed progress; it would be great to see how you coded it in the end. Put your solution in an edit to your question, for completeness? – daedalus May 07 '12 at 04:20

4 Answers4

15

Levenshtein distance can calculate how many changes you need to convert one string into another. A small change to the source, and you may get not only distance, but the conversions needed.

lenik
  • 23,228
  • 4
  • 34
  • 43
  • Levenshtein also allows modifications of single characters, which isn't allowed by OP's description. – Niklas B. May 06 '12 at 12:35
  • using levehshtein as base, you may introduce or remove any character modifications you like or don't like. – lenik May 06 '12 at 12:42
  • Yes, but then it might be that you don't find the optimal number of removals. – Niklas B. May 06 '12 at 12:43
  • 1
    levenstein deals with 1)char removed 2)char added 3)char replaced, first one means "remove character from the first string", second one means "remove characted from the second string", third one means "remove character from both strings", please, open a separate question if you need further clarification. – lenik May 06 '12 at 12:48
  • I just realized that the algorithm can be configured to assign different costs to the different operations. In our case we'd have to assign costs 2 to "char replaced". – Niklas B. May 06 '12 at 12:56
14

How about using difflib?

import difflib

s1 = 'AAABBBCCCCC'
s2 = 'AAAABBBBCCCC'

for difference in difflib.ndiff(s1, s2):
    print difference,
    if difference[0] == '+':
        print 'remove this char from s2'
    elif difference[0] == '-':
        print 'remove this char from s1'
    else:
        print 'no change here'

This will print out the differences between the two strings that you can then use to remove the differences. Here is the output:

  A no change here
  A no change here
  A no change here
+ A remove this char from s2
+ B remove this char from s2
  B no change here
  B no change here
  B no change here
  C no change here
  C no change here
  C no change here
  C no change here
- C remove this char from s1
daedalus
  • 10,873
  • 5
  • 50
  • 71
  • 1
    So using this the solution would be something like this `''.join(s[2] for s in difflib.ndiff(s1,s2) if s[0] == ' ')` – jamylak May 06 '12 at 11:14
  • 1
    +1 for concise reformulation. The longer version might be useful if the removal of the extra chars should trigger other actions as well. – daedalus May 06 '12 at 11:17
1

Don't know if it's the fastest, but as code goes, it is at least short:

import difflib
''.join([c[-1] for c in difflib.Differ().compare('AAABBBCCCCC','AAAABBBBCCCC') if c[0] == ' '])
deinonychusaur
  • 7,094
  • 3
  • 30
  • 44
0

I think regular expression can do this.It's a string distance problem. I mean. Let's have two string:

str1 = 'abc'
str2 = 'aabbcc'

first, I choose the short, and construct a regular expression like is:

regex = '(\w*)'+'(\w*)'.join(list(str1))+'(\w*)'

Then, we can search:

matches = re.search(regex,str2)

I use round brackets to group the section I am interested. these groups of matches.group() is the distance of two strings.Next, I can figure out what characters should be removed. It's my idea, I hope it can help you.

hakunami
  • 2,351
  • 4
  • 31
  • 50