
I want to know what methods Python offers for comparing strings like these:

'replaced the scanner for the user with a properly working one from the cage replaced the wire on the damaged one and stored it for later use'

and

'replaced scanner'

Suppose it has been agreed that the longer string should be replaced with the shorter one. I am looking for a method that can compare the longer string with the shorter one and recognize that they match.

I have tried

text = 'replaced the scanner for the user with a properly working one from the cage replaced the wire on the damaged one and stored it for later use'
if "replaced scanner" in text:
    print("Yes")

and

import pandas as pd
sr = pd.Series(['replaced the scanner for the user with a properly working one from the cage replaced the wire on the damaged one and stored it for later use'])
sr.str.contains('replaced scanner')

Neither of these approaches works the way I want. I need a method that I can apply consistently to other strings besides the example above. Any suggestions are appreciated.

Edit, since this is getting downvoted, to give more context: I am trying to cluster strings together using the difflib library. Yes, I have tried clustering, and that gets me nowhere fast. In certain cases there are strings, like the long one I posted, that contain a string from another cluster group. Ideally I would want the longer string to be bucketed with the shorter one, but since one is long and the other is short, they do not get a good matching ratio.

Therefore, what I am trying to do is find the cluster groups whose frequency count in the pandas column is below some threshold and compare them with the groups that have a larger frequency count. If a low-frequency string matches a string with a larger frequency count, I would bucket it with that string.

Hence, I am looking for a method that achieves this. I hope that makes sense. I can provide more context if it's unclear.

justanewb
  • `sr.str.contains('replaced.*?scanner')` since `str.contains` uses a regex – DeepSpace Mar 03 '21 at 18:51
  • We need you to specify the functionality you expect from this method. The example is nice, but it doesn't explain the range of functionality that you want. – Prune Mar 03 '21 at 18:52
  • Does this help? [How to find a similar substring inside a large string with a similarity score in python?](https://stackoverflow.com/questions/48117508/how-to-find-a-similar-substring-inside-a-large-string-with-a-similarity-score-in) – DarrylG Mar 03 '21 at 18:54
  • Have you tried `partial_ratio` from [fuzzywuzzy partial ratio](https://stackoverflow.com/questions/31806695/when-to-use-which-fuzz-function-to-compare-2-strings/31823872)? It tries to find the best match of a shorter string in a larger string. – DarrylG Mar 03 '21 at 19:10

1 Answer


Your comparison will never match. The only way that your

if 'replaced scanner' in text:
  print('Yes')

would work is if the full string actually contained that exact substring. Notice that the full string has 'replaced the scanner ...', with 'the' between the two words.

So the string would have to match exactly for this statement to work. If you do want to handle this example, you could use difflib to get a similarity ratio and use that ratio to decide whether to replace the string. See https://stackoverflow.com/a/17388505/8645056
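As a rough illustration of the difflib idea, here is a minimal sketch. It is not the code from the linked answer: the helper name `coverage` and the word-level comparison are my own assumptions. A plain `ratio()` between a short and a long string stays low because of the length difference, so the sketch instead measures how much of the short string difflib can align, in order, with the long one.

```python
from difflib import SequenceMatcher


def coverage(short: str, long: str) -> float:
    """Fraction of the words in `short` that difflib aligns, in order, with words in `long`.

    Hypothetical helper for illustration; not part of difflib.
    """
    short_words = short.lower().split()
    long_words = long.lower().split()
    if not short_words:
        return 0.0
    # Compare lists of words rather than characters, so 'the' between
    # 'replaced' and 'scanner' is simply skipped.
    matcher = SequenceMatcher(None, short_words, long_words, autojunk=False)
    # get_matching_blocks() ends with a zero-size sentinel, which adds nothing to the sum.
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(short_words)


text = (
    "replaced the scanner for the user with a properly working one "
    "from the cage replaced the wire on the damaged one and stored it for later use"
)

# Character-level ratio of the two whole strings: low, because the lengths differ so much.
plain_ratio = SequenceMatcher(None, "replaced scanner", text).ratio()
print(plain_ratio)

# Share of the short string's words found in the long string.
print(coverage("replaced scanner", text))  # 1.0
print(coverage("replaced printer", text))  # 0.5
```

You would then pick a cutoff yourself (say, bucket the long string with the short one when the coverage is above some value you choose). Keep in mind this only checks that the short string's words show up in order; it says nothing about what the rest of the long string means.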

LimeSlice