I want to know which methods in Python can compare strings like these:
'replaced the scanner for the user with a properly working one from the cage replaced the wire on the damaged one and stored it for later use'
and
'replaced scanner'
Suppose it is agreed that the longer string should be replaced with the shorter one. I am looking for a method that can match the longer string against the shorter one.
I have tried
text = 'replaced the scanner for the user with a properly working one from the cage replaced the wire on the damaged one and stored it for later use'
if "replaced scanner" in text:
    print("Yes")
and
import pandas as pd
sr = pd.Series(['replaced the scanner for the user with a properly working one from the cage replaced the wire on the damaged one and stored it for later use'])
sr.str.contains('replaced scanner')
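Neither of these approaches works the way I want: both test for the exact contiguous substring 'replaced scanner', which does not occur in the longer string because 'the' sits between the two words. I need a method that I can apply consistently to other strings besides the example above. Any suggestions are appreciated.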
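To make the goal concrete, here is a rough sketch of the kind of word-level check I have in mind. The helper name `contains_all_words` is made up for illustration, and it assumes that splitting on whitespace and ignoring case is good enough for my data:

```python
def contains_all_words(short: str, long: str) -> bool:
    """Return True if every word of `short` also occurs as a word in `long`."""
    long_words = set(long.lower().split())
    return all(word in long_words for word in short.lower().split())


text = ('replaced the scanner for the user with a properly working one '
        'from the cage replaced the wire on the damaged one and stored it for later use')

print("replaced scanner" in text)                    # False: not a contiguous substring
print(contains_all_words("replaced scanner", text))  # True: both words are present
```

This ignores word order and punctuation, so it is only a starting point, not necessarily the right answer.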
Edit, since this is getting downvoted, to add more context: I am trying to cluster strings together using the difflib library. Yes, I have tried clustering, and it gets me nowhere fast. In some cases a string like the long one I posted contains a string from another cluster group. Ideally the longer string would be bucketed with the shorter one, but because one is long and the other is short, their similarity ratio is low.
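A small sketch of the ratio problem, assuming `difflib.SequenceMatcher` is the comparison being used. Its documentation defines `ratio()` as 2.0*M/T, where M is the number of matching elements and T is the combined length of both sequences, so the score is capped by the length of the shorter string:

```python
from difflib import SequenceMatcher

long_text = ('replaced the scanner for the user with a properly working one '
             'from the cage replaced the wire on the damaged one and stored it for later use')
short_text = 'replaced scanner'

ratio = SequenceMatcher(None, short_text, long_text).ratio()

# M can never exceed the length of the shorter string, which bounds the ratio.
upper_bound = 2 * len(short_text) / (len(short_text) + len(long_text))

print(f"ratio={ratio:.2f}, upper bound={upper_bound:.2f}")
```

Even if every character of the short string matched, the score would stay near the upper bound, which is well below any threshold I would normally use for clustering.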
Therefore, what I am trying to do is take the cluster groups whose frequency count in the pandas column is below some threshold and compare them with the groups that have a larger frequency count. If a low-frequency string matches a high-frequency string, I would bucket it into that group.
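Roughly, the bucketing step I am describing could look like the sketch below. The function name `rebucket`, the `min_count` threshold, and the sample data are all hypothetical, and the "match" here is a simple word-subset test that I am assuming for illustration:

```python
import pandas as pd


def rebucket(labels: pd.Series, min_count: int) -> pd.Series:
    """Map each rare label to a frequent label whose words it fully contains."""
    counts = labels.value_counts()  # sorted by frequency, most frequent first
    frequent = counts[counts >= min_count].index.tolist()

    def assign(label: str) -> str:
        if counts[label] >= min_count:
            return label  # already a frequent group, keep as-is
        label_words = set(label.lower().split())
        for candidate in frequent:
            # Assumed matching rule: every word of the frequent label
            # also appears in the rare label.
            if set(candidate.lower().split()) <= label_words:
                return candidate
        return label  # no frequent group matched, leave unchanged

    return labels.map(assign)


long_text = ('replaced the scanner for the user with a properly working one '
             'from the cage replaced the wire on the damaged one and stored it for later use')
sr = pd.Series(['replaced scanner'] * 3 + [long_text, 'reset password'])

result = rebucket(sr, min_count=2)
print(result.tolist())
```

With this sample data the long string ends up under 'replaced scanner', while 'reset password' is left alone because no frequent label matches it.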
Hence, I am looking for a method that achieves this. I hope that makes sense. I can provide more context if it's unclear.