To preface, I am attempting to create my own compression method. I do not care about speed, so many iterations over large files are acceptable. Is there a way to find the most common substrings of length 2 or more (most likely 3, since anything longer would not be practical)? Ideally I would like to do this without splitting the string or building tables, just by searching the string. Thanks.
If you don't want any "optimizations", doesn't the question effectively become _"how to find the most common substring in a string"_, which already has hundreds of answers everywhere? – Abhinav Mathur Apr 15 '22 at 02:51
1 Answer
You probably want to use something like collections.Counter to associate each substring with a count. Slide a window of the desired length over the string and count every slice, e.g.:
>>> import collections
>>> data = "the quick brown fox jumps over the lazy dog"
>>> # a string of length L has L - n + 1 windows of length n
>>> c = collections.Counter(data[i:i+2] for i in range(len(data)-1))
>>> max(c, key=c.get)
'th'
>>> c = collections.Counter(data[i:i+3] for i in range(len(data)-2))
>>> max(c, key=c.get)
'the'
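
If you need more than the single best candidate (for a compressor you probably want the top few), the same sliding-window count can be wrapped in a small helper. This is only a sketch: the name `most_common_substrings` and the parameters `n` and `k` are made up for illustration, and it counts overlapping occurrences, which may or may not be what your compression scheme needs.

```python
from collections import Counter

def most_common_substrings(data, n, k=3):
    """Return the k most frequent length-n substrings of data with their counts.

    Hypothetical helper for illustration; works on str and bytes alike.
    """
    # There are len(data) - n + 1 start positions that yield a full window.
    # If n is longer than data, the range is empty and the result is [].
    counts = Counter(data[i:i + n] for i in range(len(data) - n + 1))
    # most_common sorts by count; ties keep first-seen order.
    return counts.most_common(k)

data = "the quick brown fox jumps over the lazy dog"
print(most_common_substrings(data, 2))       # [('th', 2), ('he', 2), ('e ', 2)]
print(most_common_substrings(data, 3, k=2))  # [('the', 2), ('he ', 2)]
```

Note that this does build a table of counts. Avoiding one entirely would mean rescanning the string for every candidate substring, which is much slower for no real benefit.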

Samwise