How would I go about finding the most common substring in a file

Question

To preface: I am trying to create my own compression method. I do not care about speed, so making many passes over large files is acceptable. Is there a way to find the most common substrings of length 2 or more (most likely 3, since anything longer would not be practical)? Ideally I would like to do this without splitting the data or building tables, just by searching the string. Thanks.

If you don't want any "optimizations", doesn't the question effectively become _"how to find the most common substring in a string"_? Which already has 100's of answers everywhere? — Abhinav Mathur, Apr 15 '22 at 02:51
Please provide enough code so others can better understand or reproduce the problem. — Community, Apr 15 '22 at 05:51

score 1 · Accepted Answer · answered Apr 15 '22 at 01:09

You probably want to use something like collections.Counter to associate each substring with a count, e.g.:

>>> import collections
>>> data = "the quick brown fox jumps over the lazy dog"
>>> # a string of length L has L - n + 1 substrings of length n
>>> c = collections.Counter(data[i:i+2] for i in range(len(data)-1))
>>> max(c, key=c.get)
'th'
>>> c = collections.Counter(data[i:i+3] for i in range(len(data)-2))
>>> max(c, key=c.get)
'the'
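
Since the question is about a file rather than a string literal, here is a minimal sketch of the same Counter idea wrapped in functions. The names most_common_ngrams and most_common_ngrams_in_file, and the top parameter, are made up for illustration. It assumes the whole file fits in memory and reads it as bytes, which is probably what you want for compression.

```python
from collections import Counter


def most_common_ngrams(data, n, top=3):
    """Return up to `top` (substring, count) pairs for length-n substrings.

    `data` may be a str or a bytes object; slices keep the same type.
    Overlapping occurrences are counted, so "aaaa" contains "aa" three times.
    """
    if n <= 0 or len(data) < n:
        return []
    # There are len(data) - n + 1 starting positions for a length-n window.
    counts = Counter(data[i:i + n] for i in range(len(data) - n + 1))
    return counts.most_common(top)


def most_common_ngrams_in_file(path, n, top=3):
    """Hypothetical helper: read the whole file as bytes, then count."""
    with open(path, "rb") as f:
        data = f.read()
    return most_common_ngrams(data, n, top)
```

For example, most_common_ngrams("abababc", 2, top=1) gives [('ab', 3)]. Note that the counts include overlapping matches, so for runs of a repeated character the count can be higher than the number of occurrences you could actually replace in one pass.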
