Efficiently Identifying the Longest Repeated Substring from a String in Python

Question

I am currently working with a dataset of email chains. Within a chain, every email shares a common footer, but the footer's content varies from chain to chain. The footers are long, and they repeat throughout the emails in a given chain. I am trying to remove these footers, but since they are not identical across all email chains, I cannot just use a static value for removal.

Therefore, I'm looking for a Pythonic way to programmatically identify the longest repeated substring (the footer, in my context) within each email chain (a single string). The solution should find the longest repeated substring without any prior knowledge of what this substring might be. I've read other questions on this topic, but most of the answers either didn't work on my data or used a brute-force method that takes too much computing power and time.

This is what I've tried so far:

Using a suffix tree (or suffix array) to find the longest repeated substring. This works, but it is very resource intensive on email chains as large as mine.
Implementing a pattern-searching algorithm for string matching. However, this seems overly complex and inefficient for large email chains.
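
For reference, the suffix-array idea from the first attempt can be sketched roughly as below. This is a minimal illustration, not the exact code that was run: the function name longest_repeated_substring is hypothetical, and the suffix array is built naively by sorting string slices, which is easy to read but copies every suffix and is likely the reason this approach becomes expensive on long inputs. Note that the two occurrences it finds are allowed to overlap.

```python
def longest_repeated_substring(text: str) -> str:
    """Return the longest substring that occurs at least twice in text.

    Hypothetical helper for illustration; occurrences may overlap.
    """
    n = len(text)
    if n < 2:
        return ""

    # Naive suffix array: sort the start positions by the suffix they begin.
    # Each sort key is a slice copy, so this can use O(n^2) memory.
    suffix_array = sorted(range(n), key=lambda i: text[i:])

    best_start, best_len = 0, 0
    # The longest repeated substring is the longest common prefix
    # of some pair of suffixes that are adjacent in sorted order.
    for a, b in zip(suffix_array, suffix_array[1:]):
        limit = n - max(a, b)
        k = 0
        while k < limit and text[a + k] == text[b + k]:
            k += 1
        if k > best_len:
            best_start, best_len = a, k

    return text[best_start:best_start + best_len]


if __name__ == "__main__":
    # Made-up sample data, not taken from the real email chains.
    chain = (
        "Hi Bob\nCONFIDENTIAL: do not forward.\n"
        "Hi Ann\nCONFIDENTIAL: do not forward.\n"
    )
    print(repr(longest_repeated_substring(chain)))
```

On a short made-up chain like the one in the usage example, the repeated disclaimer line is returned. The cost of building the suffix array this way grows quickly with input length, which matches the resource problem described above.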

What I'm looking for is an efficient way to identify the longest repeating substrings in a string. Any help, insights or directions to useful resources are greatly appreciated. Thank you!

Can you share what you've tried, some sample data, and an explanation of why you think it didn't work? — erip, Jul 20 '23 at 15:44
Can you provide an example of exactly what your data looks like and of the specific output you'd like for that example? I understand that you want the longest footer that appears at least twice within the input string, but it's hard to do without having an idea of what the input string looks like. — Xiidref, Jul 20 '23 at 15:51
If what you're looking for are footers, they're supposed to be at the end of each email, aren't they? If so, this problem seems simpler than the generic repeated substring problem. — Swifty, Jul 20 '23 at 15:56
I would like to share a sample of an email I process, but I'm doing this for a company and I'm not allowed to share any of their emails. Each input string is just a bunch of emails concatenated into one big string, and there isn't a reliable way to locate the footers by position. The footers in my case are about a paragraph long (it's more of a disclaimer text than an actual footer). — CaptainHaddock, Jul 20 '23 at 16:03


0 Answers