I am currently working with a dataset that contains a chain of emails. Every email in the chain has a common footer, but the footer's content can vary from chain to chain. The footers are long, and they repeat throughout the emails in a specific chain. I am trying to remove these footers, but since they are not identical across all email chains, I cannot just use a static value for removal.
Therefore, I'm looking for a Pythonic way to programmatically identify the longest repeated substrings (which are the footers in my context) within each email chain (which is a string). The solution should find the longest repeated substring without any prior knowledge of what this substring might be. I've read other questions on this same topic but most of them either didn't work or used a brute force method which takes a lot of computing power and time.
This is what I've tried so far:
- Using a Suffix Tree (or Suffix Array) to find the longest substring but it's very resource intensive when computing large email chains like what I'm doing here.
- Implementing a pattern searching algorithm for string matching. However, this seems overly complex and inefficient for large email chains.
What I'm looking for is an efficient way to identify the longest repeating substrings in a string. Any help, insights or directions to useful resources are greatly appreciated. Thank you!