1

I'm trying to create a script that looks through a list of strings files and reports on the sub-strings that are most common between them.

For example:

  1. Hello, I am string one. I like apples and oranges. We are all strings here.
  2. Hello, I am string two. I like apples and oranges. We are all strings here.
  3. Hello, I am string three. I like apples and oranges. We are all strings here.
  4. Hello, I am string four. I like apples and oranges. I like to express my individuality.

I'd like the script to tell me what are the common elements between the strings, above a certain threshold (for example, 5 characters).

Ideally I'd be told

  • "I like apples and oranges" occurs in all files
  • "Hello, I am string" occurs in all files
  • "We are all strings here" occurs in three of the files.

If functions exist to do this in technologies I'm familiar with - SQL, Javascript, PHP, Ruby or Bash -I'll be extremely happy...

Many thanks,

Jack

Jack Shepherd
  • 165
  • 4
  • 12
  • This question is closely related, and has many relevant answers: http://stackoverflow.com/questions/1410822/how-can-i-detect-common-substrings-in-a-list-of-strings – Anderson Green Jun 04 '13 at 08:00

1 Answers1

2

This is a hard problem known as the Longest common subsequence problem.

Here is a Python implementation of the algorithm using dynamic programming: http://www.algorithmist.com/index.php/Longest_Common_Subsequence

I don't think that any standard library (C, Java, PHP, Python, Javascript, Ruby, etc.) comes with such a function. But you may look for implementations here: http://www.google.com/codesearch?q=%22longest+common+subsequence%22

scoffey
  • 4,638
  • 1
  • 23
  • 27
  • Ah, thanks for that. Now I know the name I managed to find some pre-built implementations: http://en.wikibooks.org/wiki/Algorithm_implementation/Strings/Longest_common_substring#PHP – Jack Shepherd Jan 13 '11 at 17:30