9

Is there a way to extract a common pattern in a list of strings in Java?

For example, if we have a list of values:

001-L1
002-L2
003-L3
004-L4
...

Is there a way to deduce that we have 3digits, followed by '-', then a letter L and finally a numerical character?

I think it has something to do with common substrings or something like that but I haven't been able to find anything yet.

Thank you!

EDIT: Obviously it won't be a perfect recognition, it'll just return a recommendation based on the data.

What I'm trying to build is something close to this. In the video, when the user clicks on the column, there's a recommendation to split the data on ":".

skrtbhtngr
  • 2,223
  • 23
  • 29

1 Answers1

4

I think you may want to "deduce" the pattern that a set of strings might have in common, and not validate them using regex. This problem may belong to pattern recognition.

  • You can apply the Longest Common Substring (not Longest Common Subsequence) algorithm on any two of your strings, first. Note that according to your list of strings, you may get two longest common substrings 00 and -L, so you need to take care of it.
  • Then, when you get a common substring as a result, simply use the contains() method to check for the pattern in the other strings.

This method works well only when the common pattern between the strings is at least a few characters.

EDIT:

If you want to implement something like in the given video, you just need to split the strings based on a certain delimiter. An easy and naive approach:

  • Create a list of possible delimiters, like :,.,-,,,:: etc.
  • Search all your strings for the occurrence of a certain delimiter. The LCS algorithm would not work as the strings might have common data values (like "Yes" and "No" as in the video) which are not intended as a delimiter.
  • split the strings based on the delimiter, if it is found in all (or even most) of the strings!

There might be more optimal solutions than this one!

skrtbhtngr
  • 2,223
  • 23
  • 29
  • Yeah that's a way to tackle the problem. It's exactly pattern recognition that I'm looking for, but I haven't been able to find anything that can help me. Why does the method works best if the string is at least a few characters? For the example above, can't we get -L for all the strings (assuming all the values are constructed this way)? – Raphael Khoury Nov 23 '16 at 08:58
  • 1
    If your set of strings have just one character, like `-` as the common pattern, the LCS algorithm may have trouble finding it because `00` will be detected first. `-L` would work fine but you might have to give input to the algorithm two strings like `012-L4` and `001-L5` so that there is **only** one longest common substring (`-L`). – skrtbhtngr Nov 23 '16 at 09:03
  • Check my edit in the original post, it may show what I'm trying to do. And yes, of course the data is not restricted to the 4 values above, the list is quite longer and then we can find the -L pattern. – Raphael Khoury Nov 23 '16 at 09:05
  • It seems to be a [grammar induction](https://en.wikipedia.org/wiki/Grammar_induction) problem, since the goal is to find a pattern that is matched by a set of strings. – Anderson Green Jul 11 '22 at 22:46