Assuming I have a collection of around 5,000 product codes, how can I identify valid string patterns?
prod_codes = [
'03578180000',
'03573880000',
'03575350000',
'15459990000',
'15479850000',
'15481130000',
'15478930000',
'15479790000',
'15481150000',
'15479490000'
]
In this small example, there are 2 distinct styles of product codes: `154`s and `035(7)`s. I assume product codes end with `0000`, and all codes are length 11, but I want technology to tell me that with certainty using a larger sample.
Ultimately I want some sort of list of valid formats, like:

154\d{4}0000
0357\d{3}0000

The exact format above isn't important at all; I just want the insight so I can write documentation with confidence.
Answering these questions strictly from my own observations is easy to test but hard to do in the first place, and it relies on intuition. I can probe this dataset by running a bunch of `group_by`-type filters and other "one at a time, test this theory" methods.
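As an illustration of that manual style, a single `group_by` probe in Ruby might look like this (the 3-character prefix is just one guess to test, not a known rule):

```ruby
# One "test this theory" probe: group codes by their 3-character prefix.
prod_codes = %w[
  03578180000 03573880000 03575350000 15459990000 15479850000
  15481130000 15478930000 15479790000 15481150000 15479490000
]

by_prefix = prod_codes.group_by { |code| code[0, 3] }
by_prefix.each { |prefix, codes| puts "#{prefix}: #{codes.size} codes" }
# On this sample: "035" has 3 codes, "154" has 7.
```

Each theory needs its own probe like this, which is exactly the tedium I'd like to automate away.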
My generic attempt might be to collect a `Set` of the characters at each position across all of the data, and then analyze the `#size` of each position's set. A size of 1-2 tells me that digit is fixed in some way; a size of 5+ tells me it's likely `\d`. Using this info, I'd try to create a reasonable number of format strings.
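A minimal Ruby sketch of that per-position idea might look like the following; the size thresholds (1 for a literal, 4 for a character class) are guesses I'd tune on real data, not established values:

```ruby
require 'set'

prod_codes = %w[
  03578180000 03573880000 03575350000 15459990000 15479850000
  15481130000 15478930000 15479790000 15481150000 15479490000
]

# Collect the set of characters seen at each position.
max_len = prod_codes.map(&:length).max
position_sets = Array.new(max_len) { Set.new }
prod_codes.each do |code|
  code.chars.each_with_index { |ch, i| position_sets[i] << ch }
end

# Turn each position into a regex fragment: a literal for a fixed
# character, a character class for small sets, \d once the set looks "free".
pattern = position_sets.map do |set|
  chars = set.to_a.sort
  if chars.size == 1
    Regexp.escape(chars.first)
  elsif chars.size <= 4 # threshold is a guess; tune on a larger sample
    "[#{chars.join}]"
  else
    '\d'
  end
end.join

puts pattern
# On this sample: [01][35][45][578]\d\d[3589]0000
```

One obvious weakness: because it works column by column over the whole dataset, it merges the `035` and `154` families into a single loose pattern instead of emitting one format per family, which is why I suspect some clustering or pattern-mining step is needed first.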
What machine learning strategies might I search for to learn how to analyze this data in such a way?
Does this algorithm have a name? I feel like there is probably some machine learning strategy that can not only group my data per character position, but also find patterns like the `^154` kind of stuff and just spit out some ideas after eating my dataset.
I'd prefer an answer that I can leverage in Ruby or JS, but whatever you can offer would be helpful.