Assuming I have a collection of around 5,000 product codes, how can I identify valid string patterns?
prod_codes = [
'03578180000',
'03573880000',
'03575350000',
'15459990000',
'15479850000',
'15481130000',
'15478930000',
'15479790000',
'15481150000',
'15479490000'
]
In this small example, there are 2 distinct styles of product codes: `154`s and `035(7)`s. I assume product codes end with `0000`, and all codes are length 11, but I want technology to tell me that with certainty using a larger sample.
Ultimately I want some sort of list of valid formats, like:

154\d{4}0000
0357\d{3}0000

The exact format above isn't important at all; I just want the insight so I can write documentation with confidence.
Answering these questions strictly from my own observations is easy to test but hard to do in the first place, and it relies on intuition. I can probe this dataset by running a bunch of `group_by`-type filters and other "one at a time, test this theory" methods.
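As an illustration of that manual style, a single `group_by` probe in Ruby might look like this (the 3-character prefix is just one guess to test, not a known rule):

```ruby
# One "test this theory" probe: group codes by their 3-character prefix.
prod_codes = %w[
  03578180000 03573880000 03575350000 15459990000 15479850000
  15481130000 15478930000 15479790000 15481150000 15479490000
]

by_prefix = prod_codes.group_by { |code| code[0, 3] }
by_prefix.each { |prefix, codes| puts "#{prefix}: #{codes.size} codes" }
# On this sample: "035" has 3 codes, "154" has 7.
```

Each theory needs its own probe like this, which is exactly the tedium I'd like to automate away.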
My generic attempt might be to collect a `Set` of the characters at each position across all of the data, and then analyze the `#size` of each position's set. A size of 1-2 tells me that digit is fixed in some way; a size of 5+ tells me it's likely `\d`. Using this info, I'd try to create a reasonable number of format strings.
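A minimal Ruby sketch of that per-position idea might look like the following; the size thresholds (1 for a literal, 4 for a character class) are guesses I'd tune on real data, not established values:

```ruby
require 'set'

prod_codes = %w[
  03578180000 03573880000 03575350000 15459990000 15479850000
  15481130000 15478930000 15479790000 15481150000 15479490000
]

# Collect the set of characters seen at each position.
max_len = prod_codes.map(&:length).max
position_sets = Array.new(max_len) { Set.new }
prod_codes.each do |code|
  code.chars.each_with_index { |ch, i| position_sets[i] << ch }
end

# Turn each position into a regex fragment: a literal for a fixed
# character, a character class for small sets, \d once the set looks "free".
pattern = position_sets.map do |set|
  chars = set.to_a.sort
  if chars.size == 1
    Regexp.escape(chars.first)
  elsif chars.size <= 4 # threshold is a guess; tune on a larger sample
    "[#{chars.join}]"
  else
    '\d'
  end
end.join

puts pattern
# On this sample: [01][35][45][578]\d\d[3589]0000
```

One obvious weakness: because it works column by column over the whole dataset, it merges the `035` and `154` families into a single loose pattern instead of emitting one format per family, which is why I suspect some clustering or pattern-mining step is needed first.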
What machine learning strategies might I search for to learn how to analyze this data in such a way?
Does this algorithm have a name? I feel like there is probably some machine learning strategy that can not only group my data per character position, but also find patterns like the `^154` kind of stuff and just spit out some ideas after eating my dataset.
I'd prefer an answer that I can leverage in Ruby or JS, but whatever you can offer would be helpful.