I have a dataset of names. Based on the alphanumeric string in Name, I need to map each entry to a Subname as shown below.

Name            Subname
9-AIF-09        9A09
980-PD-Z09A     980P09
15-KIC-12       15K12
PIA-110H        P-110
IC009A          I009A

Rules could be defined, for example: if 'A' is present in the name, keep all digits and the letter 'A'; if 'P' is in the name, only 'P' is carried forward. However, the algorithm itself must identify the patterns behind the mapping.

Is there an algorithm I can use to identify patterns from a training dataset and then predict the Subname for new names?

spd
  • Very interesting question! Sadly, search engines are helpless to find whether someone already tackled this problem. They keep returning pages about pattern-matching, not about pattern inferring. – Stef Mar 29 '22 at 10:13
  • This is somewhat related: [Grammatical inference of regular expressions for given finite list of representative strings?](https://stackoverflow.com/questions/15512918/grammatical-inference-of-regular-expressions-for-given-finite-list-of-representa) – Stef Mar 29 '22 at 10:14

1 Answer

I see two options.

Getting 3 groups (before the first letter, the first letter, after the first letter) and removing all non-digits from groups 1 and 3:

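import re

# Split around the first letter that follows a non-letter prefix,
# then keep only the digits of the parts before and after it.
df['Subname'] = df['Name'].str.replace(r'([^a-zA-Z]+)([a-zA-Z])(.*)',
                                       lambda m: (re.sub(r'\D', '', m.group(1))
                                                  + m.group(2)
                                                  + re.sub(r'\D', '', m.group(3))),
                                       regex=True)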

Or, defining a pattern: optional non-digits / digits / optional non-alphanumerics / letter / optional non-digits / digits:

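# Capture the first digit run, the letter after it, and the next digit run,
# then join the three groups. This assumes every name matches the pattern.
df['Subname'] = (df['Name'].str.extract(r'\D*(\d+)[^\da-zA-Z]*([a-zA-Z])\D*(\d+)')
                           .agg(''.join, axis=1)
                 )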

Output:

          Name Subname
0     9-AIF-09    9A09
1  980-PD-Z09A  980P09
2    15-KIC-12   15K12
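
For checking the two patterns without pandas, here is a minimal sketch using only the standard `re` module. The function names `subname_by_groups` and `subname_by_pattern` are hypothetical, chosen for illustration; the regular expressions are the same as in the two options above. The second function returns `None` when a name does not match, which is an assumption about how non-matching names should be handled.

```python
import re

# Option 1: non-letter prefix, first letter after it, rest of the string.
FIRST_LETTER = re.compile(r'([^a-zA-Z]+)([a-zA-Z])(.*)')
# Option 2: digits, optional separators, one letter, then the next digit run.
DIGITS_LETTER_DIGITS = re.compile(r'\D*(\d+)[^\da-zA-Z]*([a-zA-Z])\D*(\d+)')


def subname_by_groups(name):
    """Keep the digits before the letter, the letter, and the digits after it."""
    def keep_digits(m):
        return (re.sub(r'\D', '', m.group(1))
                + m.group(2)
                + re.sub(r'\D', '', m.group(3)))
    # Only the matched part is rewritten; anything before the match stays as-is.
    return FIRST_LETTER.sub(keep_digits, name)


def subname_by_pattern(name):
    """Join the three captured groups, or return None if the name does not match."""
    m = DIGITS_LETTER_DIGITS.search(name)
    return ''.join(m.groups()) if m else None


for name in ['9-AIF-09', '980-PD-Z09A', '15-KIC-12']:
    print(name, subname_by_groups(name), subname_by_pattern(name))
```

Both functions reproduce the three rows shown in the output. Neither reproduces the last two rows of the question (`PIA-110H` and `IC009A`), so those would need additional rules.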
mozway