import pandas as pd

dataframe = pd.DataFrame({'Date': ['This 1A1619 person BL171111 the A-1-24',
                                   'dont Z112 but NOT 1-22-2001',
                                   'mix: 1A25629Q88 or A13B ok'],
                          'IDs': ['A11', 'B22', 'C33']})

           Date                                 IDs
0   This 1A1619 person BL171111 the A-1-24      A11
1   dont Z112 but NOT 1-22-2001                 B22
2   mix: 1A25629Q88 or A13B ok                  C33

I have the dataframe above. My goal is to replace every mixed letter/number combination WITHOUT hyphens (e.g. 1A1619I, BL171111 or A13B, but NOT 1-22-2001 or A-1-24) with the letter M. I tried the code below, adapted from the question "identify letter/number combinations using regex and storing in dictionary":

dataframe['MixedNum'] = dataframe['Date'].str.replace(r'(?=.*[a-zA-Z])(\S+\S+\S+)','M') 

But I get this output

                          Date              IDs     MixedNum
0   This 1A1619 person BL171111 the A-1-24  A11     M M M M M M M
1   dont Z112 but NOT 1-22-2001             B22     M M M M 1-22-2001
2   mix: 1A25629Q88 or A13B ok              C33     M M or M ok

when I would really want this output

                          Date              IDs     MixedNum
0   This 1A1619 person BL171111 the A-1-24  A11     This M person M the A-1-24 
1   dont Z112 but NOT 1-22-2001             B22     dont M but NOT 1-22-2001
2   mix: 1A25629Q88 or A13B ok              C33     mix: M or M ok

I also tried the regex suggested in "Regex replace mixed number+strings", but it didn't work for me either.

Can anyone help me alter my regex? r'(?=.*[a-zA-Z])(\S+\S+\S+

rafaelc

1 Answer

You may use

pat = r'(?<!\S)(?:[a-zA-Z]+\d|\d+[a-zA-Z])[a-zA-Z0-9]*(?!\S)'
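# regex=True is required on pandas 2.0+, where str.replace defaults to literal matching
dataframe['MixedNum'] = dataframe['Date'].str.replace(pat, 'M', regex=True)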

Output:

>>> dataframe
                                     Date  IDs                    MixedNum
0  This 1A1619 person BL171111 the A-1-24  A11  This M person M the A-1-24
1             dont Z112 but NOT 1-22-2001  B22    dont M but NOT 1-22-2001
2              mix: 1A25629Q88 or A13B ok  C33              mix: M or M ok

Pattern details

  • (?<!\S) - the current location must be immediately preceded by whitespace or the start of the string
  • (?:[a-zA-Z]+\d|\d+[a-zA-Z]) - either
    • [a-zA-Z]+\d - one or more letters followed by a digit
    • | - or
    • \d+[a-zA-Z] - one or more digits followed by a letter
  • [a-zA-Z0-9]* - zero or more digits or letters
  • (?!\S) - the current location must be immediately followed by whitespace or the end of the string
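
To see the pieces above working together outside pandas, here is a minimal sketch using only the standard `re` module on the question's sample strings. The helper name `mask_mixed` and the extra string with a trailing comma are my own additions for illustration, not part of the original code.

```python
import re

# Same pattern as above, compiled once for reuse.
PAT = re.compile(r'(?<!\S)(?:[a-zA-Z]+\d|\d+[a-zA-Z])[a-zA-Z0-9]*(?!\S)')

def mask_mixed(text):
    """Replace whitespace-delimited letter/digit mixes with 'M' (hypothetical helper)."""
    return PAT.sub('M', text)

samples = [
    'This 1A1619 person BL171111 the A-1-24',
    'dont Z112 but NOT 1-22-2001',
    'mix: 1A25629Q88 or A13B ok',
]

results = [mask_mixed(s) for s in samples]
for original, masked in zip(samples, results):
    print(f'{original!r} -> {masked!r}')

# Illustrative extra case: the whitespace boundaries mean a token that
# touches punctuation (here a trailing comma) is left untouched.
punctuated = mask_mixed('see Z112, ok')
print(repr(punctuated))
```

Because both boundaries are whitespace-based, hyphenated tokens such as A-1-24 never match, but neither do tokens glued to punctuation. If those should be replaced as well, the lookarounds would need to be loosened, which is a trade-off the original question did not ask for.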
Wiktor Stribiżew