I am trying to write a regex that matches columns in my dataframe. The dataframe has these columns:

    cols = ['after_1', 'after_2', 'after_3', 'after_4', 'after_5', 'after_6',
            'after_7', 'after_8', 'after_9', 'after_10', 'after_11', 'after_12',
            'after_13', 'after_14', 'after_15', 'after_16', 'after_17', 'after_18',
            'after_19', 'after_20', 'after_21', 'after_22', 'after_10_missing',
            'after_11_missing', 'after_12_missing', 'after_13_missing',
            'after_14_missing', 'after_15_missing', 'after_16_missing',
            'after_17_missing', 'after_18_missing', 'after_19_missing',
            'after_1_missing', 'after_20_missing', 'after_21_missing',
            'after_22_missing', 'after_2_missing', 'after_3_missing',
            'after_4_missing', 'after_5_missing', 'after_6_missing',
            'after_7_missing', 'after_8_missing', 'after_9_missing']

I want to select all the columns whose names contain a number in the range 1-14.

This code works:

    df.filter(regex = '^after_[1-9]$|after_([1-9]\D|1[0-4])').columns

but I'm wondering how to write it as a single expression instead of splitting it into two parts. The first part selects all strings that end in a single digit from 1 to 9 (i.e. 'after_1' ... 'after_9') but not their "missing" counterparts. The second part (after the |) selects any string that contains 'after_' followed either by a digit from 1 to 9 and then a non-digit character, or by a 1 and then a digit from 0 to 4.

Is there a better way to write this?

I already tried:

    df.filter(regex = 'after_([1-9]|1[0-4])').columns

But that also picks up strings whose number starts with a 1 or a 2 but lies outside the range (e.g. 'after_20').
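
The over-matching can be reproduced outside pandas with the standard re module. This is a minimal sketch: it assumes that df.filter(regex=...) keeps a column whenever re.search finds the pattern anywhere in the name, and the names attempted, samples, and matched are only illustrative.

```python
import re

# Pattern from the attempt above; nothing constrains what follows the number.
attempted = re.compile(r'after_([1-9]|1[0-4])')

samples = ['after_9', 'after_14', 'after_15', 'after_20', 'after_20_missing']

# re.search succeeds as soon as 'after_' plus a single digit 1-9 is found,
# so the leading '1' of '15' or '2' of '20' is already enough for a match.
matched = [name for name in samples if attempted.search(name)]
print(matched)

# The captured group shows which part of 'after_20' was actually matched.
print(attempted.search('after_20').group(1))
```

Every sample name is kept, which suggests the pattern needs something after the number (an anchor, a boundary, or a non-digit) to rule out longer numbers.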

m13op22

1 Answer

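Try this: after_([1-9]|1[0-4])[a-zA-Z_]*\b

The trailing \b (word boundary) rejects a number that continues with another digit. The example below tests a stricter variant that allows only an optional _missing suffix: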

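    import re

    # Raw string so that \b reaches the regex engine as a word boundary.
    regexp = r'(after_)([1-9]|1[0-4])(_missing)*\b'
    cols = ['after_1', 'after_14', 'after_15', 'after_14_missing', 'after_15_missing', 'after_9_missing']

    # findall returns one tuple of captured groups per match; an empty list means no match.
    for i in cols:
        print(i, re.findall(regexp, i))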
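
To check the suggested pattern against every column from the question, the sketch below rebuilds the column list and filters it with re.search, which is assumed here to mirror what df.filter(regex=...) does. The names pattern, selected, and numbers are illustrative, and the column list is generated rather than copied, so its order differs from the question.

```python
import re

# Rebuild the 44 column names from the question: after_1..after_22,
# plus a '_missing' counterpart for each of them.
cols = [f'after_{n}' for n in range(1, 23)]
cols += [f'after_{n}_missing' for n in range(1, 23)]

# First pattern suggested in this answer. The trailing \b stops a match such
# as 'after_2' inside 'after_20', because '2' and '0' are both word characters.
pattern = re.compile(r'after_([1-9]|1[0-4])[a-zA-Z_]*\b')

selected = [c for c in cols if pattern.search(c)]

# Numbers captured from the selected names, de-duplicated and sorted.
numbers = sorted({int(pattern.search(c).group(1)) for c in selected})
print(len(selected), numbers)
```

If the assumption about re.search holds, passing the same pattern string to df.filter(regex=...) should select the same columns.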

HakunaMaData
  • Thanks, this works and I can see how to generalize it to other situations where I need to find a range of numbers between words in a string. I thought that the function would match my pattern to the strings, but I needed to specify that the pattern could be followed by numeric characters. – m13op22 Jan 01 '19 at 02:53