
I have a fairly simple regex, but for some reason it's not capturing all the instances.

My dataframe looks like this (I'm including all 74 rows because I don't know where the problem occurs):

Name
P0824AK03.VAK03_TK02_QE_A122_M
P0824AK03.VAK03_TK02_QE_A122_M
P0824AK03.VAK03_TK02_QE_A122_M
P0824AK03.VAK03_TK02_QE_A122_M
P0824AK03.VAK03_TK02_QE_A122_M
P0824AK03.VAK03_TK02_QE_A122_M
P0824AK03.VAK03_TK02_QE_A122_M
P0824AK03.VAK03_TK02_QE_A122_M
P0824AK03.VAK03_TK02_QE_A122_M
P0824AK03.VAK03_TK02_QE_A122_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M

If I pass

In [57]: len(df['Name'])

I get

Out [57]: 74

I created a regex expression as follows:

p = re.compile('_[A-z][0-9][0-9][0-9]_')

I want to create a column where the snippet that looks a bit like '_A122_' or '_A100_' etc is the value. I want to use regex because I later want to apply this piece of code to a larger set where the snippet does not always appear at the same position.
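As a minimal check of what that pattern matches on a single string (note that `[A-z]` is broader than it looks: as an ASCII range it also matches the characters between `Z` and `a`, such as `_`, `[`, and `^`, so `[a-zA-Z]` is usually what is meant):

```python
import re

# The pattern as written; [A-z] is an ASCII range that also matches
# '[', '\\', ']', '^', '_' and '`', not just letters.
p = re.compile('_[A-z][0-9][0-9][0-9]_')

print(p.findall('P0824AK03.VAK03_TK02_QE_A122_M'))  # ['_A122_']
print(p.findall('foo__123_bar'))  # ['__123_'] -- the second '_' is matched by [A-z]
```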

When I use the following command, the result is a list of the form I was looking for:

In [55]: p.findall(str(df['Name']))
Out[55]: 
['_A100_',
 '_A122_',
 '_A100_',
 '_A100_',
 '_A122_',
 '_A100_',
 '_A100_',
 '_A100_',
 '_A122_',
 '_A100_',
 '_A100_',
 '_A100_',
 '_A100_',
 '_A100_',
 '_A100_',
 '_A122_',
 '_A100_',
 '_A100_',
 '_A100_',
 '_A122_',
 '_A100_',
 '_A100_',
 '_A100_',
 '_A100_',
 '_A100_',
 '_A100_',
 '_A100_',
 '_A100_',
 '_A100_',
 '_A100_',
 '_A100_',
 '_A100_',
 '_A100_',
 '_A100_',
 '_A100_',
 '_A100_',
 '_A100_',
 '_A100_',
 '_A100_',
 '_A100_',
 '_A100_',
 '_A100_',
 '_A100_',
 '_A100_',
 '_A100_',
 '_A100_',
 '_A122_',
 '_A100_',
 '_A122_',
 '_A100_',
 '_A100_',
 '_A100_',
 '_A122_',
 '_A100_',
 '_A100_',
 '_A122_',
 '_A100_',
 '_A100_',
 '_A100_',
 '_A122_']

The problem is, this list is "too short". Using len(p.findall(str(df['Name']))), I get 60 as the result. I cannot see which 14 rows it's missing!

I'm not used to regex expressions so maybe it's a super obvious mistake but I'd really appreciate any help.

(I guess I could do a for-loop and create the new column cell by cell, but I'd really rather avoid that since I will apply this code to bigger datasets later and don't want it to take a million years to run)

Nullman
J.D

1 Answer


IIUC you can use .str.extract() in order to extract a substring that matches your RegEx:

In [55]: df.Name.str.extract(r'(_[a-zA-Z]\d{3}_)', expand=False)
Out[55]:
0     _A122_
1     _A122_
2     _A122_
3     _A122_
4     _A122_
5     _A122_
6     _A122_
7     _A122_
8     _A122_
9     _A122_
       ...
64    _A100_
65    _A100_
66    _A100_
67    _A100_
68    _A100_
69    _A100_
70    _A100_
71    _A100_
72    _A100_
73    _A100_
Name: Name, dtype: object
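To store the extracted snippet as a new column, you can assign the result of `.str.extract()` directly (a minimal sketch with a toy two-row frame standing in for the original data; the column name `Code` is just an example):

```python
import pandas as pd

# Toy frame standing in for the original 74-row data
df = pd.DataFrame({'Name': ['P0824AK03.VAK03_TK02_QE_A122_M',
                            'P0824AK03.VAK03_TK02_QE_A100_M']})

# Vectorized extraction: one capture group + expand=False -> a Series,
# aligned row-by-row with the original column
df['Code'] = df['Name'].str.extract(r'(_[a-zA-Z]\d{3}_)', expand=False)
print(df['Code'].tolist())  # ['_A122_', '_A100_']
```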

PS: you should NOT use str(df['Name']), as the string representation of a Pandas Series is shortened (truncated) for display:

In [58]: pd.options.display.max_rows = 4

In [59]: df
Out[59]:
                              Name
0   P0824AK03.VAK03_TK02_QE_A122_M
1   P0824AK03.VAK03_TK02_QE_A122_M
..                             ...
72  P0824AK03.VAK03_TK02_QE_A100_M
73  P0824AK03.VAK03_TK02_QE_A100_M

[74 rows x 1 columns]

In [60]: str(df['Name'])
Out[60]: '0     P0824AK03.VAK03_TK02_QE_A122_M\n1     P0824AK03.VAK03_TK02_QE_A122_M\n                   ...              \n72    P0824AK03.VAK03_TK02_QE_A100_M\n73    P0824AK03.VAK03_TK02_QE_A100_M\nName: Name, dtype: object'
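This truncation is easy to reproduce: once a Series has more rows than the `display.max_rows` option, `str()` renders only a subset with an ellipsis in the middle, so `findall` over that string necessarily misses rows (a sketch with dummy data; the exact number of rendered rows depends on your pandas version and display settings):

```python
import re
import pandas as pd

p = re.compile(r'_[a-zA-Z]\d{3}_')
s = pd.Series(['P0824AK03.VAK03_TK02_QE_A100_M'] * 74, name='Name')

# str(s) is the *display* repr: rows beyond display.max_rows are
# collapsed into an ellipsis, so their matches go missing.
print(len(p.findall(str(s))))  # fewer than 74 with default display options

# The vectorized path sees every row.
print(len(s.str.extract(r'(_[a-zA-Z]\d{3}_)', expand=False)))  # 74
```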
MaxU - stand with Ukraine
  • It worked, thank you. Do you think the problem was my faulty use of str(df['Name'])? – J.D May 15 '17 at 12:07
  • @J.Dahlgren, you are welcome. Yes: usually we want to use Pandas/NumPy/SciPy vectorized functions when working with Pandas data frames. As I've shown in my answer, Pandas shortens `str(df['Name'])` in order NOT to flood your screen with data. Imagine a DF with a few billion rows: printing it would take too long, and `str(df['Name'])` could end up with a MemoryError... – MaxU - stand with Ukraine May 15 '17 at 12:11