
I have a fairly simple regex, but for some reason it's not capturing all the instances.

My dataframe looks like this (I'm including all 74 rows because I don't know where the problem occurs):

Name
P0824AK03.VAK03_TK02_QE_A122_M
P0824AK03.VAK03_TK02_QE_A122_M
P0824AK03.VAK03_TK02_QE_A122_M
P0824AK03.VAK03_TK02_QE_A122_M
P0824AK03.VAK03_TK02_QE_A122_M
P0824AK03.VAK03_TK02_QE_A122_M
P0824AK03.VAK03_TK02_QE_A122_M
P0824AK03.VAK03_TK02_QE_A122_M
P0824AK03.VAK03_TK02_QE_A122_M
P0824AK03.VAK03_TK02_QE_A122_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M

If I pass

In [57]: len(df['Name'])

I get

Out [57]: 74

I created a regex expression as follows:

p = re.compile('_[A-z][0-9][0-9][0-9]_')

I want to create a column where the snippet that looks a bit like '_A122_' or '_A100_' etc is the value. I want to use regex because I later want to apply this piece of code to a larger set where the snippet does not always appear at the same position.
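As a minimal check of what that pattern matches on a single string (note that `[A-z]` is broader than it looks: as an ASCII range it also matches the characters between `Z` and `a`, such as `_`, `[`, and `^`, so `[a-zA-Z]` is usually what is meant):

```python
import re

# The pattern as written; [A-z] is an ASCII range that also matches
# '[', '\\', ']', '^', '_' and '`', not just letters.
p = re.compile('_[A-z][0-9][0-9][0-9]_')

print(p.findall('P0824AK03.VAK03_TK02_QE_A122_M'))  # ['_A122_']
print(p.findall('foo__123_bar'))  # ['__123_'] -- the second '_' is matched by [A-z]
```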

When I use the following command, the result is a list of the form I was looking for:

In [55]: p.findall(str(df['Name']))
Out[55]: 
['_A100_',
 '_A122_',
 '_A100_',
 '_A100_',
 '_A122_',
 '_A100_',
 '_A100_',
 '_A100_',
 '_A122_',
 '_A100_',
 '_A100_',
 '_A100_',
 '_A100_',
 '_A100_',
 '_A100_',
 '_A122_',
 '_A100_',
 '_A100_',
 '_A100_',
 '_A122_',
 '_A100_',
 '_A100_',
 '_A100_',
 '_A100_',
 '_A100_',
 '_A100_',
 '_A100_',
 '_A100_',
 '_A100_',
 '_A100_',
 '_A100_',
 '_A100_',
 '_A100_',
 '_A100_',
 '_A100_',
 '_A100_',
 '_A100_',
 '_A100_',
 '_A100_',
 '_A100_',
 '_A100_',
 '_A100_',
 '_A100_',
 '_A100_',
 '_A100_',
 '_A100_',
 '_A122_',
 '_A100_',
 '_A122_',
 '_A100_',
 '_A100_',
 '_A100_',
 '_A122_',
 '_A100_',
 '_A100_',
 '_A122_',
 '_A100_',
 '_A100_',
 '_A100_',
 '_A122_']

The problem is, this list is "too short". Using len(p.findall(str(df['Name']))), I get 60 as the result. I cannot see which 14 rows it's missing!

I'm not used to regex expressions so maybe it's a super obvious mistake but I'd really appreciate any help.

(I guess I could do a for-loop and create the new column cell by cell, but I'd really rather avoid that since I will apply this code to bigger datasets later and don't want it to take a million years to run)

Nullman
J.D

1 Answer


IIUC you can use .str.extract() in order to extract a substring that matches your RegEx:

In [55]: df.Name.str.extract(r'(_[a-zA-Z]\d{3}_)', expand=False)
Out[55]:
0     _A122_
1     _A122_
2     _A122_
3     _A122_
4     _A122_
5     _A122_
6     _A122_
7     _A122_
8     _A122_
9     _A122_
       ...
64    _A100_
65    _A100_
66    _A100_
67    _A100_
68    _A100_
69    _A100_
70    _A100_
71    _A100_
72    _A100_
73    _A100_
Name: Name, dtype: object
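To store the extracted snippet as a new column, you can assign the result of `.str.extract()` directly (a minimal sketch with a toy two-row frame standing in for the original data; the column name `Code` is just an example):

```python
import pandas as pd

# Toy frame standing in for the original 74-row data
df = pd.DataFrame({'Name': ['P0824AK03.VAK03_TK02_QE_A122_M',
                            'P0824AK03.VAK03_TK02_QE_A100_M']})

# Vectorized extraction: one capture group + expand=False -> a Series,
# aligned row-by-row with the original column
df['Code'] = df['Name'].str.extract(r'(_[a-zA-Z]\d{3}_)', expand=False)
print(df['Code'].tolist())  # ['_A122_', '_A100_']
```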

PS: you should NOT use str(df['Name']), as the string representation of a Pandas Series is shortened (truncated) for display:

In [58]: pd.options.display.max_rows = 4

In [59]: df
Out[59]:
                              Name
0   P0824AK03.VAK03_TK02_QE_A122_M
1   P0824AK03.VAK03_TK02_QE_A122_M
..                             ...
72  P0824AK03.VAK03_TK02_QE_A100_M
73  P0824AK03.VAK03_TK02_QE_A100_M

[74 rows x 1 columns]

In [60]: str(df['Name'])
Out[60]: '0     P0824AK03.VAK03_TK02_QE_A122_M\n1     P0824AK03.VAK03_TK02_QE_A122_M\n                   ...              \n72    P0824AK03.VAK03_TK02_QE_A100_M\n73    P0824AK03.VAK03_TK02_QE_A100_M\nName: Name, dtype: object'
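This truncation is easy to reproduce: once a Series has more rows than the `display.max_rows` option, `str()` renders only a subset with an ellipsis in the middle, so `findall` over that string necessarily misses rows (a sketch with dummy data; the exact number of rendered rows depends on your pandas version and display settings):

```python
import re
import pandas as pd

p = re.compile(r'_[a-zA-Z]\d{3}_')
s = pd.Series(['P0824AK03.VAK03_TK02_QE_A100_M'] * 74, name='Name')

# str(s) is the *display* repr: rows beyond display.max_rows are
# collapsed into an ellipsis, so their matches go missing.
print(len(p.findall(str(s))))  # fewer than 74 with default display options

# The vectorized path sees every row.
print(len(s.str.extract(r'(_[a-zA-Z]\d{3}_)', expand=False)))  # 74
```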
MaxU - stand with Ukraine
  • It worked, thank you. Do you think the problem was my faulty use of str(df['Name'])? – J.D May 15 '17 at 12:07
  • @J.Dahlgren, you are welcome. Yes: usually we want to use Pandas/NumPy/SciPy vectorized functions when working with Pandas data frames. As I've shown in my answer, Pandas shortens `str(df['Name'])` in order NOT to flood your screen with data. Imagine a DF with a few billion rows: printing it would take too long, and `str(df['Name'])` could end up with a MemoryError... – MaxU - stand with Ukraine May 15 '17 at 12:11