I have a fairly simple regular expression, but for some reason it's not capturing all the instances.
My dataframe looks like this (I'm including all 74 rows because I don't know where the problem occurs):
Name
P0824AK03.VAK03_TK02_QE_A122_M
P0824AK03.VAK03_TK02_QE_A122_M
P0824AK03.VAK03_TK02_QE_A122_M
P0824AK03.VAK03_TK02_QE_A122_M
P0824AK03.VAK03_TK02_QE_A122_M
P0824AK03.VAK03_TK02_QE_A122_M
P0824AK03.VAK03_TK02_QE_A122_M
P0824AK03.VAK03_TK02_QE_A122_M
P0824AK03.VAK03_TK02_QE_A122_M
P0824AK03.VAK03_TK02_QE_A122_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
P0824AK03.VAK03_TK02_QE_A100_M
If I pass
In [57]: len(df['Name'])
I get
Out[57]: 74
I created a regular expression as follows:
p = re.compile('_[A-z][0-9][0-9][0-9]_')
I want to create a column whose value is the snippet that looks like '_A122_' or '_A100_'. I want to use a regular expression because I later want to apply this code to a larger dataset where the snippet does not always appear at the same position.
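For illustration, here is a minimal sketch of how such a column might be built row by row without an explicit loop, using pandas' `Series.str.extract`. The two-row frame and the column name `Snippet` are made up for the example, and the character class is written as `[A-Za-z]` on the assumption that only letters are wanted (the range `A-z` also covers the characters `[`, `\`, `]`, `^`, `_` and the backtick).

```python
import pandas as pd

# Hypothetical two-row stand-in for the real 'Name' column.
df = pd.DataFrame({"Name": [
    "P0824AK03.VAK03_TK02_QE_A122_M",
    "P0824AK03.VAK03_TK02_QE_A100_M",
]})

# One capture group around the whole snippet; [0-9]{3} is the same as
# writing [0-9][0-9][0-9].
pattern = r"(_[A-Za-z][0-9]{3}_)"

# expand=False returns a Series aligned with df's index; rows with no
# match would get NaN instead of being dropped.
df["Snippet"] = df["Name"].str.extract(pattern, expand=False)
```

Because the result keeps one entry per row, its length always equals `len(df)`, which makes missing matches easy to spot with `df["Snippet"].isna()`.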
When I use the following command, the result is a list of the form I was looking for:
In [55]: p.findall(str(df['Name']))
Out[55]:
['_A100_',
'_A122_',
'_A100_',
'_A100_',
'_A122_',
'_A100_',
'_A100_',
'_A100_',
'_A122_',
'_A100_',
'_A100_',
'_A100_',
'_A100_',
'_A100_',
'_A100_',
'_A122_',
'_A100_',
'_A100_',
'_A100_',
'_A122_',
'_A100_',
'_A100_',
'_A100_',
'_A100_',
'_A100_',
'_A100_',
'_A100_',
'_A100_',
'_A100_',
'_A100_',
'_A100_',
'_A100_',
'_A100_',
'_A100_',
'_A100_',
'_A100_',
'_A100_',
'_A100_',
'_A100_',
'_A100_',
'_A100_',
'_A100_',
'_A100_',
'_A100_',
'_A100_',
'_A100_',
'_A122_',
'_A100_',
'_A122_',
'_A100_',
'_A100_',
'_A100_',
'_A122_',
'_A100_',
'_A100_',
'_A122_',
'_A100_',
'_A100_',
'_A100_',
'_A122_']
The problem is that this list is too short: len(p.findall(str(df['Name']))) returns 60, not 74. I cannot see which 14 rows are missing!
I'm not used to regular expressions, so maybe it's a super obvious mistake, but I'd really appreciate any help.
(I guess I could use a for-loop and create the new column cell by cell, but I'd rather avoid that, since I will apply this code to bigger datasets later and don't want it to take a million years to run.)
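One likely explanation for the count of 60, offered as a hedged sketch rather than a confirmed diagnosis: `str(df['Name'])` returns the Series' display text, not its full contents, and pandas shortens that text once the Series is longer than the `display.max_rows` option (60 by default). The regex would then only see the rows that were printed. The Series below is a hypothetical reconstruction of the 74-row column; how many rows survive in the shortened text depends on the pandas version and display settings.

```python
import re
import pandas as pd

# Hypothetical 74-row column mirroring the one in the question.
names = pd.Series(
    ["P0824AK03.VAK03_TK02_QE_A122_M"] * 10
    + ["P0824AK03.VAK03_TK02_QE_A100_M"] * 64,
    name="Name",
)

p = re.compile(r"_[A-z][0-9][0-9][0-9]_")

# str(Series) is the display repr; with max_rows below the Series
# length, the middle rows are replaced by an ellipsis.
with pd.option_context("display.max_rows", 60):
    shown = len(p.findall(str(names)))

# Joining the actual values bypasses the repr, so no row is elided.
total = len(p.findall("\n".join(names)))

print(shown, total)
```

If this is indeed the cause, the pattern itself is not missing anything; matching against the real values (or using a per-row method such as `Series.str.extract`) should account for all 74 rows.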