How to use regex as an alternative to nested loop with re.search in Pandas?

Question

This is auxiliary to this post: Is there better way to iterate over nested loop for rows (30000)?

I would like to find rows in following table, which contain non dutch bank numbers. I create an extra column in the dataframe in the following code with value 'ja', when a bank number is not dutch, i.e. it does not contain characters 'NL'. Any better ways to loop over the rows?

import pandas as pd
import re

data={'IBAN':['NLAMRO123456789','NLINGB126656723','BGFFEO128856754','NLAMRO123896763','DUDMRO567456722','NLRABO123456712']}
df=pd.DataFrame(data)

df['No Dutch Bank']=None
substring='NL'
row_count=len(df.index)


for i in range (0, row_count):
    mainstring=df['IBAN'][i]
    if re.search(substring, mainstring) is None:
        df['No Dutch Bank'][i]='ja'
        
df.head()

So far I have reached to this expression after browsing through internet:

df['No Dutch Bank'] = list(map(lambda x: "None" if x else 'ja', (re.search('NL', x))))

Tranbi · Accepted Answer · 2022-10-26T09:35:13.603

1

A loop isn't necessary here. Neither is regex.

If you're only looking for IBAN (not) starting with "NL":

import pandas as pd

data={'IBAN':['NLAMRO123456789','NLINGB126656723','BGFFEO128856754','NLAMRO123896763','DUDMRO567456722','NLRABO123456712']}
df=pd.DataFrame(data)

df['No Dutch Bank'] = df['IBAN'].str[0:2] != 'NL'

Output:

              IBAN  No Dutch Bank
0  NLAMRO123456789          False
1  NLINGB126656723          False
2  BGFFEO128856754           True
3  NLAMRO123896763          False
4  DUDMRO567456722           True
5  NLRABO123456712          False

Note that a boolean column is much more convenient than string "ja" or "None" for filtering: use df[df['No Dutch Bank']] or df[~df['No Dutch Bank']]

If "NL" can be found anywhere in the string:

df['No Dutch Bank'] = df['IBAN'].str.find('NL') == -1

edited Oct 26 '22 at 09:35

answered Oct 26 '22 at 09:20

Tranbi

11,407
6
16
33

technically the str operation is a loop, but that's just nitpicking – sammywemmy Oct 26 '22 at 10:48
@sammywemmy is it though? should be much more efficient since it's described as [vectorized string function](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.html) – Tranbi Oct 26 '22 at 10:53
1

under the hood it just uses python string methods, and runs through every loop ... vectorized in this case means simplifed form. vectorized for numbers in pandas and numpy is pushed into C, so faster. Have a look at the internals to see how it is executed. Now if the strings were pyarrow strings, that is truly vectorized in the sense that the computation is pushed into C++ i think – sammywemmy Oct 26 '22 at 10:55
thanks for the tip over boolean column. – Rohit Nirmal Oct 26 '22 at 11:28

score 0 · Answer 2 · answered Oct 26 '22 at 11:32

0

I finally used the above answer by Tranbi in this format:

df.loc[df['IBAN'].str.find('NL') == -1, 'No Dutch Bank']='ja'

answered Oct 26 '22 at 11:32

Rohit Nirmal

47
1
8

How to use regex as an alternative to nested loop with re.search in Pandas?

2 Answers2