
I have a dataframe with two columns:

col1           col2
"aaa bbb"      some_regex_str1
"zzz aaa"      some_regex_str2
"sda343das"    some_regex_str3
...
"999 aaa dsd"  some_regex_strN

The length of the dataframe can be anywhere between 10^6 and 10^7 rows.

Currently I do:

df['output'] = df.apply(lambda row: re.search(row['col2'], row['col1']), axis=1)

It is slow.

What is a more efficient way to do it?

EDIT:

I have created a yo.py module with:

import re


def run_regex(x):
    return re.search(x['col2'], x['col1'])

In the main module I do:

from yo import run_regex

...

res = df.parallel_apply(run_regex, axis=1)

but I still get

AttributeError: Can't pickle local object 'prepare_worker.<locals>.closure.<locals>.wrapper'
Dariusz Krynicki

1 Answer


You can parallelize your apply manually or via pandarallel. Another option is to use a more efficient regex library such as hyperscan or re2.
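For the manual route, here is a minimal sketch using the standard library's multiprocessing module; the column names mirror the question, while the sample data, chunking scheme and worker count are illustrative choices:

import re
from multiprocessing import Pool

import pandas as pd


def match_text(row):
    # Return the matched text (or None) instead of the Match object,
    # since re.Match objects cannot be pickled back to the parent process.
    m = re.search(row['col2'], row['col1'])
    return m.group() if m else None


def search_chunk(chunk):
    # Plain row-wise apply inside one worker process.
    return chunk.apply(match_text, axis=1)


if __name__ == '__main__':
    df = pd.DataFrame({'col1': ['aaa bbb', 'zzz aaa'] * 50,
                       'col2': ['aaa', 'zzz'] * 50})

    n_workers = 4
    step = (len(df) + n_workers - 1) // n_workers   # ceil division
    chunks = [df.iloc[i:i + step] for i in range(0, len(df), step)]

    with Pool(processes=n_workers) as pool:
        # Results are Series that keep the original index, so assignment aligns.
        df['output'] = pd.concat(pool.map(search_chunk, chunks))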

If your regex string is just a plain string (i.e. your problem is to search for a substring in a string), you can use the Aho-Corasick algorithm. If you have a lot of duplicated values in col2, this will be the best solution.
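A minimal sketch of that idea with the pyahocorasick package, assuming col2 holds plain substrings rather than regexes (the sample data is made up):

import ahocorasick
import pandas as pd

df = pd.DataFrame({
    'col1': ['aaa bbb', 'zzz aaa', 'sda343das'],
    'col2': ['aaa', 'aaa', '343'],   # plain substrings, not regexes
})

# Build one automaton over the unique patterns.
automaton = ahocorasick.Automaton()
for pattern in df['col2'].unique():
    automaton.add_word(pattern, pattern)
automaton.make_automaton()


def has_match(row):
    # Scan col1 once for all known patterns, then check the row's own pattern.
    found = {value for _end, value in automaton.iter(row['col1'])}
    return row['col2'] in found


df['output'] = df.apply(has_match, axis=1)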

EDIT: I've added a pandarallel example:

import re

import pandas as pd
from pandarallel import pandarallel

pandarallel.initialize()


def f(x):
    # Pattern in column 'a', text in column 'b'; return the matched text.
    return re.search(x['a'], x['b']).group()


df = pd.DataFrame([
    {'a': '11', 'b': '11'}
] * 100)

# Same signature as DataFrame.apply; rows are processed in worker processes.
df.parallel_apply(f, axis=1)
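Note that f returns .group(), i.e. a plain string, rather than the Match object itself; match objects are not picklable, so returning plain values is generally safer with any multiprocess approach.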
Grigory Feldman