
I have a dataframe with two columns:

col1           col2
"aaa bbb"      some_regex_str1
"zzz aaa"      some_regex_str2
"sda343das"    some_regex_str3
...
"999 aaa dsd"  some_regex_strN

The length of the dataframe can be anywhere between 10^6 and 10^7 rows.

Currently I do:

df['output'] = df.apply(lambda row: re.search(row['col2'], row['col1']), axis=1)

It is slow.

What is a more efficient way to do it?

EDIT:

I have created a yo.py module with:

import re


def run_regex(x):
    return re.search(x['col2'], x['col1'])

In the main module I do:

from yo import run_regex

...

res = df.parallel_apply(run_regex, axis=1)

but I still get

AttributeError: Can't pickle local object 'prepare_worker.<locals>.closure.<locals>.wrapper'
Dariusz Krynicki

1 Answer


You can parallelize your apply manually or via pandarallel. Another option is to use a more efficient regex library such as hyperscan or re2.
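For the manual route, here is a minimal sketch using the standard library's multiprocessing module; the column names mirror the question, while the sample data, chunking scheme and worker count are illustrative choices:

import re
from multiprocessing import Pool

import pandas as pd


def match_text(row):
    # Return the matched text (or None) instead of the Match object,
    # since re.Match objects cannot be pickled back to the parent process.
    m = re.search(row['col2'], row['col1'])
    return m.group() if m else None


def search_chunk(chunk):
    # Plain row-wise apply inside one worker process.
    return chunk.apply(match_text, axis=1)


if __name__ == '__main__':
    df = pd.DataFrame({'col1': ['aaa bbb', 'zzz aaa'] * 50,
                       'col2': ['aaa', 'zzz'] * 50})

    n_workers = 4
    step = (len(df) + n_workers - 1) // n_workers   # ceil division
    chunks = [df.iloc[i:i + step] for i in range(0, len(df), step)]

    with Pool(processes=n_workers) as pool:
        # Results are Series that keep the original index, so assignment aligns.
        df['output'] = pd.concat(pool.map(search_chunk, chunks))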

If your regex string is just a plain string (i.e. your problem is to search for a substring in a string), you can use the Aho-Corasick algorithm. If you have a lot of duplicated values in col2, this will be the best solution.
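A minimal sketch of that idea with the pyahocorasick package, assuming col2 holds plain substrings rather than regexes (the sample data is made up):

import ahocorasick
import pandas as pd

df = pd.DataFrame({
    'col1': ['aaa bbb', 'zzz aaa', 'sda343das'],
    'col2': ['aaa', 'aaa', '343'],   # plain substrings, not regexes
})

# Build one automaton over the unique patterns.
automaton = ahocorasick.Automaton()
for pattern in df['col2'].unique():
    automaton.add_word(pattern, pattern)
automaton.make_automaton()


def has_match(row):
    # Scan col1 once for all known patterns, then check the row's own pattern.
    found = {value for _end, value in automaton.iter(row['col1'])}
    return row['col2'] in found


df['output'] = df.apply(has_match, axis=1)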

EDIT: I've added a pandarallel example:

import re

import pandas as pd
from pandarallel import pandarallel

pandarallel.initialize()


def f(x):
    # Pattern in column 'a', text in column 'b'; return the matched text.
    return re.search(x['a'], x['b']).group()


df = pd.DataFrame([
    {'a': '11', 'b': '11'}
] * 100)

# Same signature as DataFrame.apply; rows are processed in worker processes.
df.parallel_apply(f, axis=1)
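Note that f returns .group(), i.e. a plain string, rather than the Match object itself; match objects are not picklable, so returning plain values is generally safer with any multiprocess approach.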
Grigory Feldman