I got a Pandas dataframe which contains a column with pretty long strings (let's say URL_paths) and a list of unique substrings (reference list). For every row in my dataframe, I want to determine the corresponding reference element in my list. Hence, if the URL in a given row is for example abcd1234
, and one of the reference values is cd123
, then I want to add cd123
as reference to my dataframe, to categorize this row/URL.
I got my code working (see example below), but it's pretty slow due to a for loop (I guess) which I can't get rid off. I got the feeling that my code can be much faster, but can't think of a way to improve it.
How can I improve running time?
See working example below:
import string
import secrets
import pandas as pd
import time
from random import randint
n_ref = 100
n_target = 1000000
## Build reference Series, and target dataframe
reference = pd.Series(''.join(secrets.choice(string.ascii_uppercase + string.digits) for _ in range(randint(10, 19)))
for _ in range(n_ref))
target = pd.Series(reference.sample(n = n_target, replace = True)).reset_index().iloc[:,1]
dfTarget = pd.DataFrame({
'target' : target,
'pre-string' : pd.Series(''.join(secrets.choice(string.ascii_uppercase + string.digits)
for _ in range(randint(1, 10)))
for _ in range(n_target)),
'post-string' : pd.Series(''.join(secrets.choice(string.ascii_uppercase + string.digits)
for _ in range(randint(1, 10)))
for _ in range(n_target)),
'reference' : pd.Series()})
dfTarget['target_combined'] = dfTarget[['pre-string', 'target', 'post-string']].apply(lambda x: ''.join(x), axis=1)
## Fill in reference column
## Loop over references and return reference in reference column
start_time = time.time()
for x in reference:
dfTarget.loc[dfTarget['target_combined'].str.contains(x) == True, 'reference'] = x
print("--- %s seconds ---" % (time.time() - start_time))
Out: 42.60... seconds