I've checked these existing issues:
...but I don't yet grok this issue fully.
I'm trying to write a module that matches strings between a source and a target by progressively transforming them on both sides and checking for additional matches after each transformation. To keep track of repeated transform/match attempts, I'm using dataframes for the source, the target, and the matches.
Part of the process is to create source/target subsets of the items not yet matched, apply the transformations, and pull out any matches that result. My code looks like this:

    import pandas as pd

    def trymatch(transformers):
        global matches, source, target
        # Don't bother doing work if we've already found a match
        if matches is not None:
            s_ids = matches['id_s'].values
            s_inmask = (~source['id'].isin(s_ids))
            s = source.loc[s_inmask].copy()
            # ... do the same for the target dataframe
        else:
            s = source
            t = target
        for transformer in transformers:
            # Call the transformations here...
            pass
        mnew = pd.merge(s, t, on='matchval', suffixes=['_s', '_t'])
        if matches is None:
            matches = mnew
        else:
            # DataFrame.append was removed in pandas 2.0; pd.concat replaces it
            matches = pd.concat([matches, mnew])

    # ----------------------------------------------------------------------------------------------------------------------
    source = pd.DataFrame({'id': [1, 2, 3], 'value': ['a', 'b', 'c']})
    target = pd.DataFrame({'id': [4, 5, 6], 'value': ['A', 'b', 'd']})
    matches = None
    trymatch(['t_null'])
    trymatch(['t_upper'])

My challenge is in the trymatch function: when matches already exist, I create the subsets. Even with .loc indexing, pandas raises SettingWithCopyWarning when I then modify those subsets. I can get rid of the warnings with .copy(), as shown here, and I think this is valid because I only need temporary copies of the subsets inside this function.
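To isolate the part I'm asking about, here is a minimal sketch of just the subset-and-copy step. The `matched_ids` list is a hypothetical stand-in for `matches['id_s'].values`, and the upper-casing is only an example transformation. As far as I understand it, `.copy()` gives `s` its own data, so adding a column to `s` should not affect `source`.

```python
import pandas as pd

source = pd.DataFrame({'id': [1, 2, 3], 'value': ['a', 'b', 'c']})
matched_ids = [2]  # hypothetical: ids that already appear in matches['id_s']

# Select the not-yet-matched rows with a boolean mask, then take an explicit
# copy so that `s` is an independent dataframe rather than a slice of `source`.
s = source.loc[~source['id'].isin(matched_ids)].copy()

# Adding the match column modifies only the copy.
s['matchval'] = s['value'].str.upper()

print(s)
print(source.columns.tolist())  # `source` still has only 'id' and 'value'
```

The cost is one extra copy of the unmatched rows per call, which is the memory I'm wondering whether I can avoid.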
Does this seem valid? Could I instead suppress the warning by setting .is_copy = False and save memory?
Is there a more Pythonic way of approaching this problem that would side-step this issue entirely?
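One direction I've been considering, offered as a rough sketch rather than something I know to be the right approach: drop the globals, pass the dataframes in, and return the grown matches. The helper `unmatched`, the changed `trymatch` signature, and the use of plain callables as transformers (instead of the string names above) are all assumptions made for illustration.

```python
import pandas as pd


def unmatched(frame, matches, id_col):
    """Return an independent copy of the rows of `frame` not yet matched."""
    if matches is None:
        return frame.copy()
    return frame.loc[~frame['id'].isin(matches[id_col])].copy()


def trymatch(source, target, matches, transform):
    """Apply `transform` to unmatched values on both sides; return all matches."""
    s = unmatched(source, matches, 'id_s')
    t = unmatched(target, matches, 'id_t')
    # The subsets are copies, so these assignments never touch the inputs.
    s['matchval'] = s['value'].map(transform)
    t['matchval'] = t['value'].map(transform)
    new = pd.merge(s, t, on='matchval', suffixes=('_s', '_t'))
    if matches is None:
        return new
    return pd.concat([matches, new], ignore_index=True)


source = pd.DataFrame({'id': [1, 2, 3], 'value': ['a', 'b', 'c']})
target = pd.DataFrame({'id': [4, 5, 6], 'value': ['A', 'b', 'd']})

matches = None
# Hypothetical transformers: identity first, then upper-casing.
for transform in (lambda v: v, str.upper):
    matches = trymatch(source, target, matches, transform)

print(matches[['id_s', 'id_t', 'matchval']])
```

This still copies the unmatched rows on every pass, so it does not answer the memory question, but it keeps `source` and `target` untouched between passes.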