I've checked these existing issues:

stack overflow .loc example 1

stack overflow .loc example 2

...but I don't yet grok this issue fully.

I'm trying to write a module to match strings by progressively transforming them on a source and target, and checking for additional matches. To keep track of repeated transform/match attempts, I'm using dataframes for source, target, and the matches.

So part of the resolution is to create source/target subsets for items not-yet-matched, apply the transformations, and pull out any matches that result. So my code looks like this:

import pandas as pd

def trymatch(transformers):

    global matches, source, target

    # Don't bother doing work if we've already found a match
    if matches is not None:
        s_ids = matches['id_s'].values
        s_inmask = (~source['id'].isin(s_ids))
        s = source.loc[s_inmask].copy()
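        # Same filtering for the target dataframe
        t_ids = matches['id_t'].values
        t_inmask = (~target['id'].isin(t_ids))
        t = target.loc[t_inmask].copy()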
    else:
        s = source
        t = target

    for transformer in transformers:
        # Call the transformations here...
        pass  # placeholder so the elided loop body is valid Python

    mnew = pd.merge(s, t, on='matchval', suffixes=['_s', '_t'])

    # DataFrame.append was removed in pandas 2.0; pd.concat replaces it
    if matches is None: matches = mnew
    else: matches = pd.concat([matches, mnew])

# ----------------------------------------------------------------------------------------------------------------------

source = pd.DataFrame({'id': [1, 2, 3], 'value': ['a', 'b', 'c']})
target = pd.DataFrame({'id': [4, 5, 6], 'value': ['A', 'b', 'd']})

matches = None
trymatch(['t_null'])
trymatch(['t_upper'])

My challenge comes with the trymatch function: when matches already exist, I create the subsets. Even with .loc indexing, pandas raises SettingWithCopyWarning. I can get rid of it with the .copy() shown here. I think this is valid because I only need temporary copies of the subsets within this function.

Does this seem valid? Could I just suppress with .is_copy = False and save memory?

Is there a more Pythonic way of approaching this problem that would side-step this issue entirely?

richarddb

1 Answer

What you've written is valid. pandas raises SettingWithCopyWarning in cases like this one because it relies on NumPy array semantics, which for efficiency may return views of the data rather than copies. pandas can't always tell whether a later assignment will land on a view or on a copy, so it (conservatively) issues the warning in both the harmless cases and the harmful ones.
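To make the view-versus-copy point concrete, here is a minimal sketch reusing the question's source frame. The matched_ids list and the upper-casing step are stand-ins I've assumed for the real matches and transformer. The explicit .copy() tells pandas the subset is independent, so adding a column to it never touches the original.

```python
import pandas as pd

source = pd.DataFrame({'id': [1, 2, 3], 'value': ['a', 'b', 'c']})
matched_ids = [2]  # hypothetical: ids that already have a match

# Explicit copy: the subset owns its data, so writing a new column
# into it is unambiguous and does not affect `source`.
unmatched = source.loc[~source['id'].isin(matched_ids)].copy()
unmatched['matchval'] = unmatched['value'].str.upper()  # assumed transformer

print(list(source.columns))  # `source` has no 'matchval' column
print(unmatched)
```

The cost is one extra copy of the unmatched rows, which is usually small next to the merge that follows.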

You can silence the warning using:

pd.options.mode.chained_assignment = None  # default='warn'

For more details see How to deal with SettingWithCopyWarning in Pandas?
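On the question of side-stepping the issue entirely: one option, sketched below, is to avoid in-place assignment on the subset and build the key column with DataFrame.assign, which returns a new frame. The helper name unmatched_with_key, the hard-coded id lists, and the use of str.upper as the transformer are assumptions for illustration, not part of the original code.

```python
import pandas as pd

source = pd.DataFrame({'id': [1, 2, 3], 'value': ['a', 'b', 'c']})
target = pd.DataFrame({'id': [4, 5, 6], 'value': ['A', 'b', 'd']})


def unmatched_with_key(frame, matched_ids, transform):
    """Return the not-yet-matched rows with a derived 'matchval' column.

    Hypothetical helper: `transform` is any function from str to str.
    """
    remaining = frame.loc[~frame['id'].isin(matched_ids)]
    # assign() builds a new DataFrame instead of writing into `remaining`,
    # so there is no assignment on a possible view.
    return remaining.assign(matchval=remaining['value'].map(transform))


# Assume ids 2 and 5 were matched in an earlier pass ('b' == 'b').
s = unmatched_with_key(source, [2], str.upper)
t = unmatched_with_key(target, [5], str.upper)
mnew = pd.merge(s, t, on='matchval', suffixes=('_s', '_t'))
print(mnew)
```

Because nothing is written into source or target, the function also no longer needs to modify the global frames.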

Aleksey Bilogur