
Could you please advise how the following lines should be rewritten, based on http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy ?

  1. df.drop('PACKETS', axis=1, inplace=True)

produces

/home/app/ip-spotlight/code/app/ipacc/plugin/ix.py:74: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  df.drop('PACKETS', axis=1, inplace=True)

  2. df.replace(numpy.nan, "", inplace=True)

produces

/home/app/ip-spotlight/code/app/ipacc/plugin/ix.py:68: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  df.replace(numpy.nan, "", inplace=True)

On the other hand, here is an example of a line that was rewritten following the above principle:

df.loc[:, ('SRC_PREFIX')]   = df[ ['SRC_NET', 'SRC_MASK'] ].apply(lambda x: "/".join(x), axis=1)

But I am unable to figure out how to rewrite cases 1 and 2.

EDIT: the code so far looks like this (df is the dataframe of interest). Initially there is some kind of casting:

df = pandas.DataFrame(data['payload'], columns=sorted(data['header'], key=data['header'].get))
df = df.astype({
    'SRC_AS'                : "object",
    'DST_AS'                : "object",
    'COMMS'                 : "object",
    'SRC_COMMS'             : "object",
    'AS_PATH'               : "object",
    'SRC_AS_PATH'           : "object",
    'PREF'                  : "object",
    'SRC_PREF'              : "object",
    'MED'                   : "object",
    'SRC_MED'               : "object",
    'PEER_SRC_AS'           : "object",
    'PEER_DST_AS'           : "object",
    'PEER_SRC_IP'           : "object",
    'PEER_DST_IP'           : "object",
    'IN_IFACE'              : "object",
    'OUT_IFACE'             : "object",
    'SRC_NET'               : "object",
    'DST_NET'               : "object",
    'SRC_MASK'              : "object",
    'DST_MASK'              : "object",
    'PROTOCOL'              : "object",
    'TOS'                   : "object",
    'SAMPLING_RATE'         : "uint64",
    'EXPORT_PROTO_VERSION'  : "object",
    'PACKETS'               : "object",
    'BYTES'                 : "uint64",
})

Then the calculate function of a module is called:

mod.calculate(data['identifier'], data['timestamp'], df)

And the calculate function is defined like this:

def calculate(identifier, timestamp, df):
    try:
        #   Filter based on AORTA IX.
        lut_ipaddr = lookup_ipaddr()
        df = df[ (df.PEER_SRC_IP.isin( lut_ipaddr )) ]
        if df.shape[0] > 0:
            logger.info('analyzing message `{}`'.format(identifier))
            #   Preparing for input.
            df.replace("", numpy.nan, inplace=True)
            #   Data wrangling. Calculate traffic rate. Reduce.
            df.loc[:, ('BPS')]          = 8*df['BYTES']*df['SAMPLING_RATE']/300
            df.drop(columns=['SAMPLING_RATE', 'EXPORT_PROTO_VERSION', 'PACKETS', 'BYTES'], inplace=True)
            #   Data wrangling. Formulate prefixes using CIDR notation. Reduce.
            df.loc[:, ('SRC_PREFIX')]   = df[ ['SRC_NET', 'SRC_MASK'] ].apply(lambda x: "/".join(x), axis=1)
            df.loc[:, ('DST_PREFIX')]   = df[ ['DST_NET', 'DST_MASK'] ].apply(lambda x: "/".join(x), axis=1)
            df.drop(columns=['SRC_NET', 'SRC_MASK', 'DST_NET' ,'DST_MASK'], inplace=True)
            #   Populate using lookup tables.
            df.loc[:, ('NETELEMENT')]   = df['PEER_SRC_IP'].apply(lookup_netelement)
            df.loc[:, ('IN_IFNAME')]    = df.apply(lambda x: lookup_iface(x['NETELEMENT'], x['IN_IFACE']), axis=1)
            df.loc[:, ('OUT_IFNAME')]   = df.apply(lambda x: lookup_iface(x['NETELEMENT'], x['OUT_IFACE']), axis=1)
            # df.loc[:, ('SRC_ASNAME')]   = df.apply(lambda x: lookup_asn(x['SRC_AS']), axis=1)
            #   Add a timestamp.
            df.loc[:, ('METERED_ON')]   = arrow.get(timestamp, "YYYYMMDDHHmm").format("YYYY-MM-DD HH:mm:ss")
            #   Preparing for input.
            df.replace(numpy.nan, "", inplace=True)
            #   Finalize !
            return identifier, timestamp, df.to_dict(orient="records")
        else:
            logger.info('going through message `{}` no IX bgp/netflow data were found'.format(identifier))
    except Exception as e:
        logger.error('processing message `{}` at `{}` caused `{}`'.format(identifier,timestamp,repr(e)), exc_info=True)
    return identifier, timestamp, None
nskalis
  • Probably this can help you: https://www.dataquest.io/blog/settingwithcopywarning/ – Georgy Nov 09 '17 at 17:07
  • @Georgy thanks, but if I am not mistaken, the cases of `drop` and `replace`, which involve no assignment, are not covered. – nskalis Nov 09 '17 at 17:12
  • How do you define df? If df is a slice of another dataframe, then you can use `copy` to stop this warning. `df = df_full.loc[somefilters].copy()` – Scott Boston Nov 09 '17 at 17:30
  • @ScottBoston `drop` and `replace` need a label – nskalis Nov 09 '17 at 17:36
  • @iamsterdam I don't understand. need a label? – Scott Boston Nov 09 '17 at 17:41
  • @ScottBoston I assume that you are suggesting something like `df.loc[:, ('PACKETS')].drop()`, maybe? – nskalis Nov 09 '17 at 17:45
  • @iamsterdam No... I am asking about when you first create df. How do you do it? If df is created from a slice of a bigger dataframe and you then try to do assignments on this slice, you have a chance of getting this warning. To prevent it, create df with .copy(). Then try your replace and drop statements as is. – Scott Boston Nov 09 '17 at 17:47
  • @ScottBoston ah no, the dataframe is created from a .csv file – nskalis Nov 09 '17 at 18:16
  • @iamsterdam Have you managed to resolve the issue? If you read the dataframe from csv and right after that you drop the column, then you shouldn't get the warning. Can you provide all the code up to the moment where you are dropping the column? – Georgy Nov 11 '17 at 11:26
  • Thanks @Georgy. I edited the original post above to include this info. Unfortunately not yet, I didn't manage to. – nskalis Nov 11 '17 at 20:22

1 Answer


Ok. I don't really know what is going on under the hood of pandas. But still, I've tried to come up with some minimal examples to show you where the problem can be and what you can do about it. First, creating dataframe:

import numpy as np
import pandas as pd
df = pd.DataFrame(dict(x=[0, 1, 2],
                       y=[0, 0, 5]))

Then, as you pass your dataframe to a function, I will do the same but for 2 almost identical functions:

def func(dfx):
    # Analog of your df = df[df.PEER_SRC_IP.isin(lut_ipaddr)]
    dfx = dfx[dfx['x'] > 1.5]
    # Analog of your df.replace("", numpy.nan, inplace=True)
    dfx.replace(5, np.nan, inplace=True)


def func_with_copy(dfx):
    dfx = dfx[dfx['x'] > 1.5].copy()  # explicitly making a copy
    dfx.replace(5, np.nan, inplace=True)

Now let's call them for initial df:

func_with_copy(df)
print(df)

gives

   x  y
0  0  0
1  1  0
2  2  5

and no warning. And calling this:

func(df)
print(df)

gives the same output:

   x  y
0  0  0
1  1  0
2  2  5

but with the warning:

/usr/local/lib/python3.6/site-packages/ipykernel_launcher.py:6: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy

So this looks like a 'false positive'. Here is a good comment on false positives: link

The strange thing here is that if you do exactly the same manipulations on your dataframe without passing it to a function, you won't see this warning. ¯\_(ツ)_/¯

My advice is to use .copy()
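
Applied to something shaped like your calculate function, that advice could look roughly like the sketch below. The sample data and the KNOWN_PEERS set are made up for illustration (they stand in for your payload and for lookup_ipaddr(), which I can't see), and only a few of your columns are kept. The second function is an alternative I'd expect to work as well: it skips inplace=True and rebinds the result instead, so nothing is ever written into the filtered slice.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for lookup_ipaddr(); the real lookup is not shown in the question.
KNOWN_PEERS = {"10.0.0.1"}

# Made-up sample data using a few of the column names from the question.
raw = pd.DataFrame({
    "PEER_SRC_IP": ["10.0.0.1", "10.0.0.2", "10.0.0.1"],
    "COMMS": ["", "65000:1", "65000:2"],
    "PACKETS": [1, 2, 3],
    "BYTES": [300, 600, 900],
    "SAMPLING_RATE": [1000, 1000, 1000],
})


def calculate_with_copy(df):
    # Filter, then copy explicitly so the in-place calls below act on
    # an independent frame rather than on a slice of the caller's frame.
    df = df[df["PEER_SRC_IP"].isin(KNOWN_PEERS)].copy()
    df.replace("", np.nan, inplace=True)
    df["BPS"] = 8 * df["BYTES"] * df["SAMPLING_RATE"] / 300
    df.drop(columns=["SAMPLING_RATE", "PACKETS", "BYTES"], inplace=True)
    df.replace(np.nan, "", inplace=True)
    return df


def calculate_without_inplace(df):
    # Same steps, but every call returns a new frame that is rebound to df.
    df = df[df["PEER_SRC_IP"].isin(KNOWN_PEERS)]
    df = df.replace("", np.nan)
    df = df.assign(BPS=8 * df["BYTES"] * df["SAMPLING_RATE"] / 300)
    df = df.drop(columns=["SAMPLING_RATE", "PACKETS", "BYTES"])
    df = df.replace(np.nan, "")
    return df


out_copy = calculate_with_copy(raw)
out_plain = calculate_without_inplace(raw)
print(out_copy)
print(out_plain)
```

In both variants the frame passed in by the caller is left untouched, which is probably what you want anyway, since calculate only returns the records.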

Georgy