4

Let's say I have a dataframe with two datetime columns and I want to analyze the difference between them:

import pandas as pd

csv = [
         ['2019-08-03 00:00:00', '2019-08-01 15:00:00', 4],
         ['2019-08-03 00:00:00', '2019-08-01 10:00:00', 6],
         ['2019-08-03 00:00:00', '2019-08-01 16:00:00', 8],
         ['2019-08-04 00:00:00', '2019-08-02 19:00:00', 3],
         ['2019-08-04 00:00:00', '2019-08-02 13:00:00', 4],
         ['2019-08-04 00:00:00', '2019-08-02 11:00:00', 5]
]

df = pd.DataFrame(csv, columns=['delivery_date', 'dispatch_date', 'order_size'])
df['delivery_date'] = pd.to_datetime(df['delivery_date'])
df['dispatch_date'] = pd.to_datetime(df['dispatch_date'])
df['transit_time'] = (df['delivery_date']-df['dispatch_date'])
df = df.set_index(['delivery_date','transit_time'])

OK, so now we have something like this:

                                    dispatch_date  order_size
delivery_date transit_time                                   
2019-08-03    1 days 09:00:00 2019-08-01 15:00:00           4
              1 days 14:00:00 2019-08-01 10:00:00           6
              1 days 08:00:00 2019-08-01 16:00:00           8
2019-08-04    1 days 05:00:00 2019-08-02 19:00:00           3
              1 days 11:00:00 2019-08-02 13:00:00           4
              1 days 13:00:00 2019-08-02 11:00:00           5

Let's say, for example, that for each delivery date I want to know which delivery was the fastest (shortest transit time). I want to save the result to a new dataframe with all the columns from the original dataframe. So I iterate like this:

delivery_dates = df.index.get_level_values(0).unique()
df_ouput = pd.DataFrame()

for date in delivery_dates:    
    df_analyzed = df.loc[(date, )].sort_index()
    df_result = df_analyzed.iloc[[df_analyzed.index.get_loc(0, method='nearest')]]    
    df_result.loc[:,'delivery_date'] = date
    df_ouput = df_ouput.append(df_result)

df_ouput = df_ouput.reset_index().set_index(['delivery_date'])

And the result is correct:

                 transit_time       dispatch_date  order_size
delivery_date                                                
2019-08-03    1 days 08:00:00 2019-08-01 16:00:00           8
2019-08-04    1 days 05:00:00 2019-08-02 19:00:00           3

But I get the warning:

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy

And I don't know why, because I am already using the ".loc" method for the assignment:

df_result.loc[:,'delivery_date'] = date

But I can't get rid of the warning, so I came up with this odd workaround:

delivery_dates = df.index.get_level_values(0).unique()
df_ouput = pd.DataFrame()

for date in delivery_dates:    
    df_analyzed = df.loc[(date, )].sort_index()
    df_result = df_analyzed.iloc[[df_analyzed.index.get_loc(0, method='nearest')]]    
    df_result_2 = df_result.copy()
    df_result_2.loc[:,'delivery_date'] = date
    df_ouput = df_ouput.append(df_result_2)

df_ouput = df_ouput.reset_index().set_index(['delivery_date'])

If I make a copy, then no warning is displayed. But why? Is there a better way to do what I want?

eliteA92
  • https://stackoverflow.com/questions/20625582/how-to-deal-with-settingwithcopywarning-in-pandas – PV8 Aug 14 '19 at 07:51

2 Answers

3

Your solution should be changed by adding copy after the iloc filtering:

delivery_dates = df.index.get_level_values(0).unique()
df_ouput = pd.DataFrame()

for date in delivery_dates:    
    df_analyzed = df.loc[date].sort_index()
    df_result = df_analyzed.iloc[[df_analyzed.index.get_loc(0, method='nearest')]].copy()    
    df_result['delivery_date'] = date
    df_ouput = df_ouput.append(df_result)

df_ouput = df_ouput.reset_index().set_index(['delivery_date'])
print (df_ouput)
                 transit_time       dispatch_date  order_size
delivery_date                                                
2019-08-03    1 days 08:00:00 2019-08-01 16:00:00           8
2019-08-04    1 days 05:00:00 2019-08-02 19:00:00           3
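
A side note: in recent pandas versions (2.0 and later, if I remember the deprecation timeline correctly) DataFrame.append has been removed and Index.get_loc no longer accepts a method= argument, so the loop above needs small adjustments to run today. A rough equivalent, collecting the pieces in a list and concatenating at the end (a sketch, not tested against every pandas version):

delivery_dates = df.index.get_level_values(0).unique()
results = []

for date in delivery_dates:
    df_analyzed = df.loc[date].sort_index()
    # transit times are all positive, so the row "nearest" to a zero timedelta
    # is simply the one with the smallest transit_time
    df_result = df_analyzed.iloc[[df_analyzed.index.argmin()]].copy()
    df_result['delivery_date'] = date
    results.append(df_result)

df_ouput = pd.concat(results).reset_index().set_index('delivery_date')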

A better solution uses a custom function in GroupBy.apply:

def f(x):
    x = x.sort_index(level=1)
    s = x.iloc[[x.index.get_level_values(1).get_loc(0, method='nearest')]]
    return s

df = df.groupby(level=0).apply(f).reset_index(level=0, drop=True)
print (df)
                                    dispatch_date  order_size
delivery_date transit_time                                   
2019-08-03    1 days 08:00:00 2019-08-01 16:00:00           8
2019-08-04    1 days 05:00:00 2019-08-02 19:00:00           3

Or:

def f(x):
    x = x.sort_index(level=1)
    s = x.iloc[[x.index.get_level_values(1).get_loc(0, method='nearest')]]
    return s

df = df.groupby(level=0, group_keys=False).apply(f)
print (df)
                                    dispatch_date  order_size
delivery_date transit_time                                   
2019-08-03    1 days 08:00:00 2019-08-01 16:00:00           8
2019-08-04    1 days 05:00:00 2019-08-02 19:00:00           3

If I understand correctly, this also works, because after sort_index the first row within each delivery_date has the shortest transit_time:

df = df.sort_index()
df = df[~df.index.get_level_values(0).duplicated()]
print (df)
                                    dispatch_date  order_size
delivery_date transit_time                                   
2019-08-03    1 days 08:00:00 2019-08-01 16:00:00           8
2019-08-04    1 days 05:00:00 2019-08-02 19:00:00           3
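
For completeness, the same "keep the first row per delivery date after sorting" idea can also be written with GroupBy.head, which keeps the original MultiIndex (a sketch; df_first is just an illustrative name):

df_first = df.sort_index().groupby(level=0).head(1)

This should give the same two rows as above.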
jezrael
  • Nice solution using the `def f(x)` approach. Thank you. However, I don't understand why `GroupBy.apply()` duplicates the first index so that you then have to do the `reset_index`. As for my approach, I still don't get why you need to use the `.copy()`. From my understanding, `df_analyzed.iloc[]` is already returning a new dataframe which you are storing as `df_result`. So why would you copy it? – eliteA92 Aug 14 '19 at 10:18
  • @eliteA92 - because it returns a new DataFrame for each group; I added a new solution with `group_keys=False` to avoid it. – jezrael Aug 14 '19 at 10:22
  • I understand, thanks. As for my approach, I still don't get why you need to use the `.copy()`. From my understanding, the `df_analyzed.iloc[]` is already returning a new dataframe which you are storing as `df_result`. So why would you copy it? – eliteA92 Aug 14 '19 at 10:37
  • @eliteA92 - I don't think so; maybe the best explanation is [here](https://stackoverflow.com/a/53954986/2901002) – jezrael Aug 14 '19 at 10:39
1

Very simple code to get the requested output

The way the OP approached the problem was overly complicated. Setting up indices just makes everything more difficult in this situation.

First, we create the dataframe. There is no need to set an index.

import pandas as pd

csv = [
         ['2019-08-03 00:00:00', '2019-08-01 15:00:00', 4],
         ['2019-08-03 00:00:00', '2019-08-01 10:00:00', 6],
         ['2019-08-03 00:00:00', '2019-08-01 16:00:00', 8],
         ['2019-08-04 00:00:00', '2019-08-02 19:00:00', 3],
         ['2019-08-04 00:00:00', '2019-08-02 13:00:00', 4],
         ['2019-08-04 00:00:00', '2019-08-02 11:00:00', 5]
]

df = pd.DataFrame(csv, columns=['delivery_date', 'dispatch_date', 'order_size'])
df['delivery_date'] = pd.to_datetime(df['delivery_date'])
df['dispatch_date'] = pd.to_datetime(df['dispatch_date'])
df['transit_time'] = (df['delivery_date']-df['dispatch_date'])
# df = df.set_index(['delivery_date','transit_time']) # <------- Just remove this!

After that, the needed operations are:

df_ouput = df.loc[df.groupby('delivery_date').transit_time.idxmin()] # <--- Only these operations are needed!

# The next line is just in case you want to set delivery_date as your index
df_ouput = df_ouput.reset_index(drop=True).set_index('delivery_date')

print(df_ouput)
                    dispatch_date  order_size    transit_time
delivery_date                                                
2019-08-03    2019-08-01 16:00:00           8 1 days 08:00:00
2019-08-04    2019-08-02 19:00:00           3 1 days 05:00:00

The groupby method allows splitting the rows of a DataFrame based on some criteria; in this case, rows are grouped together when they have the same delivery_date value. Then idxmin, which is a method of GroupBy objects, gives the index of the minimum value on a given axis (refer to the documentation of idxmin here). I took this idea from this answer to the question Pandas GroupBy and select rows with the minimum value in a specific column, which is exactly the same type of problem.
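
To make the mechanics concrete, for the sample data built above the intermediate result of idxmin should look roughly like this (row labels 2 and 3 hold the smallest transit_time for each delivery date):

print(df.groupby('delivery_date').transit_time.idxmin())
delivery_date
2019-08-03    2
2019-08-04    3
Name: transit_time, dtype: int64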

Note that we used loc when slicing df and provided numerical values for the index.

df_output = df.loc[df.groupby('delivery_date').transit_time.idxmin()]

This is valid because the index of df consists of integer labels in our case, since we let pandas create the default index.
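
As a quick check (a sketch using the labels found above), passing those labels to loc directly selects the same rows that the groupby/idxmin expression returns:

# the labels returned by idxmin are ordinary index labels,
# so .loc picks out the corresponding rows
print(df.loc[[2, 3]])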


The reason SettingWithCopyWarning is raised

I will build up the argument step by step. I dissected the part of the code that produces the warning, and the offending line is:

for date in delivery_dates:    
    df_analyzed = df.loc[(date, )].sort_index()
    df_result = df_analyzed.iloc[[df_analyzed.index.get_loc(0, method='nearest')]]    
    df_result.loc[:,'delivery_date'] = date # <------- THIS IS THE LINE!
    df_ouput = df_ouput.append(df_result)

I looked for other methods to create a new column in a pandas DataFrame: check here.

When using DataFrame.assign() like this:

from datetime import timedelta  # needed for timedelta(0, 0, 0) used below

delivery_dates = df.index.get_level_values(0).unique()
df_ouput = pd.DataFrame()

for date in delivery_dates:    
    df_analyzed = df.loc[(date, )].sort_index()
    df_result = df_analyzed.iloc[[df_analyzed.index.get_loc(timedelta(0, 0, 0), method='nearest')]]
    # Assigning the value using the assign method doesn't raise the warning.
    df_result = df_result.assign(delivery_date = [date])
    df_ouput = df_ouput.append(df_result)

df_ouput = df_ouput.reset_index().set_index(['delivery_date'])

I get no warning, because assign returns a new DataFrame with the added column instead of writing into the existing slice.

Quoting this answer from the question How to deal with SettingWithCopyWarning in Pandas.

The SettingWithCopyWarning was created to flag potentially confusing "chained" assignments

The explanation for why the warning is raised is, then, that the code involves a potentially confusing "chained" assignment. More specifically, the chain comes from the two slicing operations in:

    # Most likely the potentially confusing "chained" assignments come from
    # these two lines
    df_result = df_analyzed.iloc[[df_analyzed.index.get_loc(0, method='nearest')]]    
    df_result.loc[:,'delivery_date'] = date

Most likely, the lines that cause the warning are creating a copy of the object and then assigning a value to that copy. This could potentially leave the data you actually want to change unmodified.
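
A minimal reproduction of the same pattern, outside of this dataset (a sketch; whether the warning actually fires can depend on the pandas version and its copy/view heuristics):

import pandas as pd

d = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

sub = d.iloc[[0]]            # slice taken from d; pandas flags it as a possible copy
sub.loc[:, 'c'] = 99         # may raise SettingWithCopyWarning

sub2 = d.iloc[[0]].copy()    # explicit copy breaks the "chain"
sub2.loc[:, 'c'] = 99        # no warning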


Last thoughts

Determining why these are potentially confusing "chained" assignments is usually difficult. For example, in this other answer to How to deal with SettingWithCopyWarning in Pandas, it says:

The problem with chained assignment, is that it is generally difficult to predict whether a view or a copy is returned, so this largely becomes an issue when you are attempting to assign values back.

We should always use the least ambiguous methods when assigning values in pandas, in order to avoid errors or this warning showing up.