Why does this code work in one scenario while it won't in another?

Question

I'm manually filling up some categorical labels in a column which is is_holiday. There are rows that have only the first observation with the correct label, the rest are 0 and I'm filling these zeros:

# Before filling
print(main_data.loc["25 December, 2012"].is_holiday)

date_time
2012-12-25 00:00:00    3
2012-12-25 01:00:00    0
2012-12-25 02:00:00    0
2012-12-25 03:00:00    0
.
.
.
2012-12-25 19:00:00    0
2012-12-25 20:00:00    0
2012-12-25 21:00:00    0
2012-12-25 22:00:00    0
2012-12-25 23:00:00    0
Name: is_holiday, dtype: int64

# Manually fill
for row in range(len(main_data.loc["2012"])):
    if main_data.Month[row] == 12 and main_data.Day[row] == 25:
        if main_data.is_holiday[row] == 0:
            main_data.is_holiday[row] = 3  # 3 is label for Xmas

# After filling
print(main_data.loc["25 December, 2012"].is_holiday)
2012-12-25 00:00:00    3
2012-12-25 01:00:00    3
2012-12-25 02:00:00    3
2012-12-25 03:00:00    3
.
.
.
2012-12-25 19:00:00    3
2012-12-25 20:00:00    3
2012-12-25 21:00:00    3
2012-12-25 22:00:00    3
2012-12-25 23:00:00    3
Name: is_holiday, dtype: int64

However, if I execute the same code for another year, say 2013, the values remain the same. I tried modifying the code including minor changes but the holiday labels won't change:

# Before filling
print(main_data.loc["25 December, 2013"].is_holiday)

date_time
2013-12-25 00:00:00    3
2013-12-25 01:00:00    0
2013-12-25 02:00:00    0
2013-12-25 03:00:00    0
.
.
.
2013-12-25 19:00:00    0
2013-12-25 20:00:00    0
2013-12-25 21:00:00    0
2013-12-25 22:00:00    0
2013-12-25 23:00:00    0
Name: is_holiday, dtype: int64

# Manually fill
for row_2 in range(len(main_data.loc["2013"])):
    if main_data.Month[row_2] == 12 and main_data.Day[row_2] == 25:
        if main_data.is_holiday[row_2] == 0:
            main_data.is_holiday[row_2] = 3  # 3 is label for Xmas

# After filling
print(main_data.loc["25 December, 2013"].is_holiday)
2013-12-25 00:00:00    3
2013-12-25 01:00:00    0
2013-12-25 02:00:00    0
2013-12-25 03:00:00    0
.
.
.
2013-12-25 19:00:00    0
2013-12-25 20:00:00    0
2013-12-25 21:00:00    0
2013-12-25 22:00:00    0
2013-12-25 23:00:00    0
Name: is_holiday, dtype: int64

What am I missing here? I did not expect this to happen. It works as expected for some holidays while it won't for others. Please do keep in mind that Month and Day are separate columns that I have engineered in main_data.

Edit: I'm welcome to better methods for achieving this.

Use [DataFrame.where](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.where.html) or [np.where](https://docs.scipy.org/doc/numpy/reference/generated/numpy.where.html) to update values based on conditions. — SmileyProd, Aug 21 '19 at 12:29
The index doesn't have a column name, I don't get it. What's your index? — ifly6, Aug 21 '19 at 12:32
@ifly6 The name itself is `date_time` and includes timestamps from 2012-10-02 to 2017-05-17. — shiv_90, Aug 21 '19 at 12:34
I'm not sure how you use `.loc` then, because afaik it doesn't match on such data. `np.where(cond, if_true, if_false)` in which all are series with the same length. The `cond` is the same format as what you would use for a subsetting command to get what you wnat. — ifly6, Aug 21 '19 at 12:38
Well `.loc ` worked for New Year 2013. Before I set the condition in `[]` without using `.loc` and it worked the same way as `.loc`. So I Only made the code more readable by inserting `.loc`. But I'll try using `.where`. Thanks! — shiv_90, Aug 21 '19 at 12:41

ilmiacs · Answer 1 · 2019-08-21T13:23:13.973

2

Pandas has some nice builtin functions for this kind of task. So you could just do

import numpy as np
main_data.is_holiday.replace(0, np.nan, inplace=True)
main_data.is_holiday.fillna(method='ffill', inplace=True)

should do it.

EDIT:

To apply the fill to the holidays only:

main_data['is_holiday'] = main_data.is_holiday.resample('D').max()
main_data.is_holiday.fillna(method='ffill', inplace=True)

The first line replaces the is_holiday column with the maximum value for that day for the midnight timestamp, which is assumed to be present. The second line fills the other rows, which should by now have a NaN.

edited Aug 21 '19 at 13:23

answered Aug 21 '19 at 12:43

ilmiacs

2,566
15
20

But for days that had no holidays the label remains 0. Won't your code disturb those correct zeros? – shiv_90 Aug 21 '19 at 12:49
Ah, I see. True. Hold on. – ilmiacs Aug 21 '19 at 12:50
Wow. That is good. Can you explain more in detail the first line of code? What does resample with 'D' do and how does lambda work here? – shiv_90 Aug 21 '19 at 13:00
Unfortunately I get this error: `IndexingError: Unalignable boolean Series provided as indexer (index of the boolean Series and of the indexed object do not match).` – shiv_90 Aug 21 '19 at 13:03
Ok, I do not know what is up. I will try once more. – ilmiacs Aug 21 '19 at 13:19
Thats fine. The logic is workable but only getting index error. Thanks for answering. – shiv_90 Aug 21 '19 at 13:22

score 2 · Accepted Answer · answered Aug 21 '19 at 13:07

assuming the date is the index i would simply do:

df.loc[(df.index.day == 25) & (df.index.month ==12),['is_holiday']] = 3

full code with example:

#init data
df = pd.DataFrame(          \    
[['2012-12-25 19:00:00',0,0], \
['2012-12-25 20:00:00',0,0],  \
['2012-12-26 19:00:00',0,0],  \
['2012-12-26 20:00:00',0,0],  \
['2013-01-25 19:00:00',0,0],  \
['2013-01-25 20:00:00',0,0],  \
['2013-12-25 19:00:00',0,0],  \
['2013-12-25 20:00:00',0,0]])

#set index
df.columns = ['date','is_holiday','some_value']
df['date'] = pd.to_datetime(df['date'])
df = df.set_index('date')

#do the search
df.loc[(df.index.day == 25) & (df.index.month ==12),['is_holiday']] = 3
print(df)

result:

                     is_holiday  value
date                                  
2012-12-25 19:00:00  3           0    
2012-12-25 20:00:00  3           0    
2012-12-26 19:00:00  0           0    
2012-12-26 20:00:00  0           0    
2013-01-25 19:00:00  0           0    
2013-01-25 20:00:00  0           0    
2013-12-25 19:00:00  3           0    
2013-12-25 20:00:00  3           0

some useful links:

https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#time-date-components

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.loc.html

Select rows from a DataFrame based on values in a column in pandas

Thanks for the solution. Worked lightning fast. – shiv_90 Aug 21 '19 at 13:12 — shiv_90, Aug 21 '19 at 13:12

score 1 · Answer 3 · answered Aug 21 '19 at 12:44

What you are actually doing

for row in range(len(main_data.loc["2012"]))
#let's says len(main_data.loc["2012"]) = 100

You select line 0 to 100 and replace 0 by 3.

for row_2 in range(len(main_data.loc["2013"]))
#let's says len(main_data.loc["2013"]) = 50

You select line 0 to 50 and replace 0 by 3.

You are doing this in you'r main data set so both code will replace first line of the data set.

You can solve it using something like (pseudo code):

for row in np.where(years=2013,month==12)

But am I not being specific in mentioning the years? I thought specifying years should run the operations separately and not for the whole dataframe as such. — shiv_90, Aug 21 '19 at 12:56

SmileyProd · Answer 4 · 2019-08-21T13:08:08.447

As I suggested in the comment you can use DataFrame.where or np.where to do the assignments but I think the best in your case is DataFrame.where:

mask = (main_data.date_time.dt.year == 2012) & (main_data.Month == 12) & (main_data.Day == 25)
main_data[mask].is_holiday = main_data[mask]\
                                    .is_holiday\
                                    .where(~(main_data[mask].is_holiday == 0)), 3)

numpy.where

main_data[mask].is_holiday = np.where(main_data[mask].is_holiday == 0, 3,\
                                             main_data[mask].is_holiday)

Why does this code work in one scenario while it won't in another?

4 Answers4