2

I'm manually filling up some categorical labels in a column which is is_holiday. There are rows that have only the first observation with the correct label, the rest are 0 and I'm filling these zeros:

# Before filling
print(main_data.loc["25 December, 2012"].is_holiday)

date_time
2012-12-25 00:00:00    3
2012-12-25 01:00:00    0
2012-12-25 02:00:00    0
2012-12-25 03:00:00    0
.
.
.
2012-12-25 19:00:00    0
2012-12-25 20:00:00    0
2012-12-25 21:00:00    0
2012-12-25 22:00:00    0
2012-12-25 23:00:00    0
Name: is_holiday, dtype: int64

# Manually fill
for row in range(len(main_data.loc["2012"])):
    if main_data.Month[row] == 12 and main_data.Day[row] == 25:
        if main_data.is_holiday[row] == 0:
            main_data.is_holiday[row] = 3  # 3 is label for Xmas

# After filling
print(main_data.loc["25 December, 2012"].is_holiday)
2012-12-25 00:00:00    3
2012-12-25 01:00:00    3
2012-12-25 02:00:00    3
2012-12-25 03:00:00    3
.
.
.
2012-12-25 19:00:00    3
2012-12-25 20:00:00    3
2012-12-25 21:00:00    3
2012-12-25 22:00:00    3
2012-12-25 23:00:00    3
Name: is_holiday, dtype: int64

However, if I execute the same code for another year, say 2013, the values remain the same. I tried modifying the code including minor changes but the holiday labels won't change:

# Before filling
print(main_data.loc["25 December, 2013"].is_holiday)

date_time
2013-12-25 00:00:00    3
2013-12-25 01:00:00    0
2013-12-25 02:00:00    0
2013-12-25 03:00:00    0
.
.
.
2013-12-25 19:00:00    0
2013-12-25 20:00:00    0
2013-12-25 21:00:00    0
2013-12-25 22:00:00    0
2013-12-25 23:00:00    0
Name: is_holiday, dtype: int64

# Manually fill
for row_2 in range(len(main_data.loc["2013"])):
    if main_data.Month[row_2] == 12 and main_data.Day[row_2] == 25:
        if main_data.is_holiday[row_2] == 0:
            main_data.is_holiday[row_2] = 3  # 3 is label for Xmas

# After filling
print(main_data.loc["25 December, 2013"].is_holiday)
2013-12-25 00:00:00    3
2013-12-25 01:00:00    0
2013-12-25 02:00:00    0
2013-12-25 03:00:00    0
.
.
.
2013-12-25 19:00:00    0
2013-12-25 20:00:00    0
2013-12-25 21:00:00    0
2013-12-25 22:00:00    0
2013-12-25 23:00:00    0
Name: is_holiday, dtype: int64

What am I missing here? I did not expect this to happen. It works as expected for some holidays while it won't for others. Please do keep in mind that Month and Day are separate columns that I have engineered in main_data.

Edit: I'm welcome to better methods for achieving this.

shiv_90
  • 1,025
  • 3
  • 12
  • 35
  • What's the index in your main data frame – ifly6 Aug 21 '19 at 12:25
  • It's the datetime stamps with the column name `date_time`. – shiv_90 Aug 21 '19 at 12:26
  • 2
    Use [DataFrame.where](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.where.html) or [np.where](https://docs.scipy.org/doc/numpy/reference/generated/numpy.where.html) to update values based on conditions. – SmileyProd Aug 21 '19 at 12:29
  • The index doesn't have a column name, I don't get it. What's your index? – ifly6 Aug 21 '19 at 12:32
  • @ifly6 The name itself is `date_time` and includes timestamps from 2012-10-02 to 2017-05-17. – shiv_90 Aug 21 '19 at 12:34
  • @SmileyProd Can you show an example? – shiv_90 Aug 21 '19 at 12:36
  • 1
    I'm not sure how you use `.loc` then, because afaik it doesn't match on such data. `np.where(cond, if_true, if_false)` in which all are series with the same length. The `cond` is the same format as what you would use for a subsetting command to get what you wnat. – ifly6 Aug 21 '19 at 12:38
  • Well `.loc ` worked for New Year 2013. Before I set the condition in `[]` without using `.loc` and it worked the same way as `.loc`. So I Only made the code more readable by inserting `.loc`. But I'll try using `.where`. Thanks! – shiv_90 Aug 21 '19 at 12:41

4 Answers4

2

Pandas has some nice builtin functions for this kind of task. So you could just do

import numpy as np
main_data.is_holiday.replace(0, np.nan, inplace=True)
main_data.is_holiday.fillna(method='ffill', inplace=True)

should do it.

EDIT:

To apply the fill to the holidays only:

main_data['is_holiday'] = main_data.is_holiday.resample('D').max()
main_data.is_holiday.fillna(method='ffill', inplace=True)

The first line replaces the is_holiday column with the maximum value for that day for the midnight timestamp, which is assumed to be present. The second line fills the other rows, which should by now have a NaN.

ilmiacs
  • 2,566
  • 15
  • 20
2

assuming the date is the index i would simply do:

df.loc[(df.index.day == 25) & (df.index.month ==12),['is_holiday']] = 3

full code with example:

#init data
df = pd.DataFrame(          \    
[['2012-12-25 19:00:00',0,0], \
['2012-12-25 20:00:00',0,0],  \
['2012-12-26 19:00:00',0,0],  \
['2012-12-26 20:00:00',0,0],  \
['2013-01-25 19:00:00',0,0],  \
['2013-01-25 20:00:00',0,0],  \
['2013-12-25 19:00:00',0,0],  \
['2013-12-25 20:00:00',0,0]])

#set index
df.columns = ['date','is_holiday','some_value']
df['date'] = pd.to_datetime(df['date'])
df = df.set_index('date')

#do the search
df.loc[(df.index.day == 25) & (df.index.month ==12),['is_holiday']] = 3
print(df)

result:

                     is_holiday  value
date                                  
2012-12-25 19:00:00  3           0    
2012-12-25 20:00:00  3           0    
2012-12-26 19:00:00  0           0    
2012-12-26 20:00:00  0           0    
2013-01-25 19:00:00  0           0    
2013-01-25 20:00:00  0           0    
2013-12-25 19:00:00  3           0    
2013-12-25 20:00:00  3           0 

some useful links:

https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#time-date-components

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.loc.html

Select rows from a DataFrame based on values in a column in pandas

António Caeiro
  • 135
  • 1
  • 9
1

What you are actually doing

for row in range(len(main_data.loc["2012"]))
#let's says len(main_data.loc["2012"]) = 100

You select line 0 to 100 and replace 0 by 3.

for row_2 in range(len(main_data.loc["2013"]))
#let's says len(main_data.loc["2013"]) = 50

You select line 0 to 50 and replace 0 by 3.

You are doing this in you'r main data set so both code will replace first line of the data set.

You can solve it using something like (pseudo code):

for row in np.where(years=2013,month==12)

akhetos
  • 686
  • 1
  • 10
  • 31
  • But am I not being specific in mentioning the years? I thought specifying years should run the operations separately and not for the whole dataframe as such. – shiv_90 Aug 21 '19 at 12:56
1

As I suggested in the comment you can use DataFrame.where or np.where to do the assignments but I think the best in your case is DataFrame.where:

mask = (main_data.date_time.dt.year == 2012) & (main_data.Month == 12) & (main_data.Day == 25)
main_data[mask].is_holiday = main_data[mask]\
                                    .is_holiday\
                                    .where(~(main_data[mask].is_holiday == 0)), 3)
  1. numpy.where
main_data[mask].is_holiday = np.where(main_data[mask].is_holiday == 0, 3,\
                                             main_data[mask].is_holiday)
SmileyProd
  • 788
  • 4
  • 13