1

So I'm working on a project where I need to identify multiple device launch indicators. In order to identify the launch indicators I need the line of code to search through a list of unique dates in a series and subtract the dates that have the same year. So if the SV_DATE was 2015/03/05, the code would look through the series 'Launch Date' to find a match in the year (2015/06/22 for example), and subtract the dates. The between() function checks to see if the result within the range of 0 and 30 days and returns a Boolean. And lastly astype(int) returns a 1 if True

When I run the code I come across two error messages. The first error has to do with the the truth value being ambiguous due to my comparing two columns.

def day_diff(end,start):

    ed = pd.to_datetime(end)

    sd = pd.to_datetime(start)

    #if ed.dt.year == sd.year:

    return (ed-sd).dt.days

 data['AL030'] = day_diff(data['SV_DATE'],data_2.loc[(data_2['MFG'] == 'APPLE') & (pd.Series(pd.DatetimeIndex(data_2['Launch Date'])).dt.year == pd.Series(pd.DatetimeIndex(data['SV_DATE'])).dt.year), 'Launch Date']).between(0,30).astype(int)

In order for the code to run, I need to hard code the year, instead of having the code search through a column of dates. When I do this then the code works.

data['AL030'] = day_diff(data['SV_DATE'],data_2.loc[(data_2['MFG'] == 'APPLE') & (pd.Series(pd.DatetimeIndex(data_2['Launch Date'])).dt.year == 2017), 'Launch Date'].apply(lambda x:x.date().strftime('%Y-%m-%d'))).between(0,30).astype(int)

I'm getting this error before I even add the unique() function to it, which gives me a new error: 'ValueError: cannot add indices of unequal length'

data['AL030'] = day_diff(data['SV_DATE'],data_2.loc[(data_2['MFG'] == 'APPLE') & (pd.Series(pd.DatetimeIndex(data_2['Launch Date'])).dt.year == 2017), 'Launch Date'].apply(lambda x:x.date().strftime('%Y-%m-%d')).unique()).between(0,30).astype(int)

If I didn't want to compare the years between the columns, this piece of code would have sufficed:

data['AL030'] = day_diff(data['SV_DATE'],data_2.loc[(data_2['MFG'] == 'APPLE'), 'Launch Date']).between(0,60).astype(int)

At the end of the day, I'm trying to optimize this piece of code in R to return the same value without utilizing a function like this launch.ind one, while dually adding the year condition to try to cut down on run time:

day_diff = function(end,start){

  x = difftime(end,start,units=c("days"))

  return(x)

}

 

launch.ind = function(ship.date,launch.date,low,high){

  y = rep(0,length(data$SV_DATE))

  for (i in seq(length(data$SV_DATE))){

    y[i] = sum(ifelse((day_diff(ship.date[i],launch.date)>=low)&(day_diff(ship.date[i],launch.date)<=high),1,0))

    y[i] = ifelse(y[i] > 0, 1, 0)

  }

  return(y)

}
###############################

# Add launch indicators

data$AL030 = launch.ind(data$SV_DATE,unique(data_2$"Launch Date"[toupper(data_2$MFG)=="APPLE"]),0,30)

I appreciate anyone attempting to help and I'm open to suggestions to help clarify anything that was unclear

Grayv
  • 27
  • 4
  • Please post sample of `data` and `data2`. See [How to make good reproducible pandas examples](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples). Aside, it is almost rarely a good idea to compare columns from *different* data frames. Try merging then running conditional logic. – Parfait Nov 03 '20 at 15:10
  • In the R code, how many values are in vector `unique(data_2$"Launch Date"[toupper(data_2$MFG) == "APPLE"]`? It should be one or `ifelse` would throw an error due to different lengths. Actually, the R and Pandas code is not consistent. R is not returning number of days but number of times the condition is met between low and high thresholds (i.e., between 0 and 30 days). – Parfait Nov 03 '20 at 15:15

1 Answers1

0

Python


Fake data:

import pandas as pd

data_1 = pd.DataFrame({
    'SV_DATE': pd.to_datetime(['2015/03/05', '2015/03/10', '2016/01/01'])
})

data_2 = pd.DataFrame({
    'Launch Date': pd.to_datetime(['2015/03/05', '2015/12/01', '2016/01/01', '2017/01/01']),
    'MFG': ['APPLE', 'WINDOWS', 'APPLE', 'WINDOWS']
})

print(data_1)

     SV_DATE
0 2015-03-05
1 2015-03-10
2 2016-01-01

print(data_2)

  Launch Date      MFG
0  2015-03-05    APPLE
1  2015-12-01  WINDOWS
2  2016-01-01    APPLE
3  2017-01-01  WINDOWS

If I got it right, you can merge filter data_2 (only lines with MFG==APPLE), merge both dataframes by Year, calculate the difference between dates by Year, then verify if they are inside your desired range (0,30):

data_1 = data_1.assign(Year = data_1.SV_DATE.dt.year, Index = data_1.index)
data_2 = data_2.assign(Year = data_2['Launch Date'].dt.year).query('MFG=="APPLE"')

data = data_1.merge(data_2, on='Year')
data['Diff'] = data.groupby('Year')[['Launch Date','SV_DATE']].transform('diff', axis=1)['SV_DATE'].dt.days
data['in_target_range'] = data.Diff.between(0,30)

Output:

     SV_DATE  Year  Index Launch Date    MFG  Diff  in_target_range
0 2015-03-05  2015      0  2015-03-05  APPLE     0             True
1 2015-03-10  2015      1  2015-03-05  APPLE     5             True
2 2016-01-01  2016      2  2016-01-01  APPLE     0             True

With this output you can do whatever you wanna do, I suppose. Note that I kept the an Index column in order to retrieve those lines in data_1 if you'd like to.

R


A similar approach using R:

library(dplyr)

# Fake data
data_1 <- data.frame(SV_DATE = as.Date(c('2015/03/05', '2015/03/10', '2016/01/01')))

data_2 <- data.frame (
  Launch_Date = as.Date(c('2015/03/05', '2015/12/01', '2016/01/01', '2017/01/01')),
  MFG = c('APPLE', 'WINDOWS', 'APPLE', 'WINDOWS')
)

# Merge and filters
data_2 <- data_2 %>%
  mutate(Year = format(Launch_Date, "%Y")) %>%
  filter(MFG=="APPLE")

data <- data_1 %>% 
  mutate(Year = format(SV_DATE, "%Y"), Index = 1:nrow(.)) %>%
  inner_join(., mutate(data_2, Year=format(Launch_Date, "%Y")), by = "Year") %>%
  group_by(Year) %>%
  mutate(Diff = as.integer(SV_DATE - Launch_Date)) %>%
  mutate(in_target_range = between(Diff, 0, 30))

which output is:

# A tibble: 3 x 7
# Groups:   Year [2]
  SV_DATE    Year  Index Launch_Date MFG    Diff in_target_range
  <date>     <chr> <int> <date>      <chr> <int> <lgl>          
1 2015-03-05 2015      1 2015-03-05  APPLE     0 TRUE           
2 2015-03-10 2015      2 2015-03-05  APPLE     5 TRUE           
3 2016-01-01 2016      3 2016-01-01  APPLE     0 TRUE           

I don't know what you really want with your launch.ind function, but it might be something like this (?):

low = 0
high = 3

data$AL030 <- data %>% 
  group_by(SV_DATE) %>%
  summarise(launch.ind = sum(ifelse(between(Diff, low, high), 1, 0)), .groups='drop') %>%
  mutate(launch.ind = ifelse(launch.ind > 0, 1, 0)) %>%
  pull(launch.ind)

Notes


Although this code works for the fake data I provided, it might not work for you. In any case, I believe it provides some ways to achieve your goal by modifying it.

Also, note that I left in_target_range as boolean in both code chunks, but you can easily change it to integer with .astype(int) and as.integer(...) in Python and R, respectively.

Cainã Max Couto-Silva
  • 4,839
  • 1
  • 11
  • 35