I am writing a very simple if/else statement to create a new variable that bins another variable into quartiles. This seems like a very simple procedure; however, the code puts all of my data into a single bin, the one between the median and the third quartile (which violates the definition of a quartile).

Here is the structure of my data:

> str(tmp)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame':   435 obs. of  12 variables:
 $ CD112FP             : chr  "01" "02" "03" "04" ...
 $ State               : chr  "ALABAMA" "ALABAMA" "ALABAMA" "ALABAMA" ...
 $ Year                : num  2011 2011 2011 2011 2011 ...
 $ Alignment           : num  0 0 0 0 0 0 1 0 0 0 ...
 $ State_Aligned       : num  0 0 0 0 0 0 0 1 0 0 ...
 $ PercentFunding      : num  0.0658 0.29 0.6764 0.0174 0.047 ...
 $ fips                : chr  "01" "01" "01" "01" ...
 $ ssa                 : int  1 1 1 1 1 1 1 NA 3 3 ...
 $ region              : int  3 3 3 3 3 3 3 NA 4 4 ...
 $ division            : int  6 6 6 6 6 6 6 NA 8 8 ...
 $ abb                 : chr  "AL" "AL" "AL" "AL" ...
 $ PercentFundingBinned: chr  "0.0625-0.1799" "0.0625-0.1799" "0.0625-0.1799" "0.0625-0.1799" ...

and this is the head of my data:

 head(tmp)
# A tibble: 6 x 12
  CD112FP State    Year Alignment State_Aligned PercentFunding fips    ssa region division abb   PercentFundingBinned
  <chr>   <chr>   <dbl>     <dbl>         <dbl>          <dbl> <chr> <int>  <int>    <int> <chr> <chr>               
1 01      ALABAMA  2011         0             0         0.0658 01        1      3        6 AL    0.0625-0.1799       
2 02      ALABAMA  2011         0             0         0.290  01        1      3        6 AL    0.0625-0.1799       
3 03      ALABAMA  2011         0             0         0.676  01        1      3        6 AL    0.0625-0.1799       
4 04      ALABAMA  2011         0             0         0.0174 01        1      3        6 AL    0.0625-0.1799       
5 05      ALABAMA  2011         0             0         0.0470 01        1      3        6 AL    0.0625-0.1799       
6 06      ALABAMA  2011         0             0         0.0440 01        1      3        6 AL    0.0625-0.1799       

I am using the following if/else chain:

  tmp$PercentFundingBinned <- NULL
  if (tmp$PercentFunding >= quantile(tmp$PercentFunding, 0.75)) {
    tmp$PercentFundingBinned <- paste0(round(quantile(tmp$PercentFunding, 0.75), 4), "-",
                                       round(max(tmp$PercentFundingBinned), 4))
  } else if (tmp$PercentFunding >= median(tmp$PercentFunding)){
    tmp$PercentFundingBinned <- paste0(round(median(tmp$PercentFunding),4), "-", 
                                       round(quantile(tmp$PercentFunding, 0.75),4))
  } else if (tmp$PercentFunding >= quantile(tmp$PercentFunding, 0.25)){
    tmp$PercentFundingBinned <- paste0(round(quantile(tmp$PercentFunding, 0.25),4), "-", 
                                       round(median(tmp$PercentFunding),4))
  } else {
    tmp$PercentFundingBinned <- paste0(round(min(tmp$PercentFunding),4), "-", 
                                             round(quantile(tmp$PercentFunding, 0.25),4))
  }

and it returns the following category:

unique(tmp$PercentFundingBinned)
[1] "0.0625-0.1799"

I'm not sure what to do or how to fix it. This seems like it should be a really easy procedure. Any advice helps, thank you!

benalbert342
  • Please create a smaller example with more focus on the problem. – G. Grothendieck Mar 13 '20 at 21:19
  • You should provide a [minimal reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). It should be [minimal, but complete and verifiable example](https://stackoverflow.com/help/minimal-reproducible-example). Your question should be clear and specific. – M-- Mar 13 '20 at 21:31

2 Answers

I suggest you don't need ifelse at all.

tmp <- read.table(header=TRUE, stringsAsFactors=FALSE, text="
  CD112FP State    Year Alignment State_Aligned PercentFunding fips    ssa region division abb   PercentFundingBinned
1 01      ALABAMA  2011         0             0         0.0658 01        1      3        6 AL    0.0625-0.1799       
2 02      ALABAMA  2011         0             0         0.290  01        1      3        6 AL    0.0625-0.1799       
3 03      ALABAMA  2011         0             0         0.676  01        1      3        6 AL    0.0625-0.1799       
4 04      ALABAMA  2011         0             0         0.0174 01        1      3        6 AL    0.0625-0.1799       
5 05      ALABAMA  2011         0             0         0.0470 01        1      3        6 AL    0.0625-0.1799       
6 06      ALABAMA  2011         0             0         0.0440 01        1      3        6 AL    0.0625-0.1799       ")
quants <- quantile(tmp$PercentFunding, c(0, 0.25, 0.5, 0.75, 1))
quants
#      0%     25%     50%     75%    100% 
# 0.01740 0.04475 0.05640 0.23395 0.67600 
cuts <- cut(tmp$PercentFunding,
            quants, include.lowest = TRUE, dig.lab = 4,
            labels = sprintf("%0.04f-%0.04f", head(quants, n = -1), quants[-1]))
cuts
# [1] 0.0564-0.2339 0.2339-0.6760 0.2339-0.6760 0.0174-0.0447 0.0447-0.0564 0.0174-0.0447
# Levels: 0.0174-0.0447 0.0447-0.0564 0.0564-0.2339 0.2339-0.6760

Granted, this is a factor, but that can easily be converted with as.character if needed.

tmp$PercentFundingBinned <- as.character(cuts)
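
As a quick sanity check on the approach above, the sketch below bins a made-up vector (x is hypothetical random data, not the asker's tmp) and counts the observations per bin. With 100 distinct values and quartile breaks, each bin should hold a quarter of the data.

```r
# Hypothetical data: 100 random values standing in for PercentFunding
set.seed(1)
x <- runif(100)

# Quartile break points: min, Q1, median, Q3, max
q <- quantile(x, probs = seq(0, 1, by = 0.25))

# include.lowest = TRUE keeps the minimum in the first bin
bins <- cut(x, breaks = q, include.lowest = TRUE)

# Number of observations falling in each quartile bin
counts <- table(bins)
counts
```

If one bin came back holding every observation, as in the question, that would point to the binning logic rather than the data.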
r2evans

I'd highly recommend you always pay attention to warnings.

You should not use if with a vector condition because, as the warning says, only the first element is used (since R 4.2.0 this is an error rather than a warning):

> if(c(TRUE, FALSE)) 1 else 2
[1] 1
Warning message:
In if (c(TRUE, FALSE)) 1 else 2 :
  the condition has length > 1 and only the first element will be used
> if(c(FALSE, TRUE)) 1 else 2
[1] 2
Warning message:
In if (c(FALSE, TRUE)) 1 else 2 :
  the condition has length > 1 and only the first element will be used

What happens in your case: the first value is 0.0658, so the if chain evaluates only that element and picks the branch for the bin 0.0625-0.1799 (median to third quartile). Because that branch assigns a single string to the whole column, the value is recycled to every row.
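
To make the recycling visible, here is a minimal sketch on a toy data frame (df, x and bin are made-up names, not the asker's data). The condition is written explicitly as x[1], which is what if effectively tested in older R versions.

```r
# Hypothetical toy data, not the asker's tmp
df <- data.frame(x = c(0.07, 0.29, 0.68, 0.02))

# Only the first element decides the branch. It is spelled out as x[1]
# because recent R versions reject a condition of length > 1.
if (df$x[1] >= median(df$x)) {
  df$bin <- "upper half"   # a single string, recycled to all rows
} else {
  df$bin <- "lower half"   # a single string, recycled to all rows
}

# Every row ends up with the same label
unique(df$bin)
# [1] "lower half"
```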

Instead you can use ifelse:

# ifelse() is vectorised: each row is tested and labelled separately.
# Note the upper bound of the top bin uses max(tmp$PercentFunding),
# not max(tmp$PercentFundingBinned) as in the question.
tmp$PercentFundingBinned <- ifelse(
  tmp$PercentFunding >= quantile(tmp$PercentFunding, 0.75),
  paste0(round(quantile(tmp$PercentFunding, 0.75), 4), "-",
         round(max(tmp$PercentFunding), 4)),
  ifelse(tmp$PercentFunding >= median(tmp$PercentFunding),
         paste0(round(median(tmp$PercentFunding), 4), "-",
                round(quantile(tmp$PercentFunding, 0.75), 4)),
         ifelse(tmp$PercentFunding >= quantile(tmp$PercentFunding, 0.25),
                paste0(round(quantile(tmp$PercentFunding, 0.25), 4), "-",
                       round(median(tmp$PercentFunding), 4)),
                paste0(round(min(tmp$PercentFunding), 4), "-",
                       round(quantile(tmp$PercentFunding, 0.25), 4))
         )
  )
)
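
If the nesting gets hard to read, one possible alternative is to look up the bin index with findInterval, which uses the same left-closed comparison (>=) as the conditions above. This is only a sketch on the six values from the question's head(tmp); x, q, labels, idx and binned are illustrative names.

```r
# The six PercentFunding values shown in head(tmp)
x <- c(0.0658, 0.290, 0.676, 0.0174, 0.0470, 0.0440)

# Break points: min, Q1, median, Q3, max
q <- quantile(x, probs = c(0, 0.25, 0.5, 0.75, 1))

# One "lower-upper" label per bin, built from consecutive break points
labels <- paste0(round(head(q, -1), 4), "-", round(q[-1], 4))

# Bin index 1..4 per value; rightmost.closed = TRUE puts the maximum in bin 4
idx <- findInterval(x, q, rightmost.closed = TRUE)

# Character label for every element
binned <- labels[idx]
binned
```

Here idx is 3 4 4 1 2 1, so each value gets its own bin rather than sharing the first element's label.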
HubertL