
My large data set has a different number of observations in each year, e.g., 700 observations for 2016 and 400 observations for 2017. I have many years of data, so manually clipping the datasets is not feasible.

I want to cut the observations into quantiles, but using only the first 400 observations in each group.

There is a tantalizing "minmax" argument in the Hmisc documentation. Is it possible to use the minmax argument so that Hmisc only cuts quantiles from observations 1-400?

  • It's easier to help you if you include a simple [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with sample input and desired output that can be used to test and verify possible solutions. – MrFlick May 22 '20 at 00:38
  • That `minmax` argument won't help you. That's just a safety net for those who misspecify the cuts. – Edward May 22 '20 at 00:52

1 Answer


Using dplyr, you can select the first 400 records for each value of year using group_by and slice. Then create quantiles, either within each year or overall.

set.seed(911) # Simulate some uneven data
df <- data.frame(year=rep(2016:2018, times=c(400,500,600)),
                 val=rnorm(1500,50,5))

library(dplyr); library(tidyr)

This creates quantiles within each year:

df %>% group_by(year) %>%
  slice(1:400) %>%
  mutate(q4 = cut(val, 
                  breaks=quantile(val, 
                                  probs = seq(0,1,1/4)), 
                  include.lowest=TRUE, labels=FALSE)) %>%
# You can stop here and save the output, here I continue to check the counts
  count(q4) %>%
  pivot_wider(names_from=q4, values_from=n)
# A tibble: 3 x 5
# Groups:   year [3]
#   year   `1`   `2`   `3`   `4`
#  <int> <int> <int> <int> <int>
#1  2016   100   100   100   100
#2  2017   100   100   100   100
#3  2018   100   100   100   100
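If you do stop after the `mutate` step, the result can be kept as a data frame with one quartile index per row. A minimal sketch, assuming the same simulated `df` as above; the name `df_q` is just an illustrative choice, not something from the original code:

```r
library(dplyr)

set.seed(911) # Same simulated data as above
df <- data.frame(year = rep(2016:2018, times = c(400, 500, 600)),
                 val  = rnorm(1500, 50, 5))

# df_q is a hypothetical name for the saved result
df_q <- df %>%
  group_by(year) %>%
  slice(1:400) %>%                      # first 400 rows of each year
  mutate(q4 = cut(val,
                  breaks = quantile(val, probs = seq(0, 1, 1/4)),
                  include.lowest = TRUE, labels = FALSE)) %>%
  ungroup()                             # drop grouping before further use

nrow(df_q)                  # 3 years x 400 rows
table(df_q$year, df_q$q4)   # quartile counts per year
```

The trailing `ungroup()` is optional; it only keeps later operations on `df_q` from silently running per year.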

Or you can ungroup to create overall quantiles (counts will differ per year).

df %>% group_by(year) %>%
  slice(1:400) %>%
  ungroup() %>%
  mutate(q4 = cut(val, 
                  breaks=quantile(val, 
                                  probs = seq(0,1,1/4)), 
                  include.lowest=TRUE, labels=FALSE)) %>% 
# Stop here to save, or continue to check the counts
  group_by(year) %>%
  count(q4) %>%
  pivot_wider(names_from=q4, values_from=n)

# A tibble: 3 x 5
# Groups:   year [3]
#   year   `1`   `2`   `3`   `4`
#  <int> <int> <int> <int> <int>
#1  2016   116    88   102    94
#2  2017    86   114    85   115
#3  2018    98    98   113    91
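For anyone who would rather not install extra packages, the overall-quantile version can probably be done in base R as well. This is only a sketch of an alternative, not the dplyr approach above; `keep` and `first400` are names made up for the example, and it assumes the rows are already in the order you want within each year:

```r
set.seed(911) # Same simulated data as above
df <- data.frame(year = rep(2016:2018, times = c(400, 500, 600)),
                 val  = rnorm(1500, 50, 5))

# Row position within each year, then keep positions 1..400
keep <- ave(seq_len(nrow(df)), df$year, FUN = seq_along) <= 400
first400 <- df[keep, ]

# Quartile index computed over all kept rows together
first400$q4 <- cut(first400$val,
                   breaks = quantile(first400$val, probs = seq(0, 1, 1/4)),
                   include.lowest = TRUE, labels = FALSE)

table(first400$year, first400$q4)   # counts per year and quartile
```

Here `ave()` numbers the rows within each year, so the logical index `keep` plays the role of `group_by(year) %>% slice(1:400)`.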
Edward
  • Thanks, Edward! Normally I do all my data handling in MATLAB and use R to run unusual models or create GIS layers. I want to be able to work more successfully with data in R, and your detailed explanation is very helpful and moves me forward. – RinsedAndRepeated May 22 '20 at 14:27
  • Oh, one note for anyone else who reads this: dplyr is often used together with the package tidyr. In this case, Edward's use of `pivot_wider` requires tidyr, which is not mentioned. Also, when you install tidyr with `install.packages("tidyr")`, there is a confusing moment where it asks a yes/no question about unpacking and installing. Click "no" to install the tidyr package without compiling it yourself. – RinsedAndRepeated May 22 '20 at 17:37