I have a data.frame of galaxies and their distances (z):

> head(sdss16, 10)
                 SDSS  RAJ2000   DEJ2000   MJD Class QSO        z    umag    gmag    rmag    imag    zmag    e_umag    e_gmag    e_rmag    e_imag    e_zmag
1  000000.15+353104.2 0.000629 35.517841 58402     0   1 0.845435 18.9640 18.6307 18.4295 18.4118 18.2555 0.0248228 0.0138142 0.0173684 0.0171765 0.0281816
2  000000.33+310325.3 0.001415 31.057048 58073     0   1 2.035491 22.0825 21.7871 21.5621 21.3595 20.9340 0.1381920 0.0461832 0.0504525 0.0603687 0.1857780
3  000000.36+070350.8 0.001535  7.064129 58449     0   1 1.574227 22.5173 22.1028 21.8542 21.6380 21.8888 0.2093710 0.0641275 0.0674263 0.0829677 0.2956540
4  000000.36+274356.2 0.001526 27.732283 57654     0   1 1.770552 22.3475 21.9031 21.7528 21.6635 21.9946 0.1889810 0.0556878 0.0731551 0.0841880 0.3567380
5  000000.45+092308.2 0.001914  9.385637 58450     0   1 2.024146 18.7664 18.6627 18.4998 18.3365 18.1586 0.0261839 0.0309531 0.0179315 0.0260643 0.0214897
6  000000.45+174625.4 0.001898 17.773739 56945     3   1 2.309000 22.4403 21.9089 22.0700 21.9268 21.3725 0.2871240 0.0677072 0.1153900 0.1489100 0.3854550
7  000000.47-002703.9 0.001978 -0.451088 55477     3   1 0.250000 21.6832 21.1946 20.5092 20.1535 19.8793 0.1288200 0.0415909 0.0301123 0.0290315 0.0765198
8  000000.57+055630.8 0.002375  5.941903 57367     0   1 2.102771 22.3606 21.6176 21.3399 21.2840 20.7872 0.3101850 0.0539608 0.0710789 0.1014390 0.2420300
9  000000.62+311944.3 0.002595 31.328982 58073     0   1 1.991313 19.6818 19.4060 19.3189 19.0364 18.8358 0.0299476 0.0160732 0.0150661 0.0247494 0.0376382
10 000000.66+145828.8 0.002756 14.974675 56268     3   1 2.497000 21.9420 21.2236 20.8861 20.7823 20.6592 0.1638730 0.0360871 0.0372218 0.0509094 0.2107500

I want to add a new column which describes the z as 'Low', 'Medium', or 'High' based on which quantile the galaxy is in:

summary(sdss16$z)

     Min.   1st Qu.    Median      Mean   3rd Qu.      Max.      NA's 
-0.002643  1.177832  1.692103  1.740606  2.260000  7.023917         4

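I can split the data into three data frames with the following (na.rm = TRUE is needed because z contains NAs):

lowz <- sdss16 %>% filter(z < quantile(z, 0.25, na.rm = TRUE))
midz <- sdss16 %>% filter(z >= quantile(z, 0.25, na.rm = TRUE) & z < quantile(z, 0.75, na.rm = TRUE))
hiz <- sdss16 %>% filter(z >= quantile(z, 0.75, na.rm = TRUE))

So my question is: how can I add a new column based on the quartiles, as described?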

Jim421616
  • What is `quartile`? Which package is it from? I only know of `quantile`, and `quantile(z, 1)` will give you the maximum value of `z`, so apart from the maximum value, every other value in `z` will satisfy the condition `z – Onyambu May 17 '21 at 23:10
  • `quartile` was my guess that there might be a function by that name, which will do what I want. Apparently, there isn't. – Jim421616 May 17 '21 at 23:11
  • A function that will do what exactly? – Onyambu May 17 '21 at 23:11
  • I was thinking, `quartile(z, 1)` would return the first quartile of the `z` column. I could write a function to do this, I suppose, but is there one in the tidyverse? – Jim421616 May 17 '21 at 23:13
  • Do you mean `quantile`? I am still confused. I do not know what `quartile` is, but I know what `quantile` is. – Onyambu May 17 '21 at 23:15
  • If you need the first quartile, just do `quantile(z, 0.25)` – Onyambu May 17 '21 at 23:16
  • @Onyambu that does answer my first question, thank you. I've edited my post. – Jim421616 May 17 '21 at 23:33
  • You could use `cut` if that is what you want – Onyambu May 17 '21 at 23:41
  • Possible duplicate/Related https://stackoverflow.com/questions/12979456/categorize-numeric-variable-into-group-bins-breaks – Ronak Shah May 18 '21 at 02:28

2 Answers

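library(dplyr)

# na.rm = TRUE is needed because z contains NAs;
# right = FALSE gives [q25, q75) bins, matching the filters in the question
sdss16 <- sdss16 %>% 
  mutate(cat = cut(z, 
                   c(-Inf, quantile(z, c(.25, .75), na.rm = TRUE), Inf), 
                   labels = c("Low", "Medium", "High"),
                   right = FALSE))

Then, if you need to split them into separate data frames, you can use split():

split(sdss16, sdss16$cat)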
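
As a quick sanity check, here is a minimal sketch of the same cut() idea on made-up values, using base R only. The `toy` data frame and its `z` values are assumptions for illustration, not the SDSS data.

```r
# Toy stand-in for sdss16; the z values are made up, with one NA included
toy <- data.frame(id = 1:9,
                  z  = c(0.1, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, NA))

# Quartile break points, ignoring missing values
q <- quantile(toy$z, c(0.25, 0.75), na.rm = TRUE)

# right = FALSE makes the bins [-Inf, q25), [q25, q75), [q75, Inf)
toy$cat <- cut(toy$z,
               breaks = c(-Inf, q, Inf),
               labels = c("Low", "Medium", "High"),
               right  = FALSE)

# Count the rows per category, including the row with a missing z
table(toy$cat, useNA = "ifany")
```

Rows with a missing z simply get NA as their category, so they drop out of split() unless you handle them separately.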
LMc
  • `cut(z, quantile(z, c(0, .25, .75, 1), na.rm = TRUE), c("Low", "Medium", "High"), include.lowest = TRUE)` should do the trick. +1 still – Onyambu May 17 '21 at 23:43

Perhaps this will work?

library(tidyverse)
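# Thresholds mirror the filters in the question; rows with a missing z get NA
sdss16 %>% 
  mutate(z_category = case_when(z < quantile(z, 0.25, na.rm = TRUE) ~ "Low",
                                z >= quantile(z, 0.25, na.rm = TRUE) & z < quantile(z, 0.75, na.rm = TRUE) ~ "Medium",
                                z >= quantile(z, 0.75, na.rm = TRUE) ~ "High"))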
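
To see how case_when() treats boundary values and missing redshifts, here is a small sketch on made-up data. The `toy` data frame is an assumption for illustration, and the quartiles are computed once up front rather than inside each condition.

```r
library(dplyr)

# Made-up redshifts standing in for sdss16$z; one value is missing
toy <- data.frame(z = c(0.2, 0.9, 1.4, 1.9, 2.6, NA))

# Compute the quartiles once, ignoring the NA
q25 <- quantile(toy$z, 0.25, na.rm = TRUE)[["25%"]]
q75 <- quantile(toy$z, 0.75, na.rm = TRUE)[["75%"]]

toy <- toy %>%
  mutate(z_category = case_when(z < q25             ~ "Low",
                                z >= q25 & z < q75  ~ "Medium",
                                z >= q75            ~ "High"))

toy
```

A value exactly on the first quartile lands in "Medium" and one exactly on the third quartile lands in "High", which matches the filter() conditions in the question. The row with a missing z matches no condition, so its category is NA.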
jared_mamrot