I am working with the following dataset:

library(data.table)
dat <- fread("https://www.dropbox.com/s/kj66h9shv6zge91/mydat.csv?dl=1")

which looks like this:

            source_id experiment_id variable_id
   1: CESM2-WACCM-FV2    historical          pr
   2: CESM2-WACCM-FV2    historical          pr
   3: CESM2-WACCM-FV2    historical         tas
   4: CESM2-WACCM-FV2    historical         tas
   5:     FGOALS-f3-L    historical          pr
  ---                                          
5657:      MRI-ESM2-0        ssp585          pr
5658:     CESM2-WACCM        ssp585          pr
5659:     CESM2-WACCM        ssp585         tas
5660:     CESM2-WACCM        ssp585         tas
5661:     CESM2-WACCM        ssp585      tasmax

For each variable_id, I am trying to get the list of source_id values that are present in every experiment_id (e.g. "historical", "ssp126", "ssp245", "ssp370", "ssp585").

Any ideas on how to get there? Looks like a simple question, but I could not find an adequate answer on SO that works with characters rather than numeric values.

thiagoveloso

1 Answer

Maybe this will help:

# For each variable_id, split source_id by experiment_id and intersect the groups
by(dat, dat$variable_id, function(x)
        Reduce(intersect, split(x$source_id, x$experiment_id)))

#dat$variable_id: pr
# [1] "BCC-CSM2-MR" "MRI-ESM2-0"  "CESM2-WACCM" "INM-CM5-0"  "INM-CM4-8"    
# [6] "MPI-ESM1-2-HR" "CMCC-CM2-SR5"  "NorESM2-MM"  "EC-Earth3"  "EC-Earth3-Veg"
#[11] "GFDL-ESM4"    
#-------------------------------------------------------------------------- 
#dat$variable_id: tas
# [1] "BCC-CSM2-MR"   "MRI-ESM2-0"    "CESM2-WACCM"   "AWI-CM-1-1-MR" "INM-CM4-8"
# [6] "INM-CM5-0"     "MPI-ESM1-2-HR" "CMCC-CM2-SR5"  "NorESM2-MM"    "EC-Earth3"
#[11] "EC-Earth3-Veg" "GFDL-ESM4"    
#-------------------------------------------------------------------------- 
#dat$variable_id: tasmax
# [1] "BCC-CSM2-MR"   "MRI-ESM2-0"    "AWI-CM-1-1-MR" "INM-CM4-8"     "INM-CM5-0"
# [6] "MPI-ESM1-2-HR" "NorESM2-MM"    "EC-Earth3"     "EC-Earth3-Veg" "GFDL-ESM4"
#-------------------------------------------------------------------------- 
#dat$variable_id: tasmin
# [1] "BCC-CSM2-MR"   "MRI-ESM2-0"    "AWI-CM-1-1-MR" "INM-CM4-8"     "INM-CM5-0"
# [6] "MPI-ESM1-2-HR" "NorESM2-MM"    "EC-Earth3"     "EC-Earth3-Veg" "GFDL-ESM4"

For each variable_id, this returns the source_id values that are present in every experiment_id.
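
To see what the split/intersect step does, here is a minimal sketch on a made-up data frame. The `toy` values and the name `common_by_variable` are mine, not taken from mydat.csv, and I use `lapply` over `split` instead of `by` only so the result is a plain named list:

```r
# Hypothetical toy data, not from mydat.csv
toy <- data.frame(
  source_id     = c("A", "B", "A", "A", "C", "A", "C", "C"),
  experiment_id = c("historical", "historical", "ssp585",
                    "historical", "historical", "ssp585", "ssp585", "ssp585"),
  variable_id   = c("pr", "pr", "pr", "tas", "tas", "tas", "tas", "tas"),
  stringsAsFactors = FALSE
)

# One data frame per variable_id; within each, one source_id vector per
# experiment_id, reduced pairwise with intersect()
common_by_variable <- lapply(
  split(toy, toy$variable_id),
  function(x) Reduce(intersect, split(x$source_id, x$experiment_id))
)

common_by_variable$pr   # "A": "B" only appears under "historical"
common_by_variable$tas  # "A" "C": duplicates are dropped by intersect()
```

Because `intersect` compares values as sets, this works the same way for character and numeric columns.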


If you instead want the source_id values that are present in every combination of variable_id and experiment_id:

Reduce(intersect, split(dat$source_id, list(dat$variable_id, dat$experiment_id)))

#[1] "BCC-CSM2-MR"   "MRI-ESM2-0"    "INM-CM5-0"     "INM-CM4-8"    
#[5] "MPI-ESM1-2-HR" "NorESM2-MM"    "EC-Earth3"     "EC-Earth3-Veg"
#[9] "GFDL-ESM4"    
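
One thing to watch for, shown here as a small sketch on made-up data (the `toy` values and result names are hypothetical): when `split` is given a list of grouping vectors, it keeps empty groups for combinations that never occur unless you pass `drop = TRUE`. A single missing variable_id/experiment_id pair then makes the whole intersection empty.

```r
# Hypothetical toy data: "tas" has no rows for "ssp585"
toy <- data.frame(
  source_id     = c("A", "B", "A", "B", "A", "B"),
  experiment_id = c("historical", "historical", "ssp585", "ssp585",
                    "historical", "historical"),
  variable_id   = c("pr", "pr", "pr", "pr", "tas", "tas"),
  stringsAsFactors = FALSE
)

groups <- list(toy$variable_id, toy$experiment_id)

# Default drop = FALSE: the empty "tas.ssp585" group is kept,
# so the intersection comes back empty
all_combos <- Reduce(intersect, split(toy$source_id, groups))

# drop = TRUE: only combinations that actually occur are intersected
observed_combos <- Reduce(intersect, split(toy$source_id, groups, drop = TRUE))

length(all_combos)  # 0
observed_combos     # "A" "B"
```

Which behaviour you want depends on the question: the default treats a missing combination as "no source_id qualifies", while `drop = TRUE` ignores combinations that are absent from the data.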
Ronak Shah