How to subset a data frame in R?

Question

I have a data frame x2 with the following structure:

'data.frame':   31421 obs. of  7 variables:

$ registered_on            : POSIXct, format: "2007-08-29" "2007-09-13" "2008-02-18" "2007-10-07"..
$ trial_id                 : chr  "1" "2" "3" "6" ...
$ ctri_number              : chr  "CTRI/2007/091/000001 " "CTRI/2007/091/000002 "...
$ recruitment_status_india : chr  " Completed" " Completed" " Completed" " Completed" ...
$ recruitment_status_global: chr  " Not Applicable" " Not Applicable" " Not Applicable" " Not Applicable" ...
$ type_of_trial            : Factor w/ 5 levels " "," BA/BE"," Interventional",..: 3 3 3 3 1 3 1 4 3 3 ... 
$ phase                    : Factor w/ 9 levels " N/A"," Phase 1",..: 6 6 4 6 6 6 5 4 4 7 ...

I want to subset it with the following conditions:

registered_on >= "2016-06-01" and type_of_trial =="Interventional"

I tried the following code:

int_trials = subset(x2, select = (registered_on >= "2016-06-01") &&
   (type_of_trial == "Interventional"),
   select = c(trial_id, ctri_number, registered_on, type_of_trial))

The above code isn't working. Could someone please help me find out where I am going wrong? Other suggestions are also welcome.

Thank you in advance.

Welcome to stackoverflow. Please `dput()` your data. See here how to make a minimal reproducible example: — TarJae, Apr 29 '21 at 06:07

score 1 · Accepted Answer · answered Apr 29 '21 at 06:12

Try this:

int_trials = subset(x2, as.Date(registered_on) >= as.Date("2016-06-01") & 
                        type_of_trial == "Interventional", 
                select = c(trial_id, ctri_number, registered_on, type_of_trial))

Or, with dplyr, you can do:

library(dplyr)

x2 %>%
  filter(as.Date(registered_on) >= as.Date("2016-06-01") & 
         type_of_trial == "Interventional") %>%
  select(trial_id, ctri_number, registered_on, type_of_trial) -> int_trials
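As a side note on why the original attempt fails: it passes `select` twice and combines the conditions with `&&`. In `subset()`, `select` only picks columns, the row condition belongs in the second argument, and `&&` is meant for single TRUE/FALSE values rather than whole columns. A minimal sketch on hypothetical toy data (the data frame `toy` and its columns are invented for illustration, not taken from the question):

```r
# Hypothetical toy data, not the asker's real data
toy <- data.frame(
  id    = 1:4,
  year  = c(2015, 2016, 2017, 2018),
  trial = c("A", "B", "A", "A")
)

# `&` compares element by element and returns one TRUE/FALSE per row
keep <- toy$year >= 2016 & toy$trial == "A"
keep

# The row condition goes in subset()'s second argument;
# `select` appears once and only picks columns
result <- subset(toy, year >= 2016 & trial == "A", select = c(id, year))
result
```

Here `keep` has one logical value per row, which is what row filtering needs.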

Ronak Shah

None of them are working. It shows: > int_trials [1] trial_id ctri_number registered_on type_of_trial <0 rows> (or 0-length row.names) – classy_BLINK Apr 29 '21 at 06:25
Are you sure you have dates in the data that occur after 1st June 2016? Is `'Interventional'` spelled correctly in the code as it is in the data? – Ronak Shah Apr 29 '21 at 06:28
Yes, I'm pretty sure. – classy_BLINK Apr 29 '21 at 06:41
@jaishreemendiratta Well, I see that `'Interventional'` is actually `' Interventional'` (with a leading space) in your data, so you need to add a space in the code as well: `type_of_trial == " Interventional"`. Another option is to use `trimws` to remove the leading and trailing whitespace. It also helps to share your data in a reproducible format using `dput`, so that we get your actual data and can avoid such errors; read about [how to give a reproducible example](http://stackoverflow.com/questions/5963269). – Ronak Shah Apr 29 '21 at 06:46
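The `trimws` suggestion in the comment above could look roughly like the following sketch. The three-row data frame is a hypothetical stand-in that only mimics the leading spaces visible in the `str(x2)` output, and the `tz = "UTC"` setting is an assumption made so the date comparison does not depend on the local time zone:

```r
# Hypothetical stand-in for x2; the values are invented for illustration
x2 <- data.frame(
  registered_on = as.POSIXct(c("2016-05-31", "2016-06-01", "2017-01-15"), tz = "UTC"),
  type_of_trial = factor(c(" Interventional", " Interventional", " BA/BE"))
)

# Convert the factor to character and strip leading/trailing whitespace
x2$type_of_trial <- trimws(as.character(x2$type_of_trial))

# The comparison now matches without the leading space
int_trials <- subset(
  x2,
  as.Date(registered_on) >= as.Date("2016-06-01") &
    type_of_trial == "Interventional"
)
int_trials
```

Cleaning the column once up front avoids having to remember the leading space in every later comparison.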

Chris Ruehlemann · Answer 2 · 2021-04-29T06:55:54.787

If you want to subset your data frame on two conditions, you can index the rows that meet both conditions in the respective columns. Note the comma with nothing after it: that position is for the columns, and leaving it empty keeps all of them:

int_trials <- x2[x2$registered_on >= "2016-06-01" & x2$type_of_trial == "Interventional",]
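To illustrate the role of the comma, here is a small sketch on hypothetical data (the data frame `df` and its columns are invented for illustration): an empty column position keeps every column, while a character vector of names restricts the result to those columns.

```r
# Hypothetical toy data, invented for illustration
df <- data.frame(
  id  = 1:4,
  grp = c("a", "b", "a", "b"),
  val = c(10, 20, 30, 40)
)

# Logical vector with one entry per row
keep <- df$val >= 20 & df$grp == "b"

all_cols  <- df[keep, ]                 # empty column position: all columns kept
some_cols <- df[keep, c("id", "val")]   # only the named columns are kept

all_cols
some_cols
```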

EDIT:

Using @Sinh's useful toy data, the code works:

x2 <- data.frame(
  registered_on = as.POSIXct(seq(as.Date('2007-01-01'),
                                 as.Date('2008-01-01'),
                                 by = "day")),
  type_of_trial = sample(c(" ", "BA/BE", "Interventional", "sample 4", "sample 5"),
                         366, replace = TRUE)
)

Subset:

int_trials <- x2[x2$registered_on >= "2007-06-01" & x2$type_of_trial == "Interventional",]

Result:

int_trials
          registered_on  type_of_trial
155 2007-06-04 02:00:00 Interventional
157 2007-06-06 02:00:00 Interventional
161 2007-06-10 02:00:00 Interventional
162 2007-06-11 02:00:00 Interventional
173 2007-06-22 02:00:00 Interventional
175 2007-06-24 02:00:00 Interventional
194 2007-07-13 02:00:00 Interventional
196 2007-07-15 02:00:00 Interventional
199 2007-07-18 02:00:00 Interventional
201 2007-07-20 02:00:00 Interventional
211 2007-07-30 02:00:00 Interventional
212 2007-07-31 02:00:00 Interventional
218 2007-08-06 02:00:00 Interventional
222 2007-08-10 02:00:00 Interventional
224 2007-08-12 02:00:00 Interventional
225 2007-08-13 02:00:00 Interventional
228 2007-08-16 02:00:00 Interventional
235 2007-08-23 02:00:00 Interventional
239 2007-08-27 02:00:00 Interventional
241 2007-08-29 02:00:00 Interventional
250 2007-09-07 02:00:00 Interventional
251 2007-09-08 02:00:00 Interventional
255 2007-09-12 02:00:00 Interventional
259 2007-09-16 02:00:00 Interventional
267 2007-09-24 02:00:00 Interventional
271 2007-09-28 02:00:00 Interventional
272 2007-09-29 02:00:00 Interventional
273 2007-09-30 02:00:00 Interventional
274 2007-10-01 02:00:00 Interventional
276 2007-10-03 02:00:00 Interventional
278 2007-10-05 02:00:00 Interventional
280 2007-10-07 02:00:00 Interventional
288 2007-10-15 02:00:00 Interventional
295 2007-10-22 02:00:00 Interventional
305 2007-11-01 01:00:00 Interventional
321 2007-11-17 01:00:00 Interventional
322 2007-11-18 01:00:00 Interventional
325 2007-11-21 01:00:00 Interventional
327 2007-11-23 01:00:00 Interventional
332 2007-11-28 01:00:00 Interventional
333 2007-11-29 01:00:00 Interventional
337 2007-12-03 01:00:00 Interventional
338 2007-12-04 01:00:00 Interventional
346 2007-12-12 01:00:00 Interventional
353 2007-12-19 01:00:00 Interventional
357 2007-12-23 01:00:00 Interventional
359 2007-12-25 01:00:00 Interventional

score 0 · Answer 3 · answered Apr 29 '21 at 06:46

It is hard to do this without a dput of your data to ensure that we are working on the same data.

Here is an example using subset with sample data similar to your str(x2):

x2 <- data.frame(
  registered_on = as.POSIXct(seq(as.Date('2007-01-01'),
    as.Date('2008-01-01'),
    by = "day")),
  type_of_trial = sample(c(" ", "BA/BE", "Interventional", "sample 4", "sample 5"),
    366, replace = TRUE)
)

head(
  # subset() takes two arguments here:
  # 1st is the data.frame
  # 2nd is a logical (TRUE/FALSE) vector indicating which rows to include
  #     In this case we have two conditions combined by the `&` operator
  subset(x2,
    x2$registered_on >= "2007-06-01" & x2$type_of_trial == "Interventional"),
  20)
#>           registered_on  type_of_trial
#> 154 2007-06-03 07:00:00 Interventional
#> 155 2007-06-04 07:00:00 Interventional
#> 159 2007-06-08 07:00:00 Interventional
#> 160 2007-06-09 07:00:00 Interventional
#> 163 2007-06-12 07:00:00 Interventional
#> 167 2007-06-16 07:00:00 Interventional
#> 169 2007-06-18 07:00:00 Interventional
#> 170 2007-06-19 07:00:00 Interventional
#> 172 2007-06-21 07:00:00 Interventional
#> 180 2007-06-29 07:00:00 Interventional
#> 185 2007-07-04 07:00:00 Interventional
#> 190 2007-07-09 07:00:00 Interventional
#> 197 2007-07-16 07:00:00 Interventional
#> 207 2007-07-26 07:00:00 Interventional
#> 211 2007-07-30 07:00:00 Interventional
#> 218 2007-08-06 07:00:00 Interventional
#> 219 2007-08-07 07:00:00 Interventional
#> 222 2007-08-10 07:00:00 Interventional
#> 224 2007-08-12 07:00:00 Interventional
#> 226 2007-08-14 07:00:00 Interventional

Created on 2021-04-29 by the reprex package (v2.0.0)

score 0 · Answer 4 · answered Apr 29 '21 at 07:44

For the sake of completeness, here is a data.table solution using the example data from @Sinh:

x2 <- data.frame(
  registered_on = as.POSIXct(seq(as.Date('2007-01-01'),
                                 as.Date('2008-01-01'),
                                 by = "day")),
  type_of_trial = sample(c(" ", "BA/BE", "Interventional", "sample 4", "sample 5"),
                         366, replace = TRUE)
)

library(data.table)
x <- as.data.table(x2)
xint_trials <- x[registered_on >= "2007-06-01" & type_of_trial == "Interventional"]

And, just for fun, a microbenchmark of the proposed solutions:

library(dplyr)           # provides %>% and filter() for the dplyr entry
library(microbenchmark)
microbenchmark(
  base = x2[x2$registered_on >= "2007-06-01" & x2$type_of_trial == "Interventional", ],
  subset = subset(x2, x2$registered_on >= "2007-06-01" & x2$type_of_trial == "Interventional"),
  data.table = x[registered_on >= "2007-06-01" & type_of_trial == "Interventional"],
  dplyr = x2 %>% filter(as.Date(registered_on) >= as.Date("2007-06-01") & type_of_trial == "Interventional"),
  times = 1000
)

Unit: microseconds
       expr   min     lq     mean median     uq     max neval cld
       base 278.7 293.85 314.9078  300.3 315.05  8823.2  1000 a  
     subset 312.8 325.10 338.8885  332.2 346.70   835.5  1000 a  
 data.table 380.8 393.60 409.1379  402.8 413.10   838.7  1000  b 
      dplyr 567.3 583.90 668.4504  592.1 606.15 29033.3  1000   c
