Background: I have a data frame with 60,000 rows and 5 columns (pt/bi/sx/ex/re): pt = subject; bi = birth date; sx = sex; ex = exam (14 types); re = exam result.

> head(fim)
   pct  nasc        sex     exam    res
1  ACF  11/09/1951  F       ldl     81
2  ACF  11/09/1951  F       colt    172
3  ACF  11/09/1951  F       tg      152
4  ACF  11/09/1951  F       ferr    28,1
5  ACF  11/09/1951  F       fe      41
6  ACF  11/09/1951  F       plq     256000
...

As you can see, each subject has at least 14 rows, one for each of the 14 exams and its result.

My problem is that I want to subset patients, together with their full set of exams, based on an exam result. For example: I would like all subjects, and all of their exams, for whom exam1 == 15 or "positive".

Despite having tried several ways, the only solution I think is possible is casting to wide format, selecting, and reshaping back to long. But when I use the cast function, all values are changed:

library(reshape)
df_wide <- cast(df, pt~ex)

The long-to-wide reshape itself runs, but the original values are replaced by new ones. Can anyone help me with that, or suggest another way to subset the data?

> head(dfw)
    pct     hcv     ldl     colt    cr      ferr    fe...
1   AFC     R       73      157     9,56    1687,0  80
2   AAPS    R       78      130     0,91    879,0   104 
3   ASS     R       96      151     0,76    666,2   138
4   ARS     R       67      115     0,73    674,0   133
5   ARDS    R       180     261     0,71    105,0   110
...

Solution:

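library(dplyr)
# Patients with an hcv result of "R" (long-format data: pct/exam/res)
keep <- fim[fim$exam == "hcv" & fim$res == "R", "pct"]
fim <- fim[!duplicated(fim), ]  # drop fully duplicated rows
# Keep every exam row for those patients, then one row per exam per patient
subset_fim <- filter(fim, pct %in% keep)
subset_fim %>% group_by(pct) %>% filter(!duplicated(exam))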
Henrique
  • Reshaping to wide seems like a drastic manipulation to get a simple subset. First figure out which subjects you want to keep: `subjects_to_keep = df[df$ex == "ex1" & df$re == 15, "pt"]`, then subset based on those subjects: `df[df$pt %in% subjects_to_keep, ]`. – Gregor Thomas Jan 10 '16 at 18:41
  • Your questions will be much more readable and get friendlier attention if you use proper capitalization, use backticks to format inline code, and share copy/pasteable data by either sharing it with `dput()` or sharing some code to simulate an example data set, as recommended at the [reproducible example FAQ](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). Showing the desired output for your sample input also makes things very clear so you don't get answers that aren't *quite* right. – Gregor Thomas Jan 10 '16 at 18:56
  • Thanks a lot for your help, it worked! And I'm new to this, so I'm sorry for that. – Henrique Jan 10 '16 at 19:38
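
As a rough illustration of the `dput()` suggestion in the comments, here is a minimal sketch. The data frame name `toy` is hypothetical, and the three rows are copied from the `head(fim)` output in the question (with `res` kept as text, since the real column mixes values such as "28,1" and "R"):

```r
# Hypothetical small sample in the question's long layout (pct / exam / res)
toy <- data.frame(
  pct  = c("ACF", "ACF", "ACF"),
  exam = c("ldl", "colt", "tg"),
  res  = c("81", "172", "152"),
  stringsAsFactors = FALSE
)

# dput() prints R code that recreates the object; that printed text
# is what gets pasted into a question so others can rebuild the data
dput(toy)
```

Running `dput(head(fim, 20))` on the real data would produce the same kind of copy/pasteable text for the first 20 rows.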

1 Answer

You may want to consider the dplyr package, which offers convenient functions for manipulating data. For this task, you can try something like this:

library(dplyr)
df <- filter(df, ex == 'ex1' & re == 15)

If you prefer base R, you can do something like this:

df <- df[df$ex == 'ex1' & df$re == 15, ]

Edit:

If the goal is to keep all rows for a patient as long as any one of that patient's rows has ex == 'ex1' and re == 15, you can achieve that as follows:

library(dplyr)
ptToKeep <- filter(df, ex == 'ex1' & re == 15)$pt
df <- filter(df, pt %in% ptToKeep)

Or, with base as shown in the comment above:

ptToKeep <- df[df$ex == 'ex1' & df$re == 15, ]$pt
df <- df[df$pt %in% ptToKeep, ]
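
To make the two-step idea concrete, here is a small self-contained sketch in base R. The toy values are invented for illustration and are not the asker's data; `re` is stored as text on the assumption that the real column mixes numbers and codes, so the match is written against the string "15":

```r
# Toy data in the same long layout (hypothetical values)
df <- data.frame(
  pt = c("A", "A", "A", "B", "B", "B", "C", "C", "C"),
  ex = rep(c("ex1", "ex2", "ex3"), times = 3),
  re = c("15", "81", "172", "20", "73", "157", "15", "96", "151"),
  stringsAsFactors = FALSE
)

# Step 1: subjects with at least one row where ex1 equals 15
ptToKeep <- unique(df$pt[df$ex == "ex1" & df$re == "15"])

# Step 2: keep every row that belongs to those subjects
result <- df[df$pt %in% ptToKeep, ]

print(ptToKeep)      # "A" "C"
print(nrow(result))  # 6
```

Subject B has ex1 equal to 20, so all three of B's rows are dropped, while A and C keep their full set of exams.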
Gopala