Apply conditional selection to sequence of columns R

Question

I use data from the NHANES periodontal dataset (https://wwwn.cdc.gov/Nchs/Nhanes/2009-2010/OHXPER_F.htm) and after cleaning it to only keep the "pc" variables, I have a df=setPD 168 columns that include 6 measurements (pcd, pcm, pcs, pcp, pcl, pca) around 28 teeth numbered from #02 to #31

#names(setPD)
  [1] "ohx02pcd" "ohx02pcm" "ohx02pcs" "ohx02pcp" "ohx02pcl" "ohx02pca" "ohx03pcd" "ohx03pcm" "ohx03pcs" "ohx03pcp" "ohx03pcl" "ohx03pca"
 [13] "ohx04pcd" "ohx04pcm" "ohx04pcs" "ohx04pcp" "ohx04pcl" "ohx04pca" "ohx05pcd" "ohx05pcm" "ohx05pcs" "ohx05pcp" "ohx05pcl" "ohx05pca"
 [25] "ohx06pcd" "ohx06pcm" "ohx06pcs" "ohx06pcp" "ohx06pcl" "ohx06pca" "ohx07pcd" "ohx07pcm" "ohx07pcs" "ohx07pcp" "ohx07pcl" "ohx07pca"
 [37] "ohx08pcd" "ohx08pcm" "ohx08pcs" "ohx08pcp" "ohx08pcl" "ohx08pca" "ohx09pcd" "ohx09pcm" "ohx09pcs" "ohx09pcp" "ohx09pcl" "ohx09pca"
 [49] "ohx10pcd" "ohx10pcm" "ohx10pcs" "ohx10pcp" "ohx10pcl" "ohx10pca" "ohx11pcd" "ohx11pcm" "ohx11pcs" "ohx11pcp" "ohx11pcl" "ohx11pca"
 [61] "ohx12pcd" "ohx12pcm" "ohx12pcs" "ohx12pcp" "ohx12pcl" "ohx12pca" "ohx13pcd" "ohx13pcm" "ohx13pcs" "ohx13pcp" "ohx13pcl" "ohx13pca"
 [73] "ohx14pcd" "ohx14pcm" "ohx14pcs" "ohx14pcp" "ohx14pcl" "ohx14pca" "ohx15pcd" "ohx15pcm" "ohx15pcs" "ohx15pcp" "ohx15pcl" "ohx15pca"
 [85] "ohx18pcd" "ohx18pcm" "ohx18pcs" "ohx18pcp" "ohx18pcl" "ohx18pca" "ohx19pcd" "ohx19pcm" "ohx19pcs" "ohx19pcp" "ohx19pcl" "ohx19pca"
 [97] "ohx20pcd" "ohx20pcm" "ohx20pcs" "ohx20pcp" "ohx20pcl" "ohx20pca" "ohx21pcd" "ohx21pcm" "ohx21pcs" "ohx21pcp" "ohx21pcl" "ohx21pca"
[109] "ohx22pcd" "ohx22pcm" "ohx22pcs" "ohx22pcp" "ohx22pcl" "ohx22pca" "ohx23pcd" "ohx23pcm" "ohx23pcs" "ohx23pcp" "ohx23pcl" "ohx23pca"
[121] "ohx24pcd" "ohx24pcm" "ohx24pcs" "ohx24pcp" "ohx24pcl" "ohx24pca" "ohx25pcd" "ohx25pcm" "ohx25pcs" "ohx25pcp" "ohx25pcl" "ohx25pca"
[133] "ohx26pcd" "ohx26pcm" "ohx26pcs" "ohx26pcp" "ohx26pcl" "ohx26pca" "ohx27pcd" "ohx27pcm" "ohx27pcs" "ohx27pcp" "ohx27pcl" "ohx27pca"
[145] "ohx28pcd" "ohx28pcm" "ohx28pcs" "ohx28pcp" "ohx28pcl" "ohx28pca" "ohx29pcd" "ohx29pcm" "ohx29pcs" "ohx29pcp" "ohx29pcl" "ohx29pca"
[157] "ohx30pcd" "ohx30pcm" "ohx30pcs" "ohx30pcp" "ohx30pcl" "ohx30pca" "ohx31pcd" "ohx31pcm" "ohx31pcs" "ohx31pcp" "ohx31pcl" "ohx31pca"

I am trying to apply a conditional selection in each group of six columns. This is:

transmute(setPD,PD02 = ifelse(setPD$ohx02pcd >5 | 
    setPD$ohx02pcm>5 |setPD$ohx02pcs >5| 
    setPD$ohx02pcp >5 | setPD$ohx02pcl >5 | 
    setPD$ohx02pca >5, 1, 0))

Then for the next tooth (03) I have to write again:

transmute(setPD,PD03 = ifelse(setPD$ohx03pcd >5 |
   setPD$ohx03pcm>5|setPD$ohx03pcs >5|
   setPD$ohx03pcp >5|setPD$ohx03pcl >5|
   setPD$ohx03pca >5, 1, 0))

I tried to firstly do that conditional selection in a more efficient way, something like:

transmute(setPD,PD02 = ifelse(list(setPD$ohx02pcd:setPD$ohx02pcp) >5, 1, 0))

but it does not work.

Then I am looking for a way to write a loop that does that over each tooth without needing to write this 28 times!! I thought of applying the select function of dplyr in a for loop but I don't know how to do that. At the end I want to get all the new columns I made with transmute and say that if at least 2 of the 28 columns are 1, then I have disease, if <2 are 1 then I have health. ANy help would be appreciated.

**Note: If you want to get the dataset, it is open access from CDC.org: https://wwwn.cdc.gov/Nchs/Nhanes/2009-2010/OHXPER_F.htm **

@user3495945 - Just a comment - I almost never download files from dropbox or file-sharing sites without knowing personally who put them there. I suspect many here would feel the same. The onus is on you to provide a small example of reproducible code that can usually be copied and pasted straight from the question. — thelatemail, Mar 06 '16 at 22:56
@thelatemail I am novice in R. How do you recommend doing that? As I said in the question the dataset is publicly available from the CDC.org website which should be a very safe source https://wwwn.cdc.gov/Nchs/Nhanes/2009-2010/OHXPER_F.htm — user3495945, Mar 06 '16 at 23:31
@user3495945 - see the FAQ here - http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example — thelatemail, Mar 06 '16 at 23:39
I tried to paste it but even 4 rows are unfortunately 20,000 characters longer than allowed — user3495945, Mar 06 '16 at 23:53
How about something like `dput(droplevels(your_data[1:6, 1:10]))`. Should be reproducible and pretty short. — Gregor Thomas, Mar 07 '16 at 00:04
I'm a little confused. Your text talks about "selection", which I think of as filtering - getting rid of rows and columns. But all your code examples use `transmute` which *adds* columns. Do you actually want to filter your data down or do you want to add new "flag" columns? — Gregor Thomas, Mar 07 '16 at 00:07
Hi Gregor, you are correct. I want to add a column with the conditional statement from the ifelse after every 6 columns and get rid of the 6 columns; thus use of transmute. I will use these "new" columns to do the second step of ifelse — user3495945, Mar 07 '16 at 00:18
dput(droplevels(set09ex[1:6,1:9])) structure(list(SEQN = c(51624L, 51630L, 51643L, 51645L, 51647L, 51656L), OHX02PCD = c(2L, 2L, 2L, 2L, 1L, 1L), OHX02PCM = c(1L, 1L, 2L, 2L, 1L, 1L), OHX02PCS = c(1L, 4L, 2L, 4L, 2L, 1L), OHX02PCP = c(2L, 1L, 2L, 3L, 1L, 1L), OHX02PCL = c(1L, 1L, 2L, 4L, 1L, 2L), OHX02PCA = c(2L, 3L, 2L, 7L, 2L, 3L), OHX02LAD = c(2L, 1L, 1L, 1L, 2L, 1L), OHX02LAM = c(1L, 1L, 1L, 2L, 2L, 1L)), .Names = c("SEQN", "OHX02PCD", "OHX02PCM", "OHX02PCS", "OHX02PCP", "OHX02PCL", "OHX02PCA", "OHX02LAD", "OHX02LAM" ), row.names = c(1L, 3L, 6L, 7L, 8L, 12L), class = "data.frame") — user3495945, Mar 07 '16 at 00:20
I was thinking for something like for (i in 1:9){ transmute(setPD,i = ifelse(setPD[,paste0("ohx0",i,"pcd")]>5 | setPD[,paste0("ohx0",i,"pcm")]>5 |setPD[,paste0("ohx0",i,"pcs")]>5| setPD[,paste0("ohx0",i,"pcp")]>5 | setPD[,paste0("ohx0",i,"pcl")]>5 | setPD[,paste0("ohx0",i,"pca")]>5, 1, 0)) but it doesnt seem to work } — user3495945, Mar 07 '16 at 00:23

coffeinjunky · Answer 1 · 2016-03-09T11:25:20.873

First, it is useful to point out that the logical statements of the form is A true OR is B true OR is C true are equivalent to asking is ANY of A,B,C true? We can use this to simplify the statements setPD$ohx02pcd >5 | setPD$ohx02pcm>5 |setPD$ohx02pcs >5| ... to ask if for any of these columns it is true that their value is larger than 5.

For example, let us focus on tooth number 02 first. To get all columns that concern this tooth, we can use grep to get a vector of column names. This can be achieved with

current_tooth <- grep("02", names(setPD), value = T)

Note that if there are any other columns in the data that contain the string 02, these columns will also show up. This does not appear to be the case in your data, but it is worthwhile pointing out here in case someone else uses it and this applies in other datasets.

Now, we can use these names to subset the dataframe. For instance,

 setPD[,current_tooth]

will give you the corresponding columns. In each row, we want to check if any of the above mentioned conditions are true. Given a vector of logical statements, we can check if any of them is true with the function any. To go through a dataframe by row and apply a function, we can use apply, such as in

setPD$PD02 <- 
  apply(setPD[,grep("02", names(setPD), value = T)], 1, function(x) any(x>5))

Now, the above applies to one tooth only, namely 02. One way of doing it for all teeth is to create a vector with all tooth indicators and use this to loop over the above lines, replacing the "02" in the above grep call in each iteration and using assign or something similar to get the variable name right. A more elegant and more efficient way is to use the same principle on long data. Consider the following:

library(reshape2)
library(dplyr)

m <- melt(setPD, id.vars="SEQN")
m$num <- substr(m$variable, 4,5) # be careful here and check output!
m <- m %>% group_by(num) %>% mutate(PS = any(value>5))
m$num <- paste0("PS", m$num)
md <- dcast(m, SEQN ~ num, value.var = "PS")
setPD <- merge(setPD, md, by="SEQN")

This melts your data first and creates a variable num that indicates your tooth. Again, make sure that this works. I have used the fact that in your data, the tooth number all appear in the 4th and 5th place in the character string. Make sure this is true, and adjust the code otherwise. Then I create a variable PS which indicates whether any of the columns that contain the tooth identifer has a value larger than 5. Last but not least I recast the data so that you have the values of PD02, PD03, etc in columns again, before I merge this to the old dataset. The line with paste0 merely creates the variable names that you want to have.

that's awesome!!!! How do you think can one apply your grep approach which is pretty cool to all columns instead of manually doing that for "02" and then for "03" and so on? — user3495945, Mar 08 '16 at 17:17
Well, if you still want to evaluate the logical statements for each tooth individually, then you should use my second approach, or wrap a loop around the statements. If you want to evaluate all teeth at once, you could use `grep("0[1-9]", names(setPD), value = T)`. But again, please check that this retrieves no unwanted columns. — coffeinjunky, Mar 08 '16 at 17:22
@coffeeinjunky I tried this and it returned a dataframe with 28 PS values being "6" in every row. something does not work. — user3495945, Mar 20 '16 at 03:53
The number of teeth with the condition (value>5) seem to be in a column titled "m$value" but don't get recasted — user3495945, Mar 20 '16 at 04:02
Could you post a more extensive sample of your data that reproduces this behavior? In your current sample, there is only the 02 tooth included. — coffeinjunky, Mar 20 '16 at 12:40
Also, did you check if the line in which `num` gets defined works? That is, `substr` etc. — coffeinjunky, Mar 20 '16 at 12:42
@coffeejunky num seems to work. I tried posting more of the dataset but it is not possible due to length. You could duplicate the dataset and change to 03 if this helps. — user3495945, Mar 29 '16 at 17:38

Apply conditional selection to sequence of columns R

1 Answers1