Combining data in R

Question

I am analyzing the results of an experiment stored in a CSV file, with variables as columns and participants as rows. Before all my data is collected, I would like to conduct preliminary analyses on the data I already have. However, I need to exclude some of my participants from the analyses. The best way I have come up with to do this without deleting their data (which could cause problems for me later) is to create a new column, call it "exclude," and enter either a 1 or a 0 for each participant to define who is to be excluded. Then when I run the stats, I just do it on a subset of my data (where exclude == 0, for example).

The problem comes in when I download the complete dataset - how do I get data from my "exclude" column of the preliminary dataset onto the complete dataset, making sure that all the 0s and 1s are attached to the correct participants? I can see how I could just copy and paste if the rows of the preliminary and complete datasets are in the exact same order, but this seems prone to error, and in order to create the exclude column it's a lot easier to sort by different columns. I've tried rbind and merge but they do not work as far as I can tell.

Here is an example of what I'm trying to do:

prelim <- data.frame(
  participant = c(1, 2, 3),
  exclude = c(0, 1, 0)
)

full <- data.frame(
  participant = c(1, 2, 3, 4, 5),
  exclude = c(NA, NA, NA, NA, NA)
)

ideal <- data.frame(
  participant = c(1, 2, 3, 4, 5),
  exclude = c(0, 1, 0, NA, NA)
)

I'm guessing (in the absence of an example) that the problems you are having stem from using `attach`. If you stop using `attach` and instead use `with`, `subset` and `[` you will relieve yourself of the enormous confusion caused by the peculiar possibilities created by `attach`. — IRTFM, Jun 06 '13 at 22:33
Heed @DWin's advice, but without a reproducible example (http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) it's impossible to be more helpful — alexwhan, Jun 06 '13 at 22:57
Thanks for the responses. I included a simplified example above, where NAs are blank. Dwin, I haven't been using attach. Intra, when I try the merge function it makes 2 copies of all columns, when I want all the columns to stay the same. I can't figure out your code with %in% - I've never used that operator. Perhaps complicating the matter is that I am creating the exclude column in Excel by manually inputting data rather than creating the column in R by using if or ifelse. That might be a way around the issue? — user2461563, Jun 07 '13 at 19:05
Nevertheless, it would be best to figure out a way without writing the "exclude" criteria in R, because sometimes I will need to exclude people based on written responses that I have to evaluate individually. — user2461563, Jun 07 '13 at 19:24

score 0 · Answer 1 · answered Jun 07 '13 at 14:37

There are several approaches I'd look at given that we can't see your data.

You could:

Import both datasets, generate your exclude variable based on your conditions, and merge it into your complete dataset on the participant identifier. For example (all.y = TRUE keeps participants who appear only in the complete data):

merge(preliminarydata, completedata, by = 'participantid', all.y = TRUE)
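Applied to the example data from the question, the merge idea might look like the sketch below. It assumes the duplicated columns mentioned in the comments come from both data frames carrying an exclude column (merge() then appends .x/.y suffixes), so the empty placeholder is dropped first. The name full_no_flag is made up for illustration.

```r
# Example data from the question
prelim <- data.frame(participant = c(1, 2, 3), exclude = c(0, 1, 0))
full <- data.frame(participant = c(1, 2, 3, 4, 5), exclude = NA)

# Drop the empty placeholder column so merge() does not return
# exclude.x / exclude.y (full_no_flag is a hypothetical name)
full_no_flag <- full[, setdiff(names(full), "exclude"), drop = FALSE]

# all.x = TRUE keeps every participant in the complete data;
# participants without a preliminary flag get NA
merged <- merge(full_no_flag, prelim[, c("participant", "exclude")],
                by = "participant", all.x = TRUE)
merged
```

Only the identifier and the exclude column are taken from prelim, so every other column of the complete dataset stays as it is.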

Or complete your exclude variable without any merging using the %in% operator.

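# participantsinfulldata is a placeholder for the ID vector of the full data;
# participantsinpreliminarydata should hold only the IDs you want to exclude
peopletoexclude <- which(participantsinfulldata %in% participantsinpreliminarydata)
myfulldataset$exclude <- 0                   # default: keep everyone
myfulldataset$exclude[peopletoexclude] <- 1  # flag the matched participants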
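A related lookup, sketched here with match() instead of %in% (my substitution, not part of the snippet above), copies the existing 0/1 values across by participant ID rather than by row position. The rows of full are shuffled on purpose to show that order does not matter.

```r
prelim <- data.frame(participant = c(1, 2, 3), exclude = c(0, 1, 0))

# Complete data, deliberately in a different row order
full <- data.frame(participant = c(5, 3, 1, 4, 2))

# For each participant in full, find its row position in prelim
# (NA when the participant is not in the preliminary data)
idx <- match(full$participant, prelim$participant)

# Indexing with NA yields NA, so new participants stay unflagged
full$exclude <- prelim$exclude[idx]
full
```

Participants 4 and 5 end up with NA, which matches the "ideal" data frame in the question.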

Or probably a zillion other things people can think of.

Or just drop the people you want to exclude and perform your preliminary analysis. It's worth pointing out that unless you explicitly write.csv over your old CSV file, any data manipulation you do in R does not affect your original CSV in any way. R loads the data into memory and then forgets about your CSV. If you need to save your work you can save(myDatainR, file="myDatainR.Rdata") and come back to it anytime.
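A small sketch of that last point, using a temporary file as a stand-in for the real CSV (csv_path and the toy data are assumptions for illustration):

```r
# Write a toy CSV to a temporary location (stand-in for the real file)
csv_path <- tempfile(fileext = ".csv")
write.csv(data.frame(participant = 1:3, exclude = c(0, 1, 0)),
          csv_path, row.names = FALSE)

dat <- read.csv(csv_path)

# Keep only participants not flagged for exclusion; this changes
# nothing on disk, only the object in the R session
kept <- subset(dat, exclude == 0)

# Re-reading the file shows that all rows are still there
on_disk <- read.csv(csv_path)
nrow(kept)
nrow(on_disk)
```

Note that subset() also drops rows where exclude is NA, so unflagged participants would need a value before they are included.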

score 0 · Answer 2 · answered Jun 11 '13 at 21:46

try ?merge

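# Full outer join on the shared columns; participants 1-3 appear twice
d <- merge(prelim, full, all = TRUE)
# Keep the first row per participant (first column is the ID)
d[!duplicated(d[, 1]), ]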

or you may be interested in data.table:

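library(data.table)
DF1 <- data.frame(x = 1:3, y = 4:6, t = 10:12)
DF2 <- data.frame(x = 3:5, y = 6:8, s = 1:3)
# Convert to keyed data.tables; joins use the key columns x and y
DF1 <- data.table(DF1, key = c("x", "y"))
DF2 <- data.table(DF2, key = c("x", "y"))
DF2[DF1]   # for example: look up each row of DF1 in DF2
DF1[!DF2]  # or maybe you want this? rows of DF1 with no match in DF2
DF2[!DF1]  # rows of DF2 with no match in DF1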
