How to input sampled missing data from the full dataset in R

Question

I am using a sampling method to a data about 700000 row and 5 columns for the full dataset. The columns are, x1 ; x2 ; x3 ; x4 ; y.
After I use the method in Matlab for the sampling method, i got around 400000 sampled dataset. But the trouble is, the method did not sample the "y", which means the "y" is in full dataset and separated from x1;x2;x3;x4 . For weeks i tried to figure out the code again and still the "y" is not sampled. Therefore i have to find another way to match the dataset , The sampling method did not randomized the full data set, so it's particularly in order but the sampling method removes a few dataset. This is the screenshot of the data

So the above is the screenshot, we can see that from the "FULL DATASET", the Sampled only takes a few data. The highlighted blue color from "FULL DATASET" is the data taken into "Sampled", meanwhile the Black Text Color in "FULL DATASET" is removed, that's why there are no data from the Black Text Color in Full dataset in "Sampled". From the "Sampled" the Y is missing, i can fill in manually for this but it would take a very long time since the sampled data have around 400000 dataset. So how can i fill in the "Y" in "Sampled" from the "Full Dataset" that have been sampled using R dataframe?

Update

     inputdata <- function(pop,sam) 
  {

  dfpop <- data.frame(pop)
  dfsam <- data.frame(sam)
  ndfpop = nrow(dfpop)
  ndfsam = nrow(dfsam)

  for ( i in 1:ndfsam) {
    if( dfsam[i,1] == dfpop[i,1] && dfsam[i,2] == dfpop[i,2] && dfsam[i,3] == dfpop[i,3] && dfsam[i,4] == dfpop[i,4] ) {
      completesam<- print(dfpop[i,5] == dfsam[i,5])
    }
  }
write.csv(completesam, file = "D://completesampling.csv")
}

Previously i used Excel for this case, but since the work prefer R i used R instead. The function return FALSE all along the row , and i put multiple expression inside the IF

[1] FALSE
[1] FALSE
[1] FALSE
[1] FALSE
[1] FALSE
[1] FALSE
[1] FALSE
[1] FALSE
[1] FALSE
[1] FALSE

Do you have any idea which part of the code is missing?

Please see this question: http://stackoverflow.com/questions/1299871/how-to-join-merge-data-frames-inner-outer-left-right. Can't vote to close because of the bounty but this definitely should be. Also note that if you have multiple Y values for combinations of X1, X2, X3 and X4 you will not be able to accurately pull Y — Chris, Apr 15 '16 at 19:26
@Chris, although you can flag it for moderator to close: http://meta.stackexchange.com/questions/14591/how-can-we-close-questions-with-bounties — dww, Apr 18 '16 at 19:03
Does the dupe answer your question? If it does, I'll close this as a dupe and refund your bounty. — Robert Harvey, Apr 21 '16 at 22:57
@RobertHarvey I have answered this question below, thank you — user3292755, Apr 22 '16 at 11:32

bVa · Answer 1 · 2016-04-16T09:13:36.883

If you know that every combination is unique (between x1, x2, x3, x4), I recommand using merge

data.frame <- merge(dfsam, dfpop, by = c("x1", "x2", "x3", "x4"), all = F)

This code should add the Y column when there is match between the 4 others columns values.

[EDIT]

I wrote the beginning of your data to see if merge could really work.

dfpop = data.frame(x1 = c(4, 3, 0, 2, 1, 4, 2, 0, 3, 0, 0, 2, 0, 0, 3, 0, 2, 0, 4, 0, 2, 1, 0), 
             x2 = c(7500, 14600,  5700,  7500,  7800,  7500,  7500,  6648,  7886,  6858,
                    8077,  7600,  8077,  8077,  7500,  8077, 14600, 7600,  3500, 14172, 
                    7600,  7500,  7500), 
             x3 = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0),
             x4 = c(1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1),
             y = c(0, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1))


dfsam = dfpop[c(5,17,23),c("x1", "x2", "x3", "x4")] # 3 highlighted lines without y column

df = merge(dfsam, dfpop, by = c("x1", "x2", "x3", "x4")) # merging the 2 data.frames

> print(df)
  x1    x2 x3 x4 y
1  0  7500  0  1 1
2  1  7800  0  1 1
3  2 14600  0  1 1

As you can see, the y column is correctly added to the dfsam data.frame. There are no problems with different length of vector.

I guess that you may have an error in the source data.

Thank you, i think 'merge' can be useful for this case, but this happens when i tried the function `Error in merge.data.frame(dfsam, dfpop, by = c("x1", "x2", "x3", "x4"), : negative length vectors are not allowed` Is this happens because i have different length of vector? `dfsam` have 386431 row, while `dfpop` have 709899 row. So the difference is large enough — user3292755, Apr 16 '16 at 01:22

score 2 · Answer 2 · answered Apr 14 '16 at 13:29

When you are using INDEX formula, if you are giving multiple columns as reference array(Column A to D in your case) you need to give Column Number as a third parameter(1 to 4 in your case). So if you use below formula you will not receive any error message.

=INDEX($A$3:$D$500002,MATCH($E$3,E3:$E$500002,0),1)

But the with your MATCH formula you will get row number 3 every single time as return value. So you will get the same value with this combination of formula. Instead of $E$3 you need a value to be compared in the array E3:$E$500002. But also you need unique values in the lookup array to compare as these lookup formulas returns the first value they find in the array. So you will not be able to find the second value if there is a duplicate.

What I suggest(if it is possible), insert a new column as the first column of your dataset and name it as "Record Number" which is 1 to 700K, an incremental number for each row.

Than than on the right hand side you can return the 6th column(Y) value with a lookup formula.

Thank you so much for your response, but at present the work changed into R — user3292755, Apr 15 '16 at 03:52

score -1 · Answer 3 · answered Apr 19 '16 at 19:38

Are you just trying to take a random sample of your dataframe? That's easy to do with something like this:

mtcars[sample(seq_along(mtcars[, 1]), 20, FALSE), ]

What you're doing with this approach is creating a random set of numbers from 1 to the number of rows in your dataset, then taking 20 random numbers from that dataset. You can then pass that back into the subset operation to get just the rows that match those numbers. The "FALSE" indicates that you want the sampling to take place without replacement, which is what you want unless you're doing bootstrapping.

score -1 · Accepted Answer · answered Apr 22 '16 at 11:32

Finally i come up with this and things go fine, thank you so much for all your response

populationHC <- read.csv("E:/populationHC.csv")
sampledHC <- read.csv("E:/sampledHC.csv")

nsampled = dim(sampledHC)[1]
npopulation = dim(populationHC)[1]

for (i in 1:nsampled)
{
  j = 1
  check = 0
  while (check < 4)
  {
    j = j+1
    if (sampledHC[i,1]==populationHC[j,1]){check1=1}else{check1=0}
    if (sampledHC[i,2]==populationHC[j,2]){check2=1}else{check2=0}
    if (sampledHC[i,3]==populationHC[j,3]){check3=1}else{check3=0}
    if (sampledHC[i,4]==populationHC[j,4]){check4=1}else{check4=0}
    check = check1 + check2 + check3 + check4
  }
  sampledHC[i,5]=populationHC[j,5]
  print(paste('Sampled data -',i))
}

write.csv(sampledHC, "E:/HCsampled.csv")

How to input sampled missing data from the full dataset in R

4 Answers4