How to create a test set from multiple datasets and avoid deleting variables in the process?

Question

I have three datasets that I want to join to create a test set for a supervised machine learning algorithm. Although they share some variables, they differ in their number of rows and in their other columns. I have tried the merge() function, but each successive merge leaves me with fewer rows, and I end up with a dataset containing a ridiculously small number of rows.

I have these three datasets:

test_review    rows: 22956
test_business  rows: 1205
test_user      rows: 5105

I want to keep the original number of reviews from the test_review dataset (22956) in the final test set. The idea is that when a review has no matching business or user in merge(), the columns added by that merge should contain NA for that review instead of the row being dropped. Is there any way to do this?

The way to ensure that `merge` doesn't remove any rows of data is to include the argument `all=TRUE`. — eipi10, Sep 12 '17 at 18:22
Post `head(test_review); head(test_business); head(test_user)` — pogibas, Sep 12 '17 at 18:22
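
To illustrate the `merge` suggestion, here is a minimal sketch on toy data. The column names (`business_id`, `user_id`, `stars`, `city`, `fans`) are hypothetical, since the real columns were not posted. It uses `all.x = TRUE` rather than `all = TRUE`: `all = TRUE` would also keep businesses and users that have no review, so the result could end up with more rows than test_review, whereas `all.x = TRUE` keeps only the rows of the left table.

```r
# Toy stand-ins for the three data sets; all column names are hypothetical.
test_review <- data.frame(
  review_id   = 1:4,
  business_id = c("b1", "b2", "b9", "b1"),
  user_id     = c("u1", "u7", "u2", "u2"),
  stars       = c(5, 3, 4, 1),
  stringsAsFactors = FALSE
)
test_business <- data.frame(
  business_id = c("b1", "b2"),
  city        = c("Phoenix", "Toronto"),
  stringsAsFactors = FALSE
)
test_user <- data.frame(
  user_id = c("u1", "u2"),
  fans    = c(10, 0),
  stringsAsFactors = FALSE
)

# all.x = TRUE keeps every row of the left (x) table;
# reviews without a match get NA in the columns added by the merge.
test_set <- merge(test_review, test_business, by = "business_id", all.x = TRUE)
test_set <- merge(test_set, test_user, by = "user_id", all.x = TRUE)

# Check that no review was lost.
nrow(test_set) == nrow(test_review)
```

The row count is preserved only if each key appears at most once in test_business and test_user; duplicated keys on the right side would duplicate the matching reviews. Note also that `merge` sorts the result by the key column, so the original row order is not kept.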

score 0 · Answer 1 · answered Sep 12 '17 at 19:26

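You can try rbind.fill() from plyr. Note that it stacks the data frames row-wise and fills any column missing from one of them with NA: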

library(plyr)
rbind.fill(test_review,test_business,test_user)
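
As a rough check of what this call produces, the sketch below runs `rbind.fill` on two small, hypothetical data frames (the column names are assumptions, not the real ones). The result has the sum of the input row counts and the union of their columns, so with the real data it would have more rows than test_review rather than one row per review.

```r
# Hypothetical toy frames with only partly overlapping columns.
reviews <- data.frame(
  business_id = c("b1", "b2"),
  stars       = c(5, 3),
  stringsAsFactors = FALSE
)
business <- data.frame(
  business_id = "b1",
  city        = "Phoenix",
  stringsAsFactors = FALSE
)

# Run only if plyr is installed.
if (requireNamespace("plyr", quietly = TRUE)) {
  stacked <- plyr::rbind.fill(reviews, business)
  print(nrow(stacked))   # row counts of both inputs added together
  print(names(stacked))  # union of the column names
  print(stacked)         # cells with no source value are NA
}
```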

moodymudskipper
