
I have a file which is an original set that looks like this

1   1   1   40.57784227583149   27.618035602470936  40.576842275831495  27.617035602470935
1   3   5   40.57784227583149   27.618035602470936  40.576842275831495  27.617035602470935
1   2   4   40.57784227583149   27.618035602470936  40.576842275831495  27.617035602470935
1   10  3   40.57784227583149   27.618035602470936  40.576842275831495  27.617035602470935
1   5   5   40.57784227583149   27.618035602470936  40.576842275831495  27.617035602470935
1   7   4   40.57784227583149   27.618035602470936  40.576842275831495  27.617035602470935
2   7   1   40.576842275831495  27.617035602470935  40.576842275831495  27.617035602470935
2   8   5   40.576842275831495  27.617035602470935  40.5758422758315    27.616035602470934
2   1   5   40.576842275831495  27.617035602470935  40.576842275831495  27.617035602470935
2   5   1   40.576842275831495  27.617035602470935  40.576842275831495  27.617035602470935
2   4   4   40.576842275831495  27.617035602470935  40.5758422758315    27.616035602470934
2   3   2   40.576842275831495  27.617035602470935  40.576842275831495  27.617035602470935
3   5   4   40.576842275831495  27.617035602470935  40.576842275831495  27.617035602470935
3   7   5   40.576842275831495  27.617035602470935  40.576842275831495  27.617035602470935
3   4   1   40.576842275831495  27.617035602470935  40.5758422758315    27.616035602470934
3   8   3   40.576842275831495  27.617035602470935  40.5758422758315    27.616035602470934
3   2   1   40.576842275831495  27.617035602470935  40.576842275831495  27.617035602470935
4   5   4   40.576842275831495  27.617035602470935  40.576842275831495  27.617035602470935
4   9   1   40.576842275831495  27.617035602470935  40.5758422758315    27.616035602470934
4   8   4   40.576842275831495  27.617035602470935  40.5758422758315    27.616035602470934
4   4   4   40.576842275831495  27.617035602470935  40.5758422758315    27.616035602470934
4   10  5   40.576842275831495  27.617035602470935  40.576842275831495  27.617035602470935
4   7   3   40.576842275831495  27.617035602470935  40.576842275831495  27.617035602470935
5   5   1   40.5758422758315    27.616035602470934  40.576842275831495  27.617035602470935
5   2   4   40.5758422758315    27.616035602470934  40.576842275831495  27.617035602470935
5   6   1   40.5758422758315    27.616035602470934  40.5758422758315    27.616035602470934
5   7   3   40.5758422758315    27.616035602470934  40.576842275831495  27.617035602470935
5   10  2   40.5758422758315    27.616035602470934  40.576842275831495  27.617035602470935
5   9   5   40.5758422758315    27.616035602470934  40.5758422758315    27.616035602470934

The first column defines a UserID, the second a StoreID, and the third a Rating; the fourth and fifth columns are the lng, lat of the user's current location, and the sixth and seventh the lng, lat of a store.

Each row defines a user post

I need to split this dataset as follows:

I want to keep the 80% of every user posts in the train set and 20% in the test set.
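The per-user 80/20 split described above can be sketched in a few lines of plain Python, without any external tool. This is only a minimal sketch of the idea; the column layout (UserID in the first whitespace-separated field) matches the sample above, and the function name and seed are my own choices:

```python
import random
from collections import defaultdict

def split_per_user(lines, test_frac=0.2, seed=42):
    """Split whitespace-separated post lines so that roughly
    (1 - test_frac) of each user's posts land in train and the
    rest in test. The UserID is assumed to be the first field."""
    by_user = defaultdict(list)
    for line in lines:
        user_id = line.split()[0]
        by_user[user_id].append(line)

    rng = random.Random(seed)  # fixed seed for a reproducible split
    train, test = [], []
    for posts in by_user.values():
        rng.shuffle(posts)
        n_test = int(len(posts) * test_frac)
        test.extend(posts[:n_test])
        train.extend(posts[n_test:])
    return train, test
```

Reading the file with `open(...).readlines()`, passing the lines through this function, and writing the two lists to separate files would give the split without erasing anything.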

Searching on Google I read about Weka. In some tutorials I saw, lines are erased randomly (from what I understand), but I do not want that; I want the split described above.

So, my question is this:

Is there a tool to do what I need? I am open to tools other than Weka. If Weka can do what I need, could somebody point me to some information or a tutorial?

EDIT

To give some more information about what I am trying to do: I am building a recommendation system, and to check how accurate it is I need to split the data, predict whether a user would like a location they have not visited, and then compare the predictions of the recommendation algorithm against the test set to calculate precision/recall, F-measure, etc.
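Once the split exists, the precision/recall/F-measure comparison mentioned here reduces to set overlap between the recommended StoreIDs and the StoreIDs the user actually rated in the held-out test set. A minimal sketch of that computation (the function name and the notion of "relevant" items are my own assumptions, not something fixed by the question):

```python
def precision_recall_f1(recommended, relevant):
    """Compare recommended item IDs against the IDs actually found
    relevant in the test set, and return (precision, recall, F1)."""
    recommended, relevant = set(recommended), set(relevant)
    hits = len(recommended & relevant)  # true positives
    precision = hits / len(recommended) if recommended else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```

For example, recommending stores {1, 2, 3, 4} when the test set marks {2, 3, 5} as relevant gives 2 hits, so precision 0.5 and recall 2/3.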

What I have done so far is to randomly remove 20% of each user's posts, but I think there are tools that can do this in a better way than mine (obviously).

Thanks in advance!

Iraklis
  • Let me see if I got this right: you need a tool to read a file line by line, but would like to program something related to machine learning in Java? – John May 25 '16 at 19:14
  • how do you know which should go in train and which go in test? – Nicolas Filotto May 25 '16 at 19:20
  • @John I edited my question for further info about what I am trying to do – Iraklis May 25 '16 at 19:21
  • http://stackoverflow.com/questions/13610074/is-there-a-rule-of-thumb-for-how-to-divide-a-dataset-into-training-and-validatio – John May 25 '16 at 19:22
  • @NicolasFilotto I am not sure what you mean. I want 20% of each user's posts to go to the test set. Are you asking something else? I do not want to erase any user completely or erase too many posts from a user – Iraklis May 25 '16 at 19:22
  • @John Thanks John, I read some very useful information in the post you suggested. But it is not answering my question of how to split the data, i.e. with what technique. Is erasing random lines from each user good after all? Thank you for your interest! – Iraklis May 25 '16 at 19:45

0 Answers