
I'm new to R. I have a large file with multiple columns, and I've been asked to split the data into two parts. I had R split the data randomly, with 70% going into a group called nTrain and 30% into a group called nTest.
I was able to split the data randomly, but I now need to calculate the average of a specific column in the 70% sample and do the same for the 30% sample. Can someone please explain how to do this?

Thanks.

If it helps understand my situation, this is what I have so far in R:

length(DataFile)

(nData=nrow(DataFile))

DataFile

set.seed(0)

(trainIdx<- sample(seq(1,nrow(DataFile)), floor(nrow(DataFile)*0.70)))

> (nTrain=length(trainIdx))
[1] 15129

> (nTest=nData-nTrain)
[1] 6484
Roman
R Newbie
  • Welcome to StackOverflow! Please read the info about [how to ask a good question](http://stackoverflow.com/help/how-to-ask) and how to give a [reproducible example](http://stackoverflow.com/questions/5963269). This will make it much easier for others to help you. – Ronak Shah Nov 24 '18 at 08:00
  • Thanks for the advice Ronak. I will read the info on how to ask a good question and how to give a reproducible example. – R Newbie Nov 25 '18 at 06:28

1 Answer


Welcome to Stack Overflow!

  1. By R convention, you should stick to the <- operator for most types of assignments (you can find more info here and here).
  2. The code/output you posted is incomplete (e.g., the output after the first line, length(DataFile), is missing).
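
To illustrate the first point, here is a small sketch of how the two operators differ inside a function call. The variable name `x` and the values are made up for the example.

```r
# At the top level, `<-` assigns a value to a name.
x <- c(1, 2, 3)

# Inside a call, `=` only names an argument; the outer `x` is untouched.
mean(x = c(10, 20))    # 15
x                      # still 1 2 3

# Inside a call, `<-` performs a real assignment before the function runs,
# so this overwrites the outer `x`.
mean(x <- c(100, 200)) # 150
x                      # now 100 200
```

Using `<-` for assignment and `=` for naming arguments keeps the two roles visually distinct.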

Let's go through this step by step.

1. Create mock data

set.seed(1701)
DataFile <- sample(seq(0, 1, 0.01), 10000, replace = TRUE)

2. Create a dataset

# This randomizes the order
DataSet <- sample(DataFile)

3. Split Train and Test

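split <- floor(length(DataSet) * 0.7)
# floor() keeps the split point a whole number even when
# 70% of the length is fractional.
# Use length() for one-dimensional objects, and
# nrow() for matrices, data frames, etc.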

DataTrain <- head(DataSet, split)
DataTest <- tail(DataSet, length(DataSet) - split)

# This approach avoids rounding errors when splitting, and
# because our dataset is already randomized, we can split it sequentially.
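
When the length is not a multiple of 10, the 70% mark is fractional. The following is a minimal sketch of that case, assuming floor() is used to fix the cut point (as in the question's own code). The names `x`, `shuffled`, `nTrainPart`, `trainPart` and `testPart` are made up for the illustration.

```r
set.seed(1701)

# 15 values: 70% of 15 is 10.5, which is not a whole number.
x <- sample(seq(0, 1, 0.01), 15, replace = TRUE)
shuffled <- sample(x)

# Round the cut point down to a whole number of elements.
nTrainPart <- floor(length(shuffled) * 0.7)

# First nTrainPart elements for training, the remainder for testing.
trainPart <- head(shuffled, nTrainPart)
testPart  <- tail(shuffled, length(shuffled) - nTrainPart)

length(trainPart)  # 10
length(testPart)   # 5
```

Because the test part is sized as the remainder, the two parts together always cover every element exactly once.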

4. Calculate average

> mean(DataTrain)
[1] 0.5029891
> mean(DataTest)
[1] 0.496056
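
For table-shaped data like the multi-column file in the question, the same idea can be applied to rows. This is only a sketch under assumptions: the mock data frame and the column names `Price` and `Group` are hypothetical stand-ins, since the real column names were not posted.

```r
set.seed(0)

# Hypothetical stand-in for the real file; only two columns are mocked.
DataFile <- data.frame(
  Price = runif(100, min = 0, max = 1),            # assumed numeric column
  Group = sample(letters[1:3], 100, replace = TRUE)
)

# Draw 70% of the row numbers at random, as in the question.
trainIdx <- sample(seq_len(nrow(DataFile)), floor(nrow(DataFile) * 0.70))

# Rows listed in trainIdx form the training set; all other rows form the test set.
DataTrain <- DataFile[trainIdx, ]
DataTest  <- DataFile[-trainIdx, ]

# Average of a single column in each part.
mean(DataTrain$Price)
mean(DataTest$Price)
```

If the column contains missing values, `mean(DataTrain$Price, na.rm = TRUE)` ignores them instead of returning `NA`.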
Roman
  • Thanks Roman for the step-by-step instructions. I will try your process and let you know if I can get it to work. – R Newbie Nov 25 '18 at 01:09
  • The file that I'm pulling the data from has 20 columns with headers. I need the average of only one of the columns, for only 70% of the data. Can you explain how I can do this? I appreciate your help! – R Newbie Nov 25 '18 at 01:38
  • Can you post a `dput(head(data))` into your original post and specify the column? As a general approach, @AdamB showed the right method if you work with table-shaped data. – Roman Nov 25 '18 at 01:42