How to select x (= 1000) rows from an Excel data set having a million rows using R, sampled without replacement, and make this data a new data set?
-
To make it clear - you want to load a sample of size 1000 from an Excel sheet without loading the full file? – storaged Dec 10 '17 at 11:08
-
If @storaged has hit on what you want, look into the `openxlsx` package which provides the `read.xlsx()` function and read that function's help file, specifically regarding the `rows` option. As storaged's comment indicates, though, your question is ambiguous, so you may want to consider revising it to make it more clear what you want. – duckmayr Dec 10 '17 at 11:12
-
Convert to `csv` and follow [this link](https://stackoverflow.com/questions/25932628/how-to-read-a-subset-of-large-dataset-in-r) – Sotos Dec 10 '17 at 11:14
-
@storaged - Yes, I need to select a random sample of size 1000 without replacement, so that the same sample is selected when the program is executed again. – Andy Dec 10 '17 at 11:16
2 Answers
There are a number of libraries for reading XLSX files. I'll use `readxl` here, but I guess it's just as good as any other for this specific case. You also need to use `set.seed()` to make sure the random sample is the same at every execution.
# Read the XLSX file
library(readxl)
df <- read_excel("~/Desktop/yourfile.xlsx")

# Fix the seed so the same 1000 rows are drawn on every run,
# then sample 1000 row indices without replacement
set.seed(123)
df[sample(1:nrow(df), 1000, replace = FALSE), ]
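If the sample should also be kept as a new data set, it can be assigned to its own object and, if needed, written back to disk. A minimal sketch (the output file name is just a placeholder):

# keep the reproducible sample as its own data set
set.seed(123)
sample_df <- df[sample(1:nrow(df), 1000, replace = FALSE), ]
# optionally write the new 1000-row data set to disk (placeholder path)
write.csv(sample_df, "~/Desktop/sample_1000.csv", row.names = FALSE)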

Here is an approach using `openxlsx::read.xlsx()` that allows one to use the sample IDs to specify the rows read from the incoming spreadsheet. This way, you only have to read 1,000 rows instead of reading 1,000,000 only to subset out all but 1,000 of them. `readxl::read_excel()` does not currently have this feature.
set.seed(90514327)
# create a sample of 1,000 row numbers from 1,000,000
theRows <- sample(1:1000000, 1000)
# the first spreadsheet row holds the header labels, so shift the sample
# down by 1 and include row 1 so it is read as the column names
theRows <- c(1, theRows + 1)
# read the Excel file, using theRows to select the rows being read
library(openxlsx)
theData <- read.xlsx("fileName.xlsx", rows = theRows, colNames = TRUE)
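For reuse, the same idea can be wrapped in a small helper function. This is just a sketch: `sample_xlsx_rows()` is a hypothetical name, and it assumes the spreadsheet has its header labels in row 1 and at least `total_rows` data rows below them.

library(openxlsx)

# hypothetical helper: read a reproducible random sample of n data rows
# from an XLSX file whose header labels are in row 1
sample_xlsx_rows <- function(file, n, total_rows, seed = 90514327) {
  set.seed(seed)
  dataRows <- sample(seq_len(total_rows), n) + 1   # +1 skips the header row
  read.xlsx(file, rows = c(1, dataRows), colNames = TRUE)
}

theData <- sample_xlsx_rows("fileName.xlsx", n = 1000, total_rows = 1000000)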
UPDATE 17 Dec 2017: Based on the comments, the OP needs to read a CSV file, not an Excel file. Therefore a different technique is required. `read.csv()` does not have a feature similar to `openxlsx::read.xlsx()` that allows one to specify a vector of rows to be read from the file, so one must read the entire file and then subset it.
For reproducibility, I'll generate a million rows and 10 columns of data, write them to disk with `write.csv()`, and then read the file with `readr::read_csv()` combined with the extract operator `[`. `readr::read_csv()` runs much faster than `base::read.csv()`.
system.time(x <- matrix(runif(10000000),nrow=1000000,ncol=10))
x <- as.data.frame(x)
system.time(write.csv(x,"./data/random.csv",row.names=FALSE))
# create a sample of 1,000 items from 1,000,000
theRows <- sample(1:1000000,1000)
# now use readr::read_csv
library(readr)
system.time(x <- read_csv("./data/random.csv")[theRows,])
nrow(x)
...and the output:
> system.time(x <- matrix(runif(10000000),nrow=1000000,ncol=10))
user system elapsed
0.366 0.060 0.427
> x <- as.data.frame(x)
> system.time(write.csv(x,"./data/random.csv",row.names=FALSE))
user system elapsed
12.444 0.171 12.745
> # create a sample of 1,000 items from 1,000,000
> theRows <- sample(1:1000000,1000)
> # now use readr::read_csv
> library(readr)
> system.time(x <- read_csv("./data/random.csv")[theRows,])
Parsed with column specification:
cols(
V1 = col_double(),
V2 = col_double(),
V3 = col_double(),
V4 = col_double(),
V5 = col_double(),
V6 = col_double(),
V7 = col_double(),
V8 = col_double(),
V9 = col_double(),
V10 = col_double()
)
|==================================================================| 100% 171 MB
user system elapsed
3.289 0.632 4.750
> nrow(x)
[1] 1000
>
Using read.csv() Instead
Here is the performance timing for the same operation with `read.csv()`.
> # for comparison, read.csv timing
> system.time(x <- read.csv("./data/random.csv")[theRows,])
user system elapsed
51.921 0.818 53.231
> nrow(x)
[1] 1000
>
Yes, it takes about 10X more time to read the file with `read.csv()` than with `readr::read_csv()`.
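If memory rather than speed is the main constraint, `readr` also provides `read_csv_chunked()`, which lets you keep only the sampled rows as the file is read instead of holding all 1,000,000 rows at once. A sketch, assuming the same file and row count as above (the chunk size is arbitrary):

library(readr)

theRows <- sort(sample(1:1000000, 1000))

# callback: pos is the row number of the first row in the current chunk,
# so pos:(pos + nrow(chunk) - 1) are the global row numbers of that chunk
keep_sampled <- function(chunk, pos) {
  chunk[(pos:(pos + nrow(chunk) - 1)) %in% theRows, ]
}

x <- read_csv_chunked("./data/random.csv",
                      DataFrameCallback$new(keep_sampled),
                      chunk_size = 100000)
nrow(x)   # 1000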
Hardware Specification
The performance timings were run on a MacBook Pro with the following configuration.

-
I used the following code, but it's not returning 1000 rows, it returns the entire data set. Not sure what I missed here: `set.seed(90514327); theRows <- sample(1:1000000,1000); df <- read.csv(file="[csv file path]", header=TRUE, sep=",", nrows=theRows)` – Andy Dec 17 '17 at 08:07
-
Hello Andy. The `nrows=` argument in `read.csv()` is not equivalent to the `rows=` argument in `read.xlsx()`. `nrows=` expects a single number, not a vector of row numbers as I used with `read.xlsx()`. Instead, try `df <- read.csv(file="[csv file path]", header=TRUE, sep=",")[theRows,]`, which will read the whole file and then keep only the 1,000 sampled rows. The OP asked about reading an Excel file, not a comma-separated values file. – Len Greski Dec 17 '17 at 10:12
-
I have updated my answer with code that reads a CSV file, including performance timings that compare `readr::read_csv()` with `base::read.csv()`. – Len Greski Dec 17 '17 at 10:47