
How do I select x (= 1000) rows, without replacement, from an Excel data set with a million rows using R, and save the result as a new data set?

  • To make it clear - you want to load a sample of size 1000 from the Excel sheet without loading the full file? – storaged Dec 10 '17 at 11:08
  • If @storaged has hit on what you want, look into the `openxlsx` package which provides the `read.xlsx()` function and read that function's help file, specifically regarding the `rows` option. As storaged's comment indicates, though, your question is ambiguous, so you may want to consider revising it to make it more clear what you want. – duckmayr Dec 10 '17 at 11:12
  • Convert to `csv` and follow [this link](https://stackoverflow.com/questions/25932628/how-to-read-a-subset-of-large-dataset-in-r) – Sotos Dec 10 '17 at 11:14
  • @storaged - Yes, I need to select a random sample of size 1000 without replacement, so that the same sample is selected when the program is executed again. – Andy Dec 10 '17 at 11:16

2 Answers


There are a number of libraries for reading XLSX files. I'll use readxl here, but for this specific case any of them should do just as well. You also need to call set.seed() to make sure the random sample is the same at every execution.

# Read XLSX file
require(readxl)
df <- read_excel("~/Desktop/yourfile.xlsx")

# Fix the RNG seed so the same rows are drawn on every run,
# then draw 1000 row indices without replacement
set.seed(123)
df[sample(1:nrow(df), 1000, replace = FALSE), ]
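
If, as the question asks, the sample should be kept as a new data set, you can assign the result to a new object and (optionally) write it out. A minimal sketch, assuming the code above has already loaded df; the output file name is only an illustrative placeholder:

# store the sampled rows as a new data set
set.seed(123)                 # same seed, so the same rows are selected
sampled_df <- df[sample(1:nrow(df), 1000, replace = FALSE), ]

# optionally write the new data set out (file name is only an example)
write.csv(sampled_df, "~/Desktop/sample_1000.csv", row.names = FALSE)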
baci

Here is an approach using openxlsx::read.xlsx(), which lets one use the sampled row IDs to specify the rows read from the incoming spreadsheet. This way, you only have to read 1,000 rows instead of reading all 1,000,000 only to discard all but 1,000 of them.

readxl::read_excel() does not currently have this feature.

set.seed(90514327)
# create a sample of 1,000 items from 1,000,000
theRows <- sample(1:1000000,1000)

# read the Excel file, using theRows to select the rows being read
# assume the first row contains header labels, so shift theRows down by 1
# and include row 1 so the column names are still read
theRows <- c(1, theRows + 1)
library(openxlsx)
theData <- read.xlsx("fileName.xlsx", rows = theRows, colNames = TRUE)

UPDATE 17 Dec 2017: Based on the comments, the OP needs to read a CSV file, not an Excel file, so a different technique is required. read.csv() does not have a feature similar to the rows= argument of openxlsx::read.xlsx() that allows one to specify a vector of rows to be read from the file. Therefore, one must read the entire file and then subset it.
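
In its simplest form, that read-then-subset pattern looks like the sketch below; the CSV file name is only a placeholder:

set.seed(90514327)
# draw a reproducible sample of 1,000 row numbers without replacement
theRows <- sample(1:1000000, 1000)

# read the whole file, then keep only the sampled rows
df <- read.csv("yourfile.csv", header = TRUE)[theRows, ]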

For reproducibility, I'll generate a million rows and 10 columns of data, write them to disk with write.csv(), and read them back with readr::read_csv() combined with the extract operator [. readr::read_csv() runs much faster than base::read.csv().

system.time(x <- matrix(runif(10000000),nrow=1000000,ncol=10))
x <- as.data.frame(x)
system.time(write.csv(x,"./data/random.csv",row.names=FALSE))
# create a sample of 1,000 items from 1,000,000
theRows <- sample(1:1000000,1000)
# now use readr::read_csv
library(readr)
system.time(x <- read_csv("./data/random.csv")[theRows,])
nrow(x)

...and the output:

> system.time(x <- matrix(runif(10000000),nrow=1000000,ncol=10))
   user  system elapsed 
  0.366   0.060   0.427 
> x <- as.data.frame(x)
> system.time(write.csv(x,"./data/random.csv",row.names=FALSE))
   user  system elapsed 
 12.444   0.171  12.745 
> # create a sample of 1,000 items from 1,000,000
> theRows <- sample(1:1000000,1000)
> # now use readr::read_csv
> library(readr)
> system.time(x <- read_csv("./data/random.csv")[theRows,])
Parsed with column specification:
cols(
  V1 = col_double(),
  V2 = col_double(),
  V3 = col_double(),
  V4 = col_double(),
  V5 = col_double(),
  V6 = col_double(),
  V7 = col_double(),
  V8 = col_double(),
  V9 = col_double(),
  V10 = col_double()
)
|==================================================================| 100%  171 MB
   user  system elapsed 
  3.289   0.632   4.750 
> nrow(x)
[1] 1000
> 

Using read.csv() Instead

Here is the performance timing for the same operation with read.csv().

> # for comparison, read.csv timing
> system.time(x <- read.csv("./data/random.csv")[theRows,])
   user  system elapsed 
 51.921   0.818  53.231 
> nrow(x)
[1] 1000
> 

Yes, it takes roughly 10X longer to read the file with read.csv() than with readr::read_csv().

Hardware Specification

The performance timings were run on a MacBook Pro with the following configuration.

  • Operating system: macOS 10.13.2 (17C88)
  • Processor: Intel i5 at 2.6 GHz, turbo up to 3.3 GHz, two cores
  • Memory: 8 gigabytes
  • Disk: 512 gigabytes, solid state drive
  • Date built: April 2013
Len Greski
    • I used the following code, but it's returning the entire data set rather than 1000 rows. Not sure what I missed here: `set.seed(90514327); theRows <- sample(1:1000000,1000); df <- read.csv(file="[csv file path]",header=TRUE,sep=",",nrows=theRows)` – Andy Dec 17 '17 at 08:07
    • Hello Andy. The `nrows=` argument in `read.csv()` is not equivalent to the `rows=` argument in `read.xlsx()`. `nrows=` expects a single number, not a vector of row numbers as I used with `read.xlsx()`. Instead, try `df <- read.csv(file="[csv file path]",header=TRUE,sep=",")[theRows,]`, which will read the entire file and then keep only the 1,000 sampled rows. The OP asked about reading an Excel file, not a comma-separated values file. – Len Greski Dec 17 '17 at 10:12
    • I have updated my answer with code that reads a CSV file, including performance timings that compare `readr::read_csv()` with `base::read.csv()`. – Len Greski Dec 17 '17 at 10:47