I have a big data file (~1 GB) and I want to split it into smaller ones. I have R at hand and plan to use it for the job.

Loading the whole file into memory is not possible, as it fails with the "cannot allocate vector of size xxx" error message.

I want to use the read.table() function with the skip and nrows parameters to read only part of the file in at a time, then save each part out to an individual file.

To do this, I'd first like to know the number of lines in the big file, so I can work out how many rows each output file should get and how many files to split it into.
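
Something like this is what I have in mind for the splitting step, once the line count is known (the file name, separator, and chunk size below are just placeholders):

# Sketch of the splitting loop; assumes total_lines has been obtained already
chunk_size  <- 100000L
total_lines <- 1000000L                 # <- the number I am asking how to get
n_files     <- ceiling(total_lines / chunk_size)

for (i in seq_len(n_files)) {
  chunk <- read.table("bigfile.txt",
                      skip   = (i - 1L) * chunk_size,
                      nrows  = chunk_size,
                      header = FALSE, sep = "\t")
  write.table(chunk, sprintf("part_%03d.txt", i),
              row.names = FALSE, col.names = FALSE, sep = "\t")
}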

My question is: how can I get the number of lines in the big data file without fully loading it into R?

Suppose I can only use R, so no other programming languages are available.

Thank you.

Steve

2 Answers

Counting the lines should be pretty easy -- check this tutorial http://www.exegetic.biz/blog/2013/11/iterators-in-r/ (the "iterating through lines" part). The gist is to use ireadLines from the iterators package to open an iterator over the file.
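
A rough sketch of that approach (this assumes the iterators package is installed; the file name is just a placeholder):

library(iterators)

count_lines <- function(path, chunk = 10000L) {
  it <- ireadLines(path, n = chunk)   # yields up to `chunk` lines per call
  total <- 0L
  repeat {
    lines <- tryCatch(nextElem(it), error = function(e) NULL)
    if (is.null(lines)) break         # nextElem() signals StopIteration at EOF
    total <- total + length(lines)
  }
  total
}

count_lines("bigfile.txt")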

Yana K.
    This is a good suggestion. But please write a piece of example code that will find out the number of lines in the file, without linking to an external tutorial. – Alex Jul 03 '15 at 00:52

For Windows, something like this should work:

fname <- "blah.R"  # example file
# 'find /v /c ""' is the cmd idiom for counting all lines in a file;
# intern = TRUE captures the output, whose second element carries the count
res <- system(paste("find /v /c \"\"", fname), intern = TRUE)[[2]]
# pull the trailing digits off e.g. '---------- BLAH.R: 39'
regmatches(res, gregexpr("[0-9]+$", res))[[1]]
# [1] "39"
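
On Unix-alikes the same idea should work with wc instead (still driven from within R via system()):

as.integer(system(paste("wc -l <", fname), intern = TRUE))
# [1] 39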
Rorschach