83

I have big data set which consist of around 94 columns and 3 Million rows. This file have single as well as multiple spaces as delimiter between columns. I need to read some columns from this file in R. For this I tried using read.table() with options which can be seen in the code below, the code is pasted below-

### Defining the columns to be read from the file, the first 5 column, then we do not read next 24, after this we read next 5 columns. Last 60 columns are not read in-

    col_classes = c(rep("character",2), rep("numeric", 3), rep("NULL",24), rep("numeric", 5), rep("NULL", 60))   

### Reading first 100 rows of the data

    data <- read.table(file, sep = " ",header = F, nrows = 100, na.strings ="", stringsAsFactors= F)

Since, the file which has to read in have more than one space as the delimiter between some of the column, the above method does not work. Is there any method using which we can read in this file efficiently.

Pawan
  • 1,066
  • 1
  • 10
  • 16
  • 6
    Just remove the `sep=" "` argument. `read.table` by default knows how to handle multiple spaces. – Hong Ooi Jun 07 '13 at 09:37
  • 1
    I have a very similar problem, but I need a more general solution since I have single spaces within some fields. Meaning that I should be able to set the minimum number of consecutive spaces (in my case 2) to be considered as a separator, with no limit for it. – EdM Jun 30 '17 at 15:24
  • Related post: https://stackoverflow.com/questions/30955464/reading-aligned-column-data-with-fread – zx8754 Jan 11 '18 at 20:06
  • 1
    @HongOoi: yeah but only because `read.table/read.csv`'s default is sep="", which means "multiple whitespace", which we might expect should be a regex "\w*" or "\w+" not "". – smci Apr 26 '18 at 03:07

3 Answers3

120

You need to change your delimiter. " " refers to one whitespace character. "" refers to any length whitespace as being the delimiter

 data <- read.table(file, sep = "" , header = F , nrows = 100,
                     na.strings ="", stringsAsFactors= F)

From the manual:

If sep = "" (the default for read.table) the separator is ‘white space’, that is one or more spaces, tabs, newlines or carriage returns.

Also, with a large datafile you may want to consider data.table:::fread to quickly read data straight into a data.table. I was myself using this function this morning. It is still experimental, but I find it works very well indeed.

agstudy
  • 119,832
  • 17
  • 199
  • 261
Simon O'Hanlon
  • 58,647
  • 14
  • 142
  • 184
  • how does 'fread' handles multiple spaces?? This was the first read function which I tried using but for me it failed due to multiple spaces, any work around on this?? – Pawan Jun 07 '13 at 11:04
  • @user2412678 Did you try `fread(... , sep = "" )` or alternatively you could try `fread( ... , sep = "\s" )` but I don't know if that will work. Could you try both and report back, then we can update the answer for `fread` if one of them works. – Simon O'Hanlon Jun 07 '13 at 11:29
  • 5
    `fread( ...,sep ="")` does not work in `fread`, we get the following error when we use this - `Error in fread(file, sep = "", : 'sep' must be 'auto' or a single character` `fread(....,sep = "\s")` does not work in `fread`, we get following error in this case `Error: '\s' is an unrecognized escape in character string starting ""\s"` However, `fread(...,sep = " " )` work, but this does not solves the problem of multiple space as delimiter, it treats multiple space as column – Pawan Jun 07 '13 at 14:29
12

If you want to use the tidyverse (or readr respectively) package instead, you can use read_table instead.

read_table(file, col_names = TRUE, col_types = NULL,
  locale = default_locale(), na = "NA", skip = 0, n_max = Inf,
  guess_max = min(n_max, 1000), progress = show_progress(), comment = "")

And see here in the description:

read_table() and read_table2() are designed to read the type of textual data where
each column is #' separate by one (or more) columns of space.
Revan
  • 2,072
  • 4
  • 26
  • 42
4

If you field have a fixed width, you should consider using read.fwf() which might handle missing values better.

cmbarbu
  • 4,354
  • 25
  • 45