
I'm reading in many large tab-separated .txt files using read.table in R. However, some lines contain newline breaks (\n) where there should be tabs (\t), which causes an Error in scan(...). How can I deal with this issue robustly? (Is there a way to replace \n-->\t every time scan encounters an error?)

Edit:

Here's a simple example:

read.table(text='a1\tb1\tc1\td1\n
                 a2\tb2\tc2\td2', sep='\t')

works fine, and returns a data frame. However, suppose there is, by some mistake, a newline \n where there should be a tab \t (e.g., after c1):

read.table(text='a1\tb1\tc1\nd1\n
                 a2\tb2\tc2\td2', sep='\t')

This raises an error:

Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  : 
    line 1 did not have 4 elements

Note: Using fill=T won't help, because it will push d1 to a new row.

sirallen
  • Look at the documentation for scan (`?scan`) - I think that adding the argument `fill=TRUE` should help: "logical: if `TRUE`, scan will implicitly add empty fields to any lines with fewer fields than implied by `what`." – Marc in the box Apr 26 '15 at 06:46
  • @Marcinthebox From what I understand, that's used to deal with missing data... using `fill=T` in my case would split the data into separate rows, which is not what I want – sirallen Apr 26 '15 at 06:49
  • http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example Can you provide a relevant file snippet? – npjc Apr 26 '15 at 07:04
  • Is your question answered? If not, try to specify what's missing. – npjc May 12 '15 at 09:00

1 Answer

If you have exactly the problem you describe (no missing data, just the wrong separator), try:

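library(readr)
# A string containing a newline is treated as literal data by read_lines()
initial_lines <- read_lines('a1\tb1\tc1\nd1\na2\tb2\tc2\td2')

# Split every line on tabs and flatten the result into one vector of fields
seperated_together <- unlist(strsplit(initial_lines, "\t", fixed = TRUE))

# Fill row by row so that each record ends up on its own row
matrix(seperated_together, ncol = 4, byrow = TRUE)

gives:

     [,1] [,2] [,3] [,4]
[1,] "a1" "b1" "c1" "d1"
[2,] "a2" "b2" "c2" "d2"

which you can then convert as you wish (e.g. with as.data.frame).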

If you have missing data or other complications, then you'll have to start from:

strsplit(initial_lines, '\t', fixed = TRUE)

which gives:

[[1]]
[1] "a1" "b1" "c1"

[[2]]
[1] "d1"

[[3]]
[1] "a2" "b2" "c2" "d2"  

and you'll have to walk through the elements, combining consecutive ones based on how many fields each contains.
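A minimal sketch of that combining step might look like the following. `repair_lines` is a hypothetical helper name (it is not part of readr or base R). It assumes that a stray newline only ever replaces a tab, that you know the expected number of fields, and that no line ends in an empty field, since `strsplit()` drops trailing empty strings.

```r
# Hypothetical helper (not part of readr or base R): merge consecutive
# short lines until a record has the expected number of fields.
# Assumes a stray newline only ever replaces a tab, and that no line ends
# in an empty field (strsplit() drops trailing empty strings).
repair_lines <- function(lines, n_fields, sep = "\t") {
  records <- list()
  current <- character(0)
  for (line in lines) {
    # Append this line's fields to the record being assembled
    current <- c(current, strsplit(line, sep, fixed = TRUE)[[1]])
    if (length(current) > n_fields) {
      stop("record has more than ", n_fields, " fields; cannot repair")
    }
    if (length(current) == n_fields) {
      # Record is complete: store it and start a new one
      records[[length(records) + 1]] <- current
      current <- character(0)
    }
  }
  if (length(current) > 0) {
    warning("dropping incomplete trailing record")
  }
  # One row per repaired record
  do.call(rbind, records)
}

lines <- c("a1\tb1\tc1", "d1", "a2\tb2\tc2\td2")
fixed <- repair_lines(lines, n_fields = 4)
fixed
```

On a real file you would probably get `lines` from `readLines()` (or `read_lines()`), and the resulting character matrix can be converted with `as.data.frame(fixed, stringsAsFactors = FALSE)`. If fields can genuinely be missing, this approach cannot tell a missing field from a misplaced newline, so the input would need to be inspected by hand.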

You could also have a look at ?count_fields in readr.
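As a rough illustration of how `count_fields` could be used to locate the broken lines before repairing them, here is a sketch on the same example string. The expected field count of 4 is an assumption taken from the question's example.

```r
library(readr)

# Same malformed example as in the question: a newline where a tab should be
txt <- "a1\tb1\tc1\nd1\na2\tb2\tc2\td2"

# Number of tab-separated fields on each line
n <- count_fields(txt, tokenizer_tsv())

# Lines that do not have the expected 4 fields
bad_lines <- which(n != 4)
bad_lines
```

Here lines 1 and 2 are flagged, which tells you where a record was split across lines.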

npjc