
Let's say we have a file named test.txt which contains an unknown number of columns:

1   2   3   4   5
1   2   3   4   5
1   2   3   4   5
1   2   3   4   5
1   2   3   4   5
1   2   3   4   5
1   2   3   4   5
1   2   3   4   5   6   7   8
1   2   3   4   5
1   2   3   4   5   6
1   2   3   4   5   6
1   2   3   4   5   6

fill=T fails when line 8 has more than 5 columns:

read.table('test.txt', header=F, sep='\t', fill=T)

results:

   V1 V2 V3 V4 V5
1   1  2  3  4  5
2   1  2  3  4  5
3   1  2  3  4  5
4   1  2  3  4  5
5   1  2  3  4  5
6   1  2  3  4  5
7   1  2  3  4  5
8   1  2  3  4  5
9   6  7  8 NA NA
10  1  2  3  4  5
11  1  2  3  4  5
12  6 NA NA NA NA
13  1  2  3  4  5
14  6 NA NA NA NA
15  1  2  3  4  5
16  6 NA NA NA NA

But with skip=3, everything works fine:

read.table('test.txt', header=F, sep='\t', fill=T, skip=3)

We got what we expected:

  V1 V2 V3 V4 V5 V6 V7 V8
1  1  2  3  4  5 NA NA NA
2  1  2  3  4  5 NA NA NA
3  1  2  3  4  5 NA NA NA
4  1  2  3  4  5 NA NA NA
5  1  2  3  4  5  6  7  8
6  1  2  3  4  5 NA NA NA
7  1  2  3  4  5  6 NA NA
8  1  2  3  4  5  6 NA NA
9  1  2  3  4  5  6 NA NA

Why does this happen? Is it because fill=T only checks the first 5 rows? Is there any way to work around this?

Gahoo
    According to `?read.table` `The number of data columns is determined by looking at the first five lines of input (or the whole file if it has less than five lines), or from the length of ‘col.names’ if it is specified and is longer. This could conceivably be wrong if ‘fill’ or ‘blank.lines.skip’ are true, so specify ‘col.names’ if necessary (as in the ‘Examples’).` – akrun Aug 18 '15 at 07:27
    Thank you for your quick response. I've found the answer right in the Examples. – Gahoo Aug 18 '15 at 07:32

2 Answers


I found the answer in the Examples section of ?read.table.

ncol <- max(count.fields('test.txt', sep = "\t"))
read.table('test.txt', header=F, sep='\t', fill=T, col.names=paste0('V', seq_len(ncol)))

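This happens because read.table determines the number of columns from only the first five lines of input, so with fill=T any later line with more fields wraps onto extra rows. The solution is to specify col.names with the full number of columns.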
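For anyone who wants to reproduce this without the original file, here is a minimal self-contained sketch. The temporary file and the names rows, path, fields, guessed, guessed_skip and n_col are my own illustrative choices, not part of the original post. It only mimics the documented "first five lines" rule with count.fields, which also seems to explain why skip=3 happened to work: lines 4 to 8 include the 8-column line.

```r
# Recreate a tab-separated file shaped like test.txt in a temporary location
# (illustrative data; the path is an assumption, not the original file).
rows <- c(rep("1\t2\t3\t4\t5", 7),
          "1\t2\t3\t4\t5\t6\t7\t8",
          "1\t2\t3\t4\t5",
          rep("1\t2\t3\t4\t5\t6", 3))
path <- tempfile(fileext = ".txt")
writeLines(rows, path)

# Number of fields on every line of the file
fields <- count.fields(path, sep = "\t")

# Width suggested by the first five lines only, which is what read.table
# looks at when col.names is not given
guessed <- max(fields[1:5])

# Same check after skipping three lines: the window now covers lines 4 to 8
guessed_skip <- max(count.fields(path, sep = "\t", skip = 3)[1:5])

# True width, found by scanning the whole file
n_col <- max(fields)

# Supplying col.names of the full width keeps every line on its own row
df <- read.table(path, header = FALSE, sep = "\t", fill = TRUE,
                 col.names = paste0("V", seq_len(n_col)))
```

Note that count.fields scans the whole file, so for very large files this costs an extra pass before the actual read.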

Gahoo

Use col.names = paste0("V", seq_len(N)) within read.table, where N is the maximum number of columns.

drmariod