33

What is the best way to read a file into R when the header has two necessary lines for the header?

This happens to me all the time, as people often use one line for the column name and then include another line underneath it for the unit of measurement. I don't want to skip anything. I want the names and the units to carry through.

Here is what a typical file with two headers might look like:

trt   biomass    yield
crop    Mg/ha    bu/ac
C2      17.76   205.92
C2      17.96   207.86
CC      17.72   197.22
CC      18.42   205.20
CCW     18.15   200.51
CCW     17.45   190.59
P       3.09    0.00
P       3.34    0.00
S2      5.13    49.68
S2      5.36    49.72
pnuts
  • 58,317
  • 11
  • 87
  • 139
Nazer
  • 3,654
  • 8
  • 33
  • 47
  • the file you linked to has a one-line header... Please update and tell us what the expected data.frame column names should be. – flodel Jul 22 '13 at 21:39
  • 6
    The way to handle this situation is not to re-ask the question, but to put a bounty on the old question. – joran Jul 22 '13 at 21:39
  • I agree with @joran. If your question is substantially different, please explain the differences. – Joshua Ulrich Jul 22 '13 at 21:41
  • possible duplicate of [Reading in files with two header rows](http://stackoverflow.com/questions/2293131/reading-in-files-with-two-header-rows) – Gavin Simpson Jul 22 '13 at 21:42
  • @joran, while I agree with you in general, (i believe that) someone with only 45 rep cannot place a bounty – Ricardo Saporta Jul 22 '13 at 21:42
  • Sorry, the file is updated. – Nazer Jul 22 '13 at 21:43
  • @RicardoSaporta Nothing stopping them going back and editing the Q explain why they think there could be better Answers. – Gavin Simpson Jul 22 '13 at 21:44
  • @RicardoSaporta You're right, that privilege kicks in at 75. Still, I don't see how this isn't a duplicate. – joran Jul 22 '13 at 21:44
  • The question is just more general. The original question is lengthy and specific to a more complicated dataset. I'm not sure if that is what discouraged people from answering. – Nazer Jul 22 '13 at 21:51
  • Well, so far all you've done is get the solution presented in the old question again. I agree the solutions there could be cleaned up somewhat, but you're not going to get anything very different. – joran Jul 22 '13 at 21:56
  • @joran, I fully agree it is a duplicate (and voted as such).. I was commenting to the idea of offering a bounty – Ricardo Saporta Jul 22 '13 at 22:00
  • 4
    It may have been a duplicate but it was three years old and neither answer was responsive. The first answer was just wrong and hte seond was not using R. Furthermore the data offered was a mess. – IRTFM Jul 22 '13 at 22:04
  • 1
    @DWin I'm convinced. I retracted my close vote, and voted to close the old question as a dupe of this one. – joran Jul 22 '13 at 22:34

5 Answers5

25

I would do two steps, assuming we know that the first row contains the labels, and there are always two headers.

header <- scan("file.txt", nlines = 1, what = character())
data <- read.table("file.txt", skip = 2, header = FALSE)

Then add the character vector header on as the names component:

names(data) <- header

For your data this would be

header <- scan("data.txt", nlines = 1, what = character())
data <- read.table("data.txt", skip = 2, header = FALSE)
names(data) <- header

head(data)

>     head(data)
  trt biomass  yield
1  C2   17.76 205.92
2  C2   17.96 207.86
3  CC   17.72 197.22
4  CC   18.42 205.20
5 CCW   18.15 200.51
6 CCW   17.45 190.59

If you want the units, as per @DWin's answer, then do a second scan() on line 2

header2 <- scan("data.txt", skip = 1, nlines = 1, what = character())
names(data) <- paste0(header, header2)

> head(data)
  trtcrop biomassMg/ha yieldbu/ac
1      C2        17.76     205.92
2      C2        17.96     207.86
3      CC        17.72     197.22
4      CC        18.42     205.20
5     CCW        18.15     200.51
6     CCW        17.45     190.59
Gavin Simpson
  • 170,508
  • 25
  • 396
  • 453
11

Use readLines with 2 for the limit, parse it, paste0 them together, then read in with read.table with skip =2 and header=FALSE (the default). Finish the process off with assignment of the column names:

dat <- "trt biomass yield
 crop   Mg/ha   bu/ac
 C2 17.76   205.92
 C2 17.96   207.86
 CC 17.72   197.22
 CC 18.42   205.20
 CCW    18.15   200.51
 CCW    17.45   190.59
 P  3.09    0.00
 P  3.34    0.00
 S2 5.13    49.68
 S2 5.36    49.72
 "

You would probably use a file argument but using the text argument to the read-functions makes this more self-contained:

 readLines(textConnection(dat),n=2)
#[1] "trt\tbiomass\tyield" "crop\tMg/ha\tbu/ac" 
 head2 <- read.table(text=readLines(textConnection(dat),n=2), sep="\t", stringsAsFactors=FALSE)
 with(head2, paste0(head2[1,],head2[2,]) )
# [1] "trtcrop"      "biomassMg/ha" "yieldbu/ac"  
 joinheadrs <- with(head2, paste0(head2[1,],head2[2,]) )

newdat <- read.table(text=dat, sep="\t",skip=2)
colnames(newdat)<- joinheadrs
#-------------------
> newdat
   trtcrop biomassMg/ha yieldbu/ac
1       C2        17.76     205.92
2       C2        17.96     207.86
3       CC        17.72     197.22
4       CC        18.42     205.20
5      CCW        18.15     200.51
6      CCW        17.45     190.59
7        P         3.09       0.00
8        P         3.34       0.00
9       S2         5.13      49.68
10      S2         5.36      49.72

Might be better to use paste with an underscore-sep:

joinheadrs <- with(head2, paste(head2[1,],head2[2,] ,sep="_")  )
joinheadrs
#[1] "trt_crop"      "biomass_Mg/ha" "yield_bu/ac"  
IRTFM
  • 258,963
  • 21
  • 364
  • 487
11

Almost the same method to the other answers, just shortening to 2 statements:

dat <- "trt   biomass    yield
crop    Mg/ha    bu/ac
C2      17.76   205.92
C2      17.96   207.86
CC      17.72   197.22
CC      18.42   205.20
CCW     18.15   200.51
CCW     17.45   190.59
P       3.09    0.00
P       3.34    0.00
S2      5.13    49.68
S2      5.36    49.72"

header <- sapply(read.table(text=dat, nrow=2), paste, collapse="_")
result <- read.table(text=dat, skip=2, col.names=header)

Result:

> head(result,2)
  trt_crop biomass_Mg/ha yield_bu/ac
1       C2         17.76      205.92
2       C2         17.96      207.86
...
thelatemail
  • 91,185
  • 12
  • 128
  • 188
1

A slightly different explained step by step approach:

  1. Read only the first two lines of the files as data (without headers):

    headers <- read.table("data.txt", nrows=2, header=FALSE)
    
  2. Create the headers names with the two (or more) first rows, sappy allows to make operations over the columns (in this case paste) - read more about sapply here :

    headers_names <- sapply(headers,paste,collapse="_")
    
  3. Read the data of the files (skipping the first 2 rows):

    data <- read.csv(file="data.txt", skip = 2, header=FALSE)
    
  4. And assign the headers of step two to the data:

    names(data) <- headers_names
    

The advantage is that you would have clear control of the the parameters of read.table (such as sep for commas, and stringAsFactors - for both the headers and the data)

toto_tico
  • 17,977
  • 9
  • 97
  • 116
0

Here is a function to read in headers over multiple lines, largely based on Gavin Simpson's excellent answer.

The function defaults to comma separated values and two lines of header, and returns a data.frame with the first line in the file as header.

The function:

read.multi.line.header <- function(path, header.lines = 2, sep = ","){

  header <- scan(path, nlines = 1, what = character(), sep = sep)

  data <- read.table(path, skip = header.lines, header = FALSE, sep = sep)

  base::names(data) <- header

  return(data)
}

Produces:

mydata <- read.multi.line.header(path = "data.txt")

> head(mydata)
      trt      biomass      yield
1      C2        17.76     205.92
2      C2        17.96     207.86
3      CC        17.72     197.22
4      CC        18.42     205.20
5     CCW        18.15     200.51
6     CCW        17.45     190.59
Scransom
  • 3,175
  • 3
  • 31
  • 51