Read csv with two headers into a data.frame

Question

Apologies for the seemingly simple question, but I can't seem to find a solution to the following re-arrangement problem.

I'm used to using read.csv to read in files with a header row, but I have an excel spreadsheet with two 'header' rows - cell identifier (a, b, c ... g) and three sets of measurements (x, y and z; 1000s each) for each cell:

a           b       
x    y  z   x   y   z
10   1  5   22  1   6
12   2  6   21  3   5
12   2  7   11  3   7
13   1  4   33  2   8
12   2  5   44  1   9

csv file below:

a,,,b,,
x,y,z,x,y,z
10,1,5,22,1,6
12,2,6,21,3,5
12,2,7,11,3,7
13,1,4,33,2,8
12,2,5,44,1,9

How can I get to a data.frame in R as shown below?

cell x  y   z
a    10 1   5
a    12 2   6
a    12 2   7
a    13 1   4
a    12 2   5
b    22 1   6
b    21 3   5
b    11 3   7
b    33 2   8
b    44 1   9

You will find that you get better answers if you take the time to make your question reproducible. Please follow the guidelines (http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example), paying special attention to the part about `dput()`. Thanks! — Ari B. Friedman, Aug 16 '12 at 12:28
@AriB.Friedman, `dput()` is nice and I generally appreciate it, but how would you go about using `dput()` for a `read.csv()` problem? Hosting the CSV on Dropbox might be an alternative, but questions of permanence arise. I think *this* question--even without the edit by the OP--is reasonable, though a *little* bit more work for the rest of us. — A5C1D2H2I1M1N2O1R2T1, Aug 16 '12 at 16:04
@mrdwab I was thinking he could `dput(read.csv())`, since this isn't necessarily a read.csv problem but a reshaping post-read-in. — Ari B. Friedman, Aug 16 '12 at 17:54
@AriB.Friedman, actually, I didn't even know you can use `dput()` that way. Interesting to know! Thanks! Still, my "reading" was different, though. I guess I saw it more as a problem of *first* working with files with multiple column headers, and *second*, reshaping the data. I came to that conclusion from the second para, where the OP said they are used to using `read.csv()` but were somewhat stuck at the two header rows issue. — A5C1D2H2I1M1N2O1R2T1, Aug 16 '12 at 18:04
@mrdwab `read.csv()` returns a `data.frame` so you can `dput()` it. Ain't R grand! But point taken, and I enjoyed your explication. — Ari B. Friedman, Aug 16 '12 at 18:46
@AriB.Friedman, thanks for the comment. Like @mrdwab it wasn't initially clear to me that this was a case for `dput{)`, that's why I added the `csv` contents, but the comments here are useful. — user441706, Aug 17 '12 at 08:22

A5C1D2H2I1M1N2O1R2T1 · Accepted Answer · 2012-08-16T15:57:00.130

6

Use base R reshape():

temp = read.delim(text="a,,,b,,
x,y,z,x,y,z
10,1,5,22,1,6
12,2,6,21,3,5
12,2,7,11,3,7
13,1,4,33,2,8
12,2,5,44,1,9", header=TRUE, skip=1, sep=",")
names(temp)[1:3] = paste0(names(temp[1:3]), ".0")
OUT = reshape(temp, direction="long", ids=rownames(temp), varying=1:ncol(temp))
OUT
#     time  x y z id
# 1.0    0 10 1 5  1
# 2.0    0 12 2 6  2
# 3.0    0 12 2 7  3
# 4.0    0 13 1 4  4
# 5.0    0 12 2 5  5
# 1.1    1 22 1 6  1
# 2.1    1 21 3 5  2
# 3.1    1 11 3 7  3
# 4.1    1 33 2 8  4
# 5.1    1 44 1 9  5

Basically, you should just skip the first row, where there are the letters a-g every third column. Since the sub-column names are all the same, R will automatically append a grouping number after all of the columns after the third column; so we need to add a grouping number to the first three columns.

You can either then create an "id" variable, or, as I've done here, just use the row names for the IDs.

You can change the "time" variable to your "cell" variable as follows:

# Change the following to the number of levels you actually have
OUT$cell = factor(OUT$time, labels=letters[1:2])

Then, drop the "time" column:

OUT$time = NULL

Update

To answer a question in the comments below, if the first label was something other than a letter, this should still pose no problem. The sequence I would take would be as follows:

temp = read.csv("path/to/file.csv", skip=1, stringsAsFactors = FALSE)
GROUPS = read.csv("path/to/file.csv", header=FALSE, 
                  nrows=1, stringsAsFactors = FALSE)
GROUPS = GROUPS[!is.na(GROUPS)]
names(temp)[1:3] = paste0(names(temp[1:3]), ".0")
OUT = reshape(temp, direction="long", ids=rownames(temp), varying=1:ncol(temp))
OUT$cell = factor(temp$time, labels=GROUPS)
OUT$time = NULL

edited Aug 16 '12 at 15:57

answered Aug 16 '12 at 13:06

A5C1D2H2I1M1N2O1R2T1

190,393
28
405
485

What if the OP has another first label other than a trivial letter? Would it be possible to read only the first line? – Luciano Selzer Aug 16 '12 at 13:41
@lselzer, I've added an update. I don't think it's difficult--you just need to create an object that reads in just the first line and uses that for the "`cell`" labels. – A5C1D2H2I1M1N2O1R2T1 Aug 16 '12 at 15:58
@mrdwab thanks for the answer. It worked fine for me. I'm sort of glad it wasn't something entirely trivial. – user441706 Aug 17 '12 at 08:24
@mrdwab, while the first approach worked, for an extended file with lablels other than letters the penultimate line of your update fails with: `> OUT$cell = factor(temp$time, labels=GROUPS) Error in factor(temp$time, labels = GROUPS) : invalid labels; length 9 should be 1 or 0` – user441706 Aug 17 '12 at 09:41
@user441706, difficult to say without seeing the actual data. You might need to add `levels` to your `factor()` statement too. Double-check your `GROUPS` to make sure that it is the same length as the number of factor levels. (In your simplified example, the number of groups would be 2 (`a` and `b`), as are the number of factor levels.) – A5C1D2H2I1M1N2O1R2T1 Aug 17 '12 at 09:47
@mrdwab, Thanks, I changed your `OUT$cell = factor(temp$time, labels=GROUPS)` to `OUT$cell = factor(OUT$time, labels=GROUPS)` and preceeded it with `levels(OUT$time) <- length(GROUPS)` – that seemed to work. – user441706 Aug 17 '12 at 10:08

Read csv with two headers into a data.frame

1 Answers1

Update

Linked

Related