2

Possible Duplicate:
Read csv with two headers into a data.frame

I am new to R and working to use R to analyse some data. The data happens to be in Excel format and right now I'm struggling to find a way to convert it into a format that is R-friendly.

The issue is that the column headers have merged cells, so in effect the headers have two rows. I'd like to convert it into a normal set of 1-D vectors, add an extra column and a row. Let me explain with an example:

Currently the excel format looks thus:

  |  H  |  J  | 
Y |M |F |M |F | 
== == == == == 
Y1|V1|V2|V3|V4|

H,J are merged column headers and each of them span columns M and F.

The = indicate that the rows above are header rows

Given that H,J both are elements under, say R, I would like to convert this into a columnar format with a normal header and two rows, like this

Y |R |M |F |
== == == ==
Y1|H |V1|V2|
Y1|J |V3|V4|

Does anyone have an idea how to do this?

Community
  • 1
  • 1
Romit
  • 23
  • 5
  • 2
    Does [this question](http://stackoverflow.com/q/11987103/1270695) help you get started? – A5C1D2H2I1M1N2O1R2T1 Aug 26 '12 at 16:57
  • Welcome to Stack Overflow! If you made a [reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) that demonstrates your question / problem, we would find it easier to answer. – Andrie Aug 26 '12 at 17:04
  • Thanks @mrdwab you're right, the question asked is the same as referred to in that post. I must have missed searching for it! – Romit Aug 26 '12 at 17:22
  • 1
    @Romit, I've posted an answer for reference, but since you've mentioned that your problem is the same as the linked question, I'm recommending closing this question as a duplicate of the other one. – A5C1D2H2I1M1N2O1R2T1 Aug 26 '12 at 18:07

1 Answers1

1

First, some assumptions:

  • The merged headings are on the first line of the CSV
  • The merged headings start in the second column of the CSV
  • The variable names in the second line of the CSV repeat (except for the variable in the first column)

Second, your data.

temp = c(",\"H\",,\"J\",", 
         "\"Y\",\"M\",\"F\",\"M\",\"F\"", 
         "\"Y1\",\"V1\",\"V2\",\"V3\",\"V4\"")

Third, a slightly modified version of this answer.

# check.names is set to FALSE to allow variable names to be repeated
ONE = read.csv(textConnection(temp), skip=1, check.names=FALSE,
               stringsAsFactors=FALSE)
GROUPS = read.csv(textConnection(temp), header=FALSE, 
                  nrows=1, stringsAsFactors=FALSE)
GROUPS = GROUPS[!is.na(GROUPS)]

# This can be shortened, but I've written it this way to show how
#   it can be generalized. For instance, if 3 columns were repeated
#   instead of 2, the rep statement could be changed to reflect that
names(ONE)[-1] = paste0(names(ONE)[-1], ".",
                        rep(GROUPS, each=(length(names(ONE)[-1])/2)))

Fourth, the actual reshaping of the data.

TWO = reshape(ONE, direction="long", ids=1, varying=2:ncol(ONE))
# And, here's the output.
TWO
#      Y time  M  F id
# 1.H Y1    H V1 V2  1
# 1.J Y1    J V3 V4  1
Community
  • 1
  • 1
A5C1D2H2I1M1N2O1R2T1
  • 190,393
  • 28
  • 405
  • 485