How to work with Unstructured Excel Column/Header names which are spread across multiple rows using R

Question

There is an excel sheet http://www.censusindia.gov.in/2011census/C-series/C08.html Please refer to the Excel sheet For row "India" for Column "C-08".

I want to analyse these data in R.However , the Excel Headers or column names are unstructured .Some headers are located in the first row, others are located in either the 2nd,3rd, or 4th row. Beneath the 4th row is the first subset of data we want to generate graphs from, there are multiple subsets as you go down the excel sheet. Each of these subsets is separated by an empty row. The excel sheet isn't in a format that can be analysed in R.

Please suggest some solution to the issue . Thank you so much in advance!!

`xlsx::read.xlsx(..., startRow=8)` Step one is just to read the file in and set proper headers afterwards. You could write rules to assign column names based on the headers, but depending on what you're trying to do it may be more trouble than it's worth. — Mako212, Sep 18 '17 at 21:55
I cant seem to find any empty rows in the sheet apart from within the first 7: http://www.censusindia.gov.in/2011census/C-series/C-08/DDW-0000C-08.xlsx — jojomojo, Sep 18 '17 at 21:57
Thank you so much @Mako212 for suggesting the alternative. Just curious to know , if in real life project cycle analysts will go with the approach as suggested by you . — Priya Das, Sep 19 '17 at 17:48
@PriyaDas If this was a one time thing for one file, I'd probably just do what I mentioned above. If I had 100 files with a consistent pattern I could take advantage of, I'd definitely look at writing a programmatic solution to assign the headers. There's some inconsistencies with your file that make this more than a trivial process to do, bu if you look [here](https://stackoverflow.com/questions/11987103/read-csv-with-two-headers-into-a-data-frame) you can see one approach. — Mako212, Sep 19 '17 at 18:16
@PriyaDas If you were going to try and assign the headers with R, you'd read each line of the header separately (i.e. `h1`, `h2`, `h3`, etc.), repeat each one as necessary, and then use `paste` to combine them. Even doing that gives you long, unwieldy names that will be annoying to use though. You'd end up with `Literate_Without_Education_level_Males`. If I were setting headers I might choose something more efficient like `lit_no_edu_lvl_M`. — Mako212, Sep 19 '17 at 18:17
@PriyaDas As a final thought though, say you have a sheet that looks just like this for say 20 years. Assuming the headers are the same, there's no reason to muck around trying to create the names with R from the source file. Just define your `names()` vector once, and then assign that header to each new file (again, assuming the headers are consistent between files) — Mako212, Sep 19 '17 at 18:20

How to work with Unstructured Excel Column/Header names which are spread across multiple rows using R

0 Answers0