1

I have a data set containing data sorted in rows like this:

*VarName1* - *VarValue1*
*VarName2* - *VarValue2*
*Etc.*

I want it to be that the VarNames become individual columns. I have achieved this by using the following code:

DFP1 <- as.data.frame(t(DFP)) #DFP contains the data

Now, this is a very big data set. It contains multiple years (millions of rows) of data. Above code creates a dataframe which has > 1E6 columns. I need to split these columns by each entry. I saw that in the first piece of data, a new entry recurs at every 86th column. So, I tried this:

tmp <- data.frame(
       X = DFP$noFloat,
       ind = rep(1:86, nrow(DFP)/86)
)

y <- rbind(DFP$nmlVar[1:86], unstack(tmp, X~ind))

This works for a few rows. The problem is that the number of variables increased over the years and that I cannot simply assume that the number of variables per entry are the same. This results in variable values mismatching it's names. I am looking for a way to match variables and values based on their variable names.

I am new to advanced data-analysis, so please let me know if you need anything more.

EDIT: I created some sample data of how DFP looks like, to hopefully make you better understand my question:

DFP <- data.frame(
    nmlVar = c("Batch", "Mass", "Length", "Product","Batch", "Mass", 
    "Length", "Product", "Batch", "Mass", "Length", "Width", "Product"),

    noFloat = c(254578, 20, 24, 24547, 254579, 23, 24, 24547, 254580, 20, 
    24, 19, 24547)
)

Important to note here is the apperance of new variable width in the third recurrence. This is typical for my dataset, introduction of new variables. The key indicator here is batch and it should be split at each time the variable batch appears.

dput output of sample data:

structure(list(nmlVar = structure(c(1L, 3L, 2L, 4L, 1L, 3L, 2L, 
4L, 1L, 3L, 2L, 5L, 4L), .Label = c("Batch", "Length", "Mass", 
"Product", "Width"), class = "factor"), noFloat = c(254578, 20, 
24, 24547, 254579, 23, 24, 24547, 254580, 20, 24, 19, 24547)), .Names = c("nmlVar", 
"noFloat"), row.names = c(NA, -13L), class = "data.frame")
Kevin
  • 61
  • 6
  • 2
    Can you please give an example of what the actual data.frame `DFP` looks like with `dput`. (Not the size but the structure.) – Axeman Oct 02 '17 at 08:42
  • 1
    Related: (1) [*Transpose / reshape dataframe without “timevar” from long to wide format*](https://stackoverflow.com/questions/11322801/transpose-reshape-dataframe-without-timevar-from-long-to-wide-format) or (2) [*How to reshape data from long to wide format?*](https://stackoverflow.com/questions/5890584/how-to-reshape-data-from-long-to-wide-format) – Jaap Oct 02 '17 at 08:45
  • Use `tidyr` and `dplyr`. Check out the Rstudio data import and transformation cheat sheets https://www.rstudio.com/resources/cheatsheets/ – Scransom Oct 02 '17 at 08:47
  • @Axeman I am unable to do so by confidentiality of the project. Is there any other way I can help you to get a look at the structure without revealing column names nor data? – Kevin Oct 02 '17 at 08:51
  • 1
    See: [*How to make a great R reproducible example*](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example/5963610) – Jaap Oct 02 '17 at 08:52
  • If you transpose you will get, for example, three columns of `MASS`. Are you sure you don't want to just make it wide? i.e. `library(tidyverse); DFP %>% group_by(nmlVar) %>% mutate(new = seq(n())) %>% spread(nmlVar, noFloat)`? – Sotos Oct 02 '17 at 09:14
  • @Sotos, `width = 19` belongs to the last `Batch` (I think), your method puts it in the first. – Axeman Oct 02 '17 at 09:16
  • @Axeman oh, I see... – Sotos Oct 02 '17 at 09:17

1 Answers1

2

Is this what you are after?:

library(dplyr)
library(tidyr)
DFP %>% 
  mutate(sample = cumsum(nmlVar == 'Batch')) %>% 
  spread(nmlVar, noFloat)

Gives:

  sample  Batch Length Mass Product Width
1      1 254578     24   20   24547    NA
2      2 254579     24   23   24547    NA
3      3 254580     24   20   24547    19
Axeman
  • 32,068
  • 8
  • 81
  • 94