0

I extensively searched and found multiple questions and answers regarding similar situations but not what I'm facing here, so i'm relying on making this question.

I have a messy dataframe in R that have this structure, I obtained it just by running a query inside R after connecting to the database that have all the data stored.

id     location     year     category     subCategory     numericValue  Unit
 1      place_1      1960      cat_1          subcat_1           34.567  kg
 2      place_1      1961      cat_1          subcat_1          564.566  kg
 3      place_1      1962      cat_1          subcat_1           3.4356  kg
 4      place_1      1963      cat_1          subcat_1           654.44  kg
 5      place_1      1960      cat_1          subcat_2           409.65  cm
 6      place_1      1961      cat_1          subcat_2           985545  cm
 7      place_1      1962      cat_1          subcat_2           5436.4  cm
 8      place_1      1963      cat_1          subcat_2           324.23  cm
...        
...
...     
341      place_1      1960      cat_2          subcat_1          487.65  cm³
342      place_1      1961      cat_2          subcat_1          434.65  cm³
343      place_1      1962      cat_2          subcat_1          421.65  cm³
344      place_1      1963      cat_2          subcat_1           87.65  cm³
345      place_1      1960      cat_2          subcat_2            0.33  ton
346      place_1      1961      cat_2          subcat_2            1.65  ton
347      place_1      1962      cat_2          subcat_2            89.2  ton
348      place_1      1963      cat_2          subcat_2            1345  ton
...
...
12334    place_2      1960      cat_1          subcat_1            1111  kg
12334    place_2      1961      cat_1          subcat_1            2222  kg
12334    place_2      1962      cat_1          subcat_1            3333  kg
12334    place_2      1960      cat_1          subcat_2            4444  cm
12334    place_2      1961      cat_1          subcat_2            5555  cm
12334    place_2      1962      cat_1          subcat_2            6666  cm
...
...
...
99999    place_67     1982      cat_5          subcat_9           3455 Watt  

To try and explain what's going on: I have a numeric value, the actual data, measured and categorized into a category and subcategory, it was obtained in a location on a given year. There are blocks of years for the combination of category+subcategory for a given location and it goes on until all locations were observed within all years and every category and it's subcategories.

What I want to obtain is a dataframe that have each category+subcategory combined and have the data measured under it's value, effectively making a very long dataframe into a very wide one.

id  location  year  cat_1.subcat_1  cat_1.subcat_2  ...  cat_5.subcat_9
 1  place_1   1960       34.567          409.65                  NA
 2  place_1   1961      564.566          985545                  NA
 3  place_1   1962       3.4356          5436.4                 444
 4  place_1   1963       654.44          324.23                 0.5
 5  place_2   1960         1111            4444            3.22e-04
...
350 place_67  1982           NA              NA                3455

I can sort of achieve this with a procedure I made on the database that have this all stored but it's very very costly to run it. I have no idea of what to do with the Unity column, I don't think I can incorporate it anywhere, cause It's unique to the measurement itself and not a combination like the category/subcategory but it's not critical if it ends up thrown away.

I can answer any doubts if things are not very clear.

Thanks!

thelatemail
  • 91,185
  • 12
  • 128
  • 188
  • I see it is similar to my issue indeed. But is there a way to keep the Unit column? – RafaeL Stavale Oct 27 '17 at 20:23
  • 'Unit' is just a constant within a group, so add it to the list of id variables along with id/location. – thelatemail Oct 27 '17 at 20:28
  • I tried your answer on the question you aforementioned using reshape and applying Unit to the vector of id variables, but unfortunately the method did not work , with or without Unit, showing the following error Error in paste(rep(l, length(lvs)), rep(lvs, each = length(l)), sep = sep) cannot allocate vector of length 1695223112 But on the good side the tidyverse response on the same question you appointed duplicated worked, but of course without Unit. It is a loss but one that is not critical as i mentioned – RafaeL Stavale Oct 27 '17 at 21:07
  • Wow... how big is your data that you ended up with a 1.7 **b**illion length vector? I think something has gone wrong there, clearly. Maybe `id` shouldn't be part of the idvars as it is essentially just a row number. – thelatemail Oct 27 '17 at 21:11
  • Thank you very much for pointing out another question that helped me workaround my issue.:) – RafaeL Stavale Oct 27 '17 at 21:17
  • Yeah this is my little big data, the ID column is disposable, just a remnant of the database – RafaeL Stavale Oct 27 '17 at 21:17
  • But yeah it's not that big, i mean that vector was somehow misconstructed, i dont have stuff on the scale of billions. – RafaeL Stavale Oct 27 '17 at 21:18

1 Answers1

0

You may look for that (<3 tidyverse <3 ):

 ## Building fake dataset
  df <- expand.grid(location = c("place1", "place2"),
              catg = c("cat1", "cat2"),
              subcateg = c("subcat1", "subcat2", "subcat3"),
              year = (2000:2001))
  df$value <- rpois(nrow(df), 10)

## the operation you may look for
  df %>% unite(catg.subcateg, catg,subcateg) %>% spread(catg.subcateg, value)
Flecflec
  • 233
  • 3
  • 9
  • I appreciate the answer and I tried applying it to the problem, with some variances as well by my part, but I didn't obtain the desired result, the way it interacts with Unit column and/or other columns is not expected. But still thanks ! – RafaeL Stavale Oct 27 '17 at 21:14