1

I am working with a dataset of countries, with different values for different points in time. There is one observation each month, so I have used the as.date fucntion such that dates are 01-07-2018, 01-08-2018 etc. For each country and each date exist a corresponding value. I want to transform this into an dataframe, where all the countries values are aggregated at a date. I have tried the reshape function such that

reshape(Origin_wide, idvar = "Origin", timevar = "V5", direction = "wide")

V5 being the date variable and Origin being the country. This is as far as I can tell dropping the as.date set, such that the dates and the corresponding values is no longer treated as such. Can it be done in a more smooth way ? The time series aspect of the data series disseapears ones the reshape function is lost, such that i cannot difference the time series or plot it by date etc. Furthermore the dates are getting a "value" prefix, which is the same problem as far as I can tell.

Here is the data structure as it is:

|       V5            |        Origin    |  Value   |
|---------------------|------------------|----------|
|      01-09-2017     |        USA       |     45   |
|---------------------|------------------|----------|
|      01-10-2017     |        USA       |     47   |
|---------------------|------------------|----------|
|      01-11-2017     |        USA       |     49   |
|---------------------|------------------|----------|
|      01-09-2017     |        Canada    |     7    |
|---------------------|------------------|----------|
|      01-10-2017     |        Canada    |     13   |
|---------------------|------------------|----------|
|      01-11-2017     |        Canada    |     17   |
|---------------------|------------------|----------|

And here is how I would like it to look:

    |       V5            |        Canada    |  USA     |
    |---------------------|------------------|----------|
    |      01-09-2017     |        7         |     45   |
    |---------------------|------------------|----------|
    |      01-10-2017     |        13        |     47   |
    |---------------------|------------------|----------|
    |      01-11-2017     |        17        |     49   |
    |---------------------|------------------|----------|

Hope this makes sense. To reproduce a random small version of the data :

    set.seed(1)
Data <- data.frame(Value = sample(1:10), Origin = sample(c("Mexico", "USA","Canada"), 10, replace = TRUE))
dates <- sample(seq(as.Date('2018/01/01'), as.Date('2018/05/01'), by="month"), 10, replace = TRUE)
Data <- cbind(dates,Data)

A look on the data as produced by the code enter image description here

As it is clear here, the values are not defined for all the dates. When this is the case, the value for that date is = 0. So in my first try with reshape it produces NA's for all the dates where there where no observations, which was perfect, because i was able to just put in 0's.

NelsonGon
  • 13,015
  • 7
  • 27
  • 57
MNielsen
  • 13
  • 4
  • 1
    Could you maybe provide a minimal reproducible example, [link](https://stackoverflow.com/help/mcve) including your data and your expected outcome. – Hjalmar Nov 09 '18 at 22:21
  • Yes of course. It is here now ! – MNielsen Nov 09 '18 at 22:44
  • I appreciate you adding some data, but you are much more likely to get help here if your data is reproducible. The comment above included a link that explains how to do that - please read it. Here is specific r [examples](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) – Conor Neilson Nov 09 '18 at 23:09
  • I am sorry for my lack of particularity regarding this question. I have added som code that should recreate a version of the data I am working with. Please let me know if you need more. – MNielsen Nov 10 '18 at 00:07

1 Answers1

0

Recreate your data

library(tidyverse)

tbl <- tibble(
  V5 = rep(c("01-09-2017", "01-10-2017", "01-11-2017"), 2),
  Origin = rep(c("USA", "Canada"), each = 3),
  Value = c(45, 47, 49, 7, 13, 17)
)

Code

tbl %>%
  spread(Origin, Value)

Result

# A tibble: 3 x 3
  V5         Canada   USA
  <chr>       <dbl> <dbl>
1 01-09-2017      7    45
2 01-10-2017     13    47
3 01-11-2017     17    49

I am not sure it is a good idea to replace the NA by 0 because those truly are missing values, not values of 0. But if you really want to do this, then you can do:

result <- tbl %>%
  spread(Origin, Value)

result[is.na(result)] <- 0

Note: your toy example is random and does not represent the pattern of your actual data. Running the code on it outputs a pretty ugly wide format. When you create a toy example, try to make sure that it reflects the characteristics of your real data.

prosoitos
  • 6,679
  • 5
  • 27
  • 41
  • Thanks ! This seems to be exactly what I need. I get the dates as column, is there anyway to get them as rownames, but still keep the formatting, such that I am able to perform operation based on the date values? Such that I can take diff() and ge the first difference. If I try this now, i get the error_: Error in r[i1] - r[-length(r):-(length(r) - lag + 1L)] : non-numeric argument to binary operator – MNielsen Nov 10 '18 at 10:03
  • If you want to transform your `V5` variable as rownames, you can use `tibble::column_to_rownames()` (Note that you didn't ask anything about that in your question though, so I won't edit my answer or it would not answer your question anymore). – prosoitos Nov 10 '18 at 15:21
  • I did miss your last question about `NA` though, so I will edit my answer for this. (And if my comment above is not clear enough and if you edit your question, I will add the `column_to_rownames()` in my answer as well). – prosoitos Nov 10 '18 at 15:26
  • Thank you for your help. – MNielsen Nov 10 '18 at 15:28
  • I have tried to figure out was wrong with my example. Could you elaborate on your note regarding this? – MNielsen Nov 10 '18 at 15:46
  • Are you referring to your toy example? If so, you can see that your real sample data has a pattern (`USA` and `Canada` have the same number of observations at the same dates. This leads to a wide format with no `NA`). In contrast, your toy example is created with random sampling without any repetition pattern. So that gives a wide format with lots of `NA` everywhere. – prosoitos Nov 10 '18 at 15:50
  • Sorry, my first example was not representative of my data. My toy example was. I am new to stackoverflow - I will be more precise the next time. Thank you for taking your time to help me, and point out the wrongs ! – MNielsen Nov 10 '18 at 17:35
  • No worries! I am glad I could help – prosoitos Nov 10 '18 at 20:32