1

I am trying to reshape data but keep having errors. There are very helpful threads here on reshaping but I have not seen my particular issue. I am trying to go from long to wide - but have both country name and date as the id - typically I have seen solutions for making date the new columns.

#Data from:
library(wbstats)
WorldBank_long <- wb(indicator = c("NY.GDP.PCAP.KD", "SI.POV.GINI", "UNEMPSA_","CPTOTSAXMZGY" ),
                     startdate = 2005, enddate = 2019)
WorldBank <- WorldBank_long[,c("iso3c", "date", "value", "indicatorID")]

Columns in "WorldBank" are "iso3c" & "date" - which are both ID variables. "indicatorID" is the value I want to make new column names and "value" is the value I want for each new column.

I have tried:

library(tidyr)
WorldBank_data_wide<-WorldBank %>% 
  pivot_wider(names_from = indicatorID, values_from = value)

Error: Each row of output must be identified by a unique combination of keys. Keys are shared for 403 rows:

  • 6086, 6101, 6116, 6131, 6146, 6176, 6191, 6206, 6221, 6236, 6251, 6266, 6296, 6431
  • 6085, 6100, 6115, 6130, 6145, 6175, 6190, 6205, 6220, 6235, 6250, 6265, 6295, 6430
  • 6084, 6099, 6114, 6129, 6144, 6174, 6189, 6204, 6219, 6234, 6249, 6264, 6294, 6429
  • 6083, 6098, 6113, 6128, 6143, 6173, 6188, 6203, 6218, 6233, 6248, 6263, 6293, 6428
  • 6082, 6097, 6112, 6127, 6142, 6172, 6187, 6202, 6217, 6232, 6247, 6262, 6292, 6427
  • 6081, 6096, 6111, 6126, 6141, 6171, 6186, 6201, 6216, 6231, 6246, 6261, 6291, 6426
  • 6080, 6095, 6110, 6125, 6140, 6170, 6185, 6200, 6215, 6230, 6245, 6260, 6290, 6425
  • 6079, 6094, 6109, 6124, 6139, 6169, 6184, 6199, 6214, 6229, 6244, 6259, 6289, 6424
  • 6078, 6093, 6108, 6123, 6138, 6168, 6183, 6198, 6213, 6228, 6243, 6258, 6288, 6423
  • 6077, 6092, 6107, 6122, 6137, 6167, 6182, 6197, 6212, 6227, 6242, 6257, 6287, 6422
  • 6076, 6091, 6106, 6121, 6136, 6166, 618
library(data.table)
WorldBank_data_wide = dcast(WorldBank_long, date + iso3c ~indicator, value.var = 'value')

Error: Aggregation function missing: defaulting to length --- The columns were what I wanted BUT instead of the values, the values were 0 or 1.

EDITS: Current Table

iso3c    date        value    indicatorID
1   ARB 2019    6437.167    NY.GDP.PCAP.KD
2   ARB 2018    6465.474    NY.GDP.PCAP.KD
3   ARB 2017    6454.460    NY.GDP.PCAP.KD
4   ARB 2016    6506.271    NY.GDP.PCAP.KD
5   ARB 2015    6418.029    NY.GDP.PCAP.KD
6   ARB 2014    6350.838    NY.GDP.PCAP.KD
7   ARB 2013    6333.027    NY.GDP.PCAP.KD
8   ARB 2012    6274.111    NY.GDP.PCAP.KD
9   ARB 2011    6020.487    NY.GDP.PCAP.KD
10  ARB 2010    5934.291    NY.GDP.PCAP.KD
11  ARB 2009    5806.185    NY.GDP.PCAP.KD

What I want to do -- only instead of 0 and 1 to have the values from above (this is my output from (data.table):

     date iso3c CPTOTSAXMZGY NY.GDP.PCAP.KD SI.POV.GINI UNEMPSA_
1   2005    ABW 0   1   0   0
2   2005    AFG 0   1   0   0
3   2005    AGO 0   1   0   0
4   2005    ALB 0   1   1   0
5   2005    AND 0   1   0   0
6   2005    ARB 0   1   0   0
7   2005    ARE 0   1   0   0
8   2005    ARG 0   1   1   1
9   2005    ARM 0   1   1   1
10  2005    ASM 0   1   0   0
11  2005    ATG 0   1   0   0

EDIT 2

The new command looks like it is better than all other output - but all my values (i.e. GDP per capita or Gini) now are repeated so that all other columns are NA except for the one that is being defined. But scrolling down then GDP becomes NA and Gini has its values.

#using new command:

WorldBank_data_wide<-WorldBank %>%   
 mutate(row = row_number()) %>%   pivot_wider(names_from = indicatorID, values_from = value)

#output
  iso3c date row.  NY.GDP.PCAP.KDSI.POV.GINIUNEMPSA_CPTOTSAXMZGY
2138    BDI 2019    1126    208.0747    NA  NA  NA
2139    BDI 2018    1127    210.8042    NA  NA  NA
2140    BDI 2017    1128    214.1392    NA  NA  NA
2141    BDI 2016    1129    219.9615    NA  NA  NA
2152    BDI 2005    1140    221.0964    NA  NA  NA
2151    BDI 2006    1139    225.6222    NA  NA  NA
2150    BDI 2007    1138    225.8591    NA  NA  NA
2142    BDI 2015    1130    228.4325    NA  NA  NA
2149    BDI 2008    1137    229.1485    NA  NA  NA
2148    BDI 2009    1136    230.1926    NA  NA  NA
2592    ETH 2005    1580    233.9442    NA  NA  NA

Edit:

Another unforeseen issue - Some commands are working but instead of each row being a unique country and year - some are repeated so that there are multiple rows for Argentina in 2013 (for example)

World_bank_wide <- WorldBank %>%
  group_by(iso3c,indicatorID) %>%
  mutate(row_id=1:n()) %>% ungroup() %>%
  spread(indicatorID, value) %>%
  select(-row_id)

row iso3c date CPTOTSAXMZGY NY.GDP.PCAP.KD SI.POV.GINI UNEMPSA_
1   ABW 2005    NA  26979.8854  NA  NA
104 ARE 2011    NA  34634.862   NA  NA
105 ARE 2012    NA  35416.892   NA  NA
106 ARE 2013    NA  36978.833   NA  NA
107 ARE 2014    NA  NA  32.5    NA
108 ARE 2014    NA  38495.046   NA  NA
109 ARE 2015    NA  40247.747   NA  NA
110 ARE 2016    NA  41045.111   NA  NA
111 ARE 2017    NA  41460.283   NA  NA

  • Can you show what is your expected output? Heads only. – UseR10085 Jul 15 '20 at 09:30
  • Try this `WorldBank_data_wide<-WorldBank %>% mutate(row = row_number()) %>% pivot_wider(names_from = indicatorID, values_from = value)`. Can you show the output of `sessionInfo()`? – UseR10085 Jul 15 '20 at 09:36
  • Hello! Thank you so much for replying! Is there a way to show expected output without copying and pasting a tibble from R? – Mary Claire Evans Jul 15 '20 at 09:37
  • Visit [this](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) – UseR10085 Jul 15 '20 at 09:39
  • sessionInfo()R version 4.0.1 (2020-06-06) Platform: x86_64-apple-darwin17.0 (64-bit) Running under: macOS Catalina 10.15.5 Matrix products: default BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib locale: [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 – Mary Claire Evans Jul 15 '20 at 09:39
  • Thank you @BappaDas! I have added the current and expected output! – Mary Claire Evans Jul 15 '20 at 09:44

4 Answers4

0

Use the following code

library(tidyverse)

WorldBank_data_wide<-WorldBank %>% 
  mutate(row = row_number()) %>%
  pivot_wider(names_from = indicatorID, values_from = value)
UseR10085
  • 7,120
  • 3
  • 24
  • 54
  • Thank you so much for all your help! Something happens with this code - my new data frame "WorldBank_data_wide" becomes a vector that is empty! – Mary Claire Evans Jul 15 '20 at 10:02
  • Remove this `%>% write.csv("try.csv")` – UseR10085 Jul 15 '20 at 10:03
  • Thank you - removing the write.csv did help! However, the same issue is occurring where instead of one row for the country (iso3c) and year (date) there are multiple rows for each country and date for each new variable (and all other values are NA for that row) -- also oddly the number of observations goes from 6431 to 5519. – Mary Claire Evans Jul 15 '20 at 10:15
0

Some searching led to this solution:

 library(wbstats)
 WorldBank_long <- wb(indicator = c("NY.GDP.PCAP.KD", "SI.POV.GINI", "UNEMPSA_","CPTOTSAXMZGY" ),
                      startdate = 2005, enddate = 2019)
 WorldBank <- WorldBank_long[,c("iso3c", "date", "value", "indicatorID")]
 World_bank_wide <- WorldBank %>%
   group_by(iso3c,indicatorID) %>%
   mutate(row_id=1:n()) %>% ungroup() %>%
   spread(indicatorID, value) %>%
   select(-row_id)

Output:

> World_bank_wide[1:3,]
# A tibble: 3 x 6
  iso3c date  CPTOTSAXMZGY NY.GDP.PCAP.KD SI.POV.GINI UNEMPSA_
  <chr> <chr>        <dbl>          <dbl>       <dbl>    <dbl>
1 ABW   2005            NA         26980.          NA       NA
2 ABW   2006            NA         27046.          NA       NA
3 ABW   2007            NA         27428.          NA       NA

I don't know how to post the output, but it is the same as you want as you listed above.

Edit: figured out how to post output. I was using RStudio which automatically fills outputs into tables. Instead, putting it into the console and copy/pasting the output worked.

Justin Cocco
  • 392
  • 1
  • 6
  • This is also so helpful! thank you! something happens though and there are some countries and years that are repeated - ideally, I want only one row with ARG 2011 - but what is happening is there are multiple rows for each of my values. – Mary Claire Evans Jul 15 '20 at 10:04
  • Hm interesting. I'll see if I can help with that too. One thing you can help me with is: how the heck do you get the output printed out? I feel like I'm going crazy. – Justin Cocco Jul 15 '20 at 10:10
  • I am copying and pasting my output from view(WorldBank_data_long) and keeping it within the ``` (bc without doing that formatting is all off!) – Mary Claire Evans Jul 15 '20 at 10:12
  • Ah! Of course its such a simple solution lol. Thank you! – Justin Cocco Jul 15 '20 at 10:31
0

Based on some of your edits above, I think you may be looking for something like this. First you are creating a variable id, so that each row has a unique value (otherwise, you would return an error while using pivot_wider). Then, you are spreading your data from long to wide using pivot_wider, and then dropping the id variable. Finally, you're only selecting unique row values (dropping duplicates), and then sorting by iso3c and date.

Edit: Since you need to aggregate across rows, where NA values span your numeric columns, you can summarise where values are not NA to get one row per iso3c per date. Then you can replace your NA numeric values with 0.

library(wbstats)
library(tidyverse)

WorldBank_long <- wb(indicator = c("NY.GDP.PCAP.KD", "SI.POV.GINI", "UNEMPSA_","CPTOTSAXMZGY" ),
                     startdate = 2005, enddate = 2019)
WorldBank <- WorldBank_long[,c("iso3c", "date", "value", "indicatorID")]


WorldBank %>% 
  mutate(id = row_number()) %>% 
  pivot_wider(names_from = indicatorID, values_from = value) %>%
  select(-id) %>% 
  distinct() %>% 
  arrange(iso3c, date) %>% 
  group_by(iso3c, date) %>% 
  mutate_all(as.character) %>% 
  summarise(across(.cols = everything(), .fns = ~ .[!is.na(.)][1])) %>% 
  mutate_at(c("NY.GDP.PCAP.KD","SI.POV.GINI", "UNEMPSA_", "CPTOTSAXMZGY"), as.numeric) %>% 
  mutate(across(where(is.numeric), ~replace_na(., 0)))

This gives you a unique value for each iso3c and date, and fills all NA values with 0:

iso3c date  NY.GDP.PCAP.KD SI.POV.GINI UNEMPSA_ CPTOTSAXMZGY
   <chr> <chr>          <dbl>       <dbl>    <dbl>        <dbl>
 1 ARG   2005           8578.        48      11.6             0
 2 ARG   2006           9175.        46.7    10.2             0
 3 ARG   2007           9902.        46.6     8.45            0
 4 ARG   2008          10201.        45.3     7.86            0
 5 ARG   2009           9502.        44.1     8.67            0
 6 ARG   2010          10386.        44.5     7.75            0
 7 ARG   2011          10883.        42.7     7.15            0
 8 ARG   2012          10650.        41.4     7.21            0
 9 ARG   2013          10785.        41       7.08            0
10 ARG   2014          10399.        41.7     7.27            0
11 ARG   2015          10568.         0       6.61            0
12 ARG   2016          10239.        42       8.47            0
13 ARG   2017          10404.        41.2     8.35            0
14 ARG   2018          10044.        41.4     9.24            0
15 ARG   2019           9729.         0       0               0
Matt
  • 7,255
  • 2
  • 12
  • 34
  • Thank you for your help! The one thing that has happened - when there are countries (iso3c) with multiple values in a year (date) instead of one row that represents one year of one country - there becomes multiple rows one for each column with a new value. For example Argentina (ARG) in 2005 has a value for GDP, Gini and Unemployment so in the new data frame there is one line with ARG 2005 with value for GDP and 0 for Gini and Unemployment, then another row with ARG 2005 with the value for Gini but 0 for GDP and Unemployment, etc. – Mary Claire Evans Jul 15 '20 at 10:31
  • Perhaps - I need to first make individual data sets for each indicator ID - then MERGE them together? – Mary Claire Evans Jul 15 '20 at 10:33
  • @MaryClaireEvans I updated the post so you can aggregate the rows – Matt Jul 15 '20 at 10:53
  • Wow! thank you so very much for your help! This worked :) – Mary Claire Evans Jul 15 '20 at 11:04
  • @MaryClaireEvans great! happy to help. – Matt Jul 15 '20 at 11:08
0

I was able to get the dataset to be what I was looking for by merging each individual dataset one by one instead of trying to move from long to wide format.

library(wbstats)

nations<-wbcountries()

GDP <- wb(indicator = c("NY.GDP.PCAP.KD"),
                     startdate = 2005, enddate = 2019)
GDP <- GDP[,c("iso3c", "date", "value")]
names(GDP)
names(GDP)<-c("iso3c", "date", "GDP.PC")

#example of new dataset:    
   iso3c date  GDP.PC
1   ARB 2019    6437.167
2   ARB 2018    6465.474
3   ARB 2017    6454.460
4   ARB 2016    6506.271
5   ARB 2015    6418.029


Gini <- wb(indicator = c("SI.POV.GINI"),
                     startdate = 2005, enddate = 2019)
Gini <- Gini[,c("iso3c", "date", "value")]
names(Gini)
names(Gini)<-c("iso3c", "date", "Gini")

Unemp <- wb(indicator = c("UNEMPSA_"),
           startdate = 2005, enddate = 2019)
Unemp <- Unemp[,c("iso3c", "date", "value")]
names(Unemp)
names(Unemp)<-c("iso3c", "date", "UnemploymentRate")

Rate <- wb(indicator = c("CPTOTSAXMZGY" ),
                     startdate = 2005, enddate = 2019)
Rate <- Rate[,c("iso3c", "date", "value")]
names(Rate)
names(Rate)<-c("iso3c", "date", "Rate")

World_bank_wide <- merge(GDP, Gini, by = c("iso3c", "date"))
World_bank_wide <- merge(World_bank_wide, Unemp, by = c("iso3c", "date"))

#final output
    
   iso3c date    GDP.PC     Gini   UnemploymentRate
1   ARG 2005    8577.865    48.0    11.5604135
2   ARG 2006    9174.502    46.7    10.1507404
3   ARG 2007    9901.512    46.6    8.4456562
4   ARG 2008    10201.476   45.3    7.8568969
5   ARG 2009    9502.244    44.1    8.6676318
6   ARG 2010    10385.964   44.5    7.7457817
7   ARG 2011    10883.315   42.7    7.1539098

  • Great if you have found a solution, but it seems that you're missing a column, and a lot of `iso3c` values, such as ABW, AFG, AGO, ALB, and many others. – Matt Jul 15 '20 at 11:03