
The World Development Indicators dataset looks as follows:

library(data.table)
WDI <- fread("CountryName   CountryCode IndicatorName   IndicatorCode   1960    1961    2017
ArabWorld   ARB A   FX.OWN.TOTL.ZS  37.16521072 37.16521072 37.16521072
ArabWorld   ARB B   FX.OWN.TOTL.FE.ZS   25.63540268 25.63540268 25.63540268
ArabWorld   ARB C   FX.OWN.TOTL.MA.ZS   48.32851791 48.32851791 48.32851791
ArabWorld   ARB D   FX.OWN.TOTL.OL.ZS   42.54204559 42.54204559 42.54204559
ArabWorld   ARB E   FX.OWN.TOTL.40.ZS   27.72478104 27.72478104 27.72478104
ArabWorld   ARB F   FX.OWN.TOTL.PL.ZS   26.45811081 26.45811081 26.45811081
ArabWorld   ARB G   FX.OWN.TOTL.60.ZS   43.44695282 43.44695282 43.44695282
ArabWorld   ARB H   FX.OWN.TOTL.SO.ZS   48.66697693 48.66697693 48.66697693
ArabWorld   ARB I   FX.OWN.TOTL.YG.ZS   20.95479965 20.95479965 20.95479965
", header = TRUE)

I used the following code to reshape the World Development Indicators database from the World Bank.

library(dplyr)
library(tidyr)

WDI <- WDI %>% 
  select(-`IndicatorName`) %>% 
  gather(Year, val, `1960`:`2017`) %>% 
  spread(`IndicatorCode`, val)

It used to work without any issues, but for some reason it now requires more memory than is available and the operation fails to complete.

I have tried removing all other datasets from the workspace, calling gc(), closing all other programs on the computer, and reducing the range of years used in the reshape, but none of that fixed the problem. Is there a less memory-intensive way to do this reshape?


EDIT 1: According to this post, dcast.data.table or reshape is the way to go to avoid running out of memory. I am, however, having a lot of trouble rewriting the syntax (I got the dplyr answer by posting a question as well). How would I rewrite the dplyr code using dcast/reshape?

How do the verbs select, gather, and spread relate to dcast and reshape?
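For reference, a rough correspondence between the tidyr verbs and their data.table counterparts, sketched against the example table created with fread above (the space-free column names are assumed):

```r
library(data.table)

# Rough correspondence:
#   select() -> choose id.vars/measure.vars (columns in neither are dropped)
#   gather() -> melt()  : wide year columns become long (Year, val) pairs
#   spread() -> dcast() : long indicator rows become wide columns

long <- melt(WDI,
             id.vars = c("CountryName", "CountryCode", "IndicatorCode"),
             measure.vars = c("1960", "1961", "2017"),
             variable.name = "Year", value.name = "val")

wide <- dcast(long, CountryName + CountryCode + Year ~ IndicatorCode,
              value.var = "val")
```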


EDIT 2: I first tried to melt the data.table as follows:

WDI <- melt(WDI, id.vars = c("IndicatorCode", "CountryName", "CountryCode"),
            # measure.vars = -c("IndicatorCode", "CountryName", "CountryCode", "IndicatorName")
            measure.vars = c("1960", "1961", "2017"))
setnames(WDI, "variable", "year")
WDI <- dcast(WDI, CountryName + CountryCode + year ~ IndicatorCode, value.var = "value")

But then I get the warning "Aggregation function missing: defaulting to length", and all entries are just 1 instead of the values. This apparently happens when combinations of entries are not unique. I am, however, pretty sure that they are unique (the combination of country and indicator should make each entry unique).
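One quick way to check that claim is to count rows per key combination after the melt; any count above 1 is what triggers the length() fallback. A minimal sketch with made-up data (the duplicated row is deliberate, to reproduce the warning):

```r
library(data.table)

dt <- data.table(
  CountryName   = "ArabWorld",
  CountryCode   = "ARB",
  IndicatorCode = c("FX.OWN.TOTL.ZS", "FX.OWN.TOTL.ZS", "FX.OWN.TOTL.FE.ZS"),
  year          = c("1960", "1960", "1960"),
  value         = c(37.2, 37.2, 25.6)
)

# Rows 1 and 2 share the same key, so dcast() would fall back to length()
dups <- dt[, .N, by = .(CountryName, CountryCode, year, IndicatorCode)][N > 1]
dups  # any row here means the warning will appear
```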

  • Seems relevant: [What methods can we use to reshape VERY large data sets?](https://stackoverflow.com/questions/55077668/what-methods-can-we-use-to-reshape-very-large-data-sets) – markus Jul 08 '19 at 09:36
  • I have put it in the post. – Tom Jul 08 '19 at 09:40
  • 5
    please help to clarify what is the gap in your understanding between the referenced post and applying it to your data set – MichaelChirico Jul 08 '19 at 09:53
  • I simply don't understand how to move from the dplyr syntax to dcast/reshape, or how to adapt the example post given. I'm having trouble explaining exactly why; I just don't see it. – Tom Jul 08 '19 at 10:28
  • 2
    I can't read your data into 7 columns. Please make your example reproducible. – David Arenburg Jul 08 '19 at 22:28
  • @DavidArenburg Not sure what happened, should work now. – Tom Jul 09 '19 at 07:23
  • It still reads as two columns, see `dim(WDI)`. You have spaces in the values which confuses `fread` – David Arenburg Jul 09 '19 at 07:32
  • @DavidArenburg I have removed all spaces now. – Tom Jul 09 '19 at 07:36
  • 1
    Try something like `dcast(melt(WDI, id = c(1:2, 4), measure = patterns("\\d+")), ... ~ IndicatorCode)` . Or you can pipe it if it easier to read: `melt(WDI, id = c(1:2, 4), measure = patterns("\\d+")) %>% dcast(., ... ~ IndicatorCode)` – David Arenburg Jul 09 '19 at 08:39
  • @DavidArenburg Wow, so that worked quick snap. Thank you so much. No hurry, but I would be really interested in the explanation behind this.. – Tom Jul 09 '19 at 08:51
  • 1
    The `melt` part is exactly like you did, I just used two shortcuts: First, I've used columns locations instead of their names. Second- I've used the `patterns` (see `?patterns`) function in order to get all the columns I want to measure at once (`\\d+` is just a lazy regex for taking the columns with numeric titles). In `dcast` I also did something very similar to your attempt but with yet another shortcut- `...` just means "take all the columns". data.table figures by itself that it needs to use `value` as the column for aggregation. You don't really need all these shortcuts- I'm just lazy. – David Arenburg Jul 09 '19 at 08:59
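Assembling the suggestion from the comments into one runnable snippet against the example table (column positions 1, 2 and 4 are CountryName, CountryCode and IndicatorCode in the fread example above):

```r
library(data.table)

# patterns("\\d+") selects every column whose name contains digits (the years);
# columns in neither id nor measure (here IndicatorName) are dropped
long <- melt(WDI, id = c(1:2, 4), measure = patterns("\\d+"))

# "..." on the left-hand side of the formula means "all remaining columns";
# dcast() picks up "value" as the value column automatically
wide <- dcast(long, ... ~ IndicatorCode)
```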

1 Answer


The WDI dataset is not particularly large, so I suspect that your gather command is leaving out columns (e.g. an extra year column like 2018) whose values are unique per row; those columns then act as identifiers and cause an enormous amount of duplication when you spread.

You could be more explicit with your select command to avoid that by positively selecting only the columns you need rather than negatively excluding specific columns, like so...

library(readr)
library(dplyr)
library(tidyr)

url <- "http://databank.worldbank.org/data/download/WDI_csv.zip"
zippath <- tempfile(fileext = ".zip")
download.file(url, zippath)

csvpath <- unzip(zippath, files = "WDIData.csv", exdir = tempdir())

WDI <- readr::read_csv(csvpath)

WDI %>% 
  select(`Country Name`, `Country Code`, `Indicator Code`, `1960`:`2017`) %>% 
  gather(Year, val,`1960`:`2017`) %>% 
  spread(`Indicator Code`, val)

or you could make sure that any columns you do not need in the reshape are excluded, like so...

library(readr)
library(dplyr)
library(tidyr)

url <- "http://databank.worldbank.org/data/download/WDI_csv.zip"
zippath <- tempfile(fileext = ".zip")
download.file(url, zippath)

csvpath <- unzip(zippath, files = "WDIData.csv", exdir = tempdir())

WDI <- readr::read_csv(csvpath)

WDI %>% 
  select(-`Indicator Name`, -`2018`, -`X64`) %>% 
  gather(Year, val,`1960`:`2017`) %>% 
  spread(`Indicator Code`, val)

you can also avoid some intermediate size inflation by using gather's na.rm = TRUE option, which might speed things up a bit...

library(readr)
library(dplyr)
library(tidyr)

url <- "http://databank.worldbank.org/data/download/WDI_csv.zip"
zippath <- tempfile(fileext = ".zip")
download.file(url, zippath)

csvpath <- unzip(zippath, files = "WDIData.csv", exdir = tempdir())

WDI <- readr::read_csv(csvpath)

WDI %>% 
  select(-`Indicator Name`, -`2018`, -`X64`) %>% 
  gather(Year, val,`1960`:`2017`, na.rm = TRUE) %>% 
  spread(`Indicator Code`, val)

and for a bit more explanation, notice what happens if you "accidentally" don't include 2017 in your gather command using the example dataset you created above...

library(data.table)
WDI <- fread("CountryName   CountryCode IndicatorName   IndicatorCode   1960    1961    2017
ArabWorld   ARB A   FX.OWN.TOTL.ZS  37.16521072 37.16521072 37.16521072
ArabWorld   ARB B   FX.OWN.TOTL.FE.ZS   25.63540268 25.63540268 25.63540268
ArabWorld   ARB C   FX.OWN.TOTL.MA.ZS   48.32851791 48.32851791 48.32851791
ArabWorld   ARB D   FX.OWN.TOTL.OL.ZS   42.54204559 42.54204559 42.54204559
ArabWorld   ARB E   FX.OWN.TOTL.40.ZS   27.72478104 27.72478104 27.72478104
ArabWorld   ARB F   FX.OWN.TOTL.PL.ZS   26.45811081 26.45811081 26.45811081
ArabWorld   ARB G   FX.OWN.TOTL.60.ZS   43.44695282 43.44695282 43.44695282
ArabWorld   ARB H   FX.OWN.TOTL.SO.ZS   48.66697693 48.66697693 48.66697693
ArabWorld   ARB I   FX.OWN.TOTL.YG.ZS   20.95479965 20.95479965 20.95479965
", header = TRUE)

library(dplyr)
library(tidyr)

WDI %>% 
  select(-`IndicatorName`) %>% 
  gather(Year, val,`1960`:`1961`) %>% 
  spread(`IndicatorCode`, val)
  • 1
    Very nice! Also the addition of showing how to automatically download the file. – Tom Aug 01 '19 at 10:43