Properly modify a data set with dplyr

Question

I have a data set I modified a lot, to the point where the code doesn't look very clean and tidy, and I need some help in order to put everything in a clean dplyr style, this is my code:

ddd_dataset <- read_excel("data/ddd_dataset.xlsx")
new_data = ddd_dataset[ddd_dataset$`Indicator name`=="Population covered by at least a 4G mobile network (%)",]

new_data = new_data[order(new_data$Country),]
new_data = spread(new_data[-c(1553, 1554), c(1,5,6)], Year, value = Value)

# Data imputation

new_data = new_data %>% pivot_longer(-Country, names_to = "year") %>% 
  mutate(value = value %>% as.numeric()) %>% 
  group_by(Country) %>% 
  fill(value, .direction = "updown") %>% 
  pivot_wider(names_from = year, values_from = value)

# Change column

itu_emi_countries <- read_csv("data/itu-emi-countries.csv")
itu_emi_countries <- itu_emi_countries %>% rename(Country = `ITU Name`)
new_data = left_join(new_data, itu_emi_countries, by.x = "Country", by.y = "Country")
new_data$Country = new_data$`EMI Name`
new_data = new_data[,1:10]

# Turn data into long format

new_long = 
  new_data %>% 
  pivot_longer(-Country, names_to = "year", values_to = "x") %>% 
  mutate(across(year, as.numeric))

Does anyone know how I can rewrite these functions into a single function that has the style of a dplyr function (using %>%)?

Without being able to run any of your code—we don't have access to the files you're reading, or know anything about them—it's hard to know what exactly you would want a function to do. [See here](https://stackoverflow.com/q/5963269/5325862) on making a reproducible example that is easier for folks to help with. — camille, Aug 13 '22 at 19:31
The data set is too large to put a `dput()` here, and I don't know how to read an excel file from github, if you can give me a hint I could change the code so as to make it possible to download the files directly from GitHub. I just want a function to do all that has been done above in a tidy function. — Saïd Maanan, Aug 13 '22 at 20:16
@camille but you can still find all the data in here: https://github.com/smaanan/Euromonitor/tree/main/data — Saïd Maanan, Aug 13 '22 at 20:24
The link in my comment gives several ways to include a sample of data — camille, Aug 14 '22 at 21:23

score 1 · Accepted Answer · answered Aug 17 '22 at 12:11

Literal, with inference and caveats:

library(dplyr)
library(tidyr) # pivot_*, complete, fill
# library(readr)
# library(readxl)

ddd_dataset <- readxl::read_excel("ddd_dataset.xlsx")
itu_emi_countries <- readr::read_csv("itu-emi-countries.csv") %>%
  rename(Country = `ITU Name`)

new_data <- ddd_dataset %>%
  filter(`Indicator name` == "Population covered by at least a 4G mobile network (%)") %>%
  mutate(Value = suppressWarnings(as.numeric(Value))) %>%
  pivot_wider(Country, names_from = Year, values_from = Value) %>%
  # we cannot impute before here, since some countries do not have all years, but now they will
  pivot_longer(-Country, names_to = "Year", values_to = "Value") %>%
  arrange(Country, Year) %>%
  group_by(Country) %>%
  fill(Value, .direction = "updown") %>%
  pivot_wider(Country, names_from = Year, values_from = Value)

new_long <- left_join(new_data, itu_emi_countries, by = "Country") %>%
  # inferring that you want to keep names for countries in new_data not present in itu
  mutate(Country = coalesce(`EMI Name`, Country)) %>%
  # inferring you want all but `EMI Name`, not just hard-coding 1:10
  select(-`EMI Name`) %>%
  pivot_longer(-Country, names_to = "year", values_to = "x") %>%
  mutate(year = as.integer(year))

new_data
# # A tibble: 196 x 10
#    Country             `2012` `2013` `2014` `2015` `2016` `2017` `2018` `2019` `2020`
#    <chr>                <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
#  1 Afghanistan            0      0        0    0     0       4      7     22     26  
#  2 Albania                0      0        0   35    80.2    85.3   85.5   95     98.4
#  3 Algeria                0      0        0    0     3.62   30.5   52.8   53.6   76.2
#  4 Andorra               50     50       50   50    50      85     85     85     85  
#  5 Angola                 7      7        7    7     8       8      8     18     30  
#  6 Antigua and Barbuda   65     78.6     80   98    99      99     99     99     99  
#  7 Argentina              0      0        0   65    85      85     90.8   91.2   97.7
#  8 Armenia               17.5   44       46   46.5  52.5    90.0   99.1   99.3  100  
#  9 Australia             52.2   85       95   94    98      99     99.2   99.4   99.5
# 10 Austria               31.6   58.4     85   98    98      98     98     98     98  
# # ... with 186 more rows

new_long
# # A tibble: 1,764 x 3
#    Country      year     x
#    <chr>       <int> <dbl>
#  1 Afghanistan  2012     0
#  2 Afghanistan  2013     0
#  3 Afghanistan  2014     0
#  4 Afghanistan  2015     0
#  5 Afghanistan  2016     0
#  6 Afghanistan  2017     4
#  7 Afghanistan  2018     7
#  8 Afghanistan  2019    22
#  9 Afghanistan  2020    26
# 10 Albania      2012     0
# # ... with 1,754 more rows

But it seems unnecessary and inefficient to pivot back and forth when you ultimately want it in long format in the end. One-step:

new_long2 <- ddd_dataset %>%
  filter(`Indicator name` == "Population covered by at least a 4G mobile network (%)") %>%
  left_join(itu_emi_countries, by = "Country") %>%
  mutate(
    Country = coalesce(`EMI Name`, Country),    # some `EMI Name` are missing
    Value = suppressWarnings(as.numeric(Value)) # "NULL" -> NA
  ) %>%
  complete(Country, Year) %>%
  arrange(Year) %>%
  group_by(Country) %>%
  fill(Value, .direction = "updown") %>%
  ungroup() %>%
  select(Country, year = Year, x = Value)

(The only difference in the data, other than order, is that Year is a numeric in this last block and is integer above. This can easily be remedied, over to you.)

Thank you, I thought nobody would answer. – Saïd Maanan Aug 18 '22 at 05:46 — Saïd Maanan, Aug 18 '22 at 05:46

Properly modify a data set with dplyr

1 Answers1