Advice on a loop function to subset data according to variables

Question

I have a dataset that has 453 variables (columns) and 119 observations (rows). It is comprised of 118 health observations for different countries over a number of years. For example, 10 of the 453 variables contain health data from Australia over a 10 year period; 8 of the 453 variables contain health data from Bangladesh over a 8 year period.

I want to subset these 453 variables into their own country-based data frames. The country name and year is in row 1 (e.g. Australia_2013, Australia_2014 etc). Seeing as though there are > 40 countries in this dataset, I would like to create a loop for this.

From what I've read so far, I think I should create a vector list of country names and then write a loop function that subsets data according to the vector list. All of the examples I can find are for subsetting based on rows however.

Can anyone point me in the right direction, or share example code for this?

Much thanks in anticipation

Please provide a small sample of your data. Making a solution to the question will be easier if we have a dataset to work with (and we can also see the data structure that you're working with). It doesn't have to be the actual data. Only include the columns that you think will be useful. [See here if you need help making a reproducible dataset as an example.](https://stackoverflow.com/a/5965451/6288065) — LC-datascientist, Dec 08 '20 at 13:50
If you find any of the answers below useful, please mark it as "solved". Thanks! — Pittoro, Dec 08 '20 at 15:17
@LC-datascientist - Apologies too for not giving an example dataset. The example from your answer below is a comparable version to what i have. — Rebecca, Dec 09 '20 at 03:25

score 0 · Answer 1 · answered Dec 08 '20 at 14:13

here is a dplyr answer, for version >= 1.0.

I created a small example, and we nest into the data column the different columns. Then since nest_by already created a rowwise grouped we can subset each data for the columns that starts with the country name. We need to convert this to a character. Finally, if needed you can pull the list-column subset to get a list of tibbles that contains the relevant columns. Of note, I think working with in a tidy format (long and not double info encoded (country and year) would be easier.

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

df <- data.frame(country = c("A", "B", "C"),
                 A_1 = 1:3,
                 A_2 = 3:5,
                 B_1 = 2:4,
                 C_3 = 1:3)
df
#>   country A_1 A_2 B_1 C_3
#> 1       A   1   3   2   1
#> 2       B   2   4   3   2
#> 3       C   3   5   4   3
nest_by(df, country) %>% 
  mutate(subset = list(select(data, starts_with(as.character(country))))) %>% 
  pull(subset)
#> [[1]]
#> # A tibble: 1 x 2
#>     A_1   A_2
#>   <int> <int>
#> 1     1     3
#> 
#> [[2]]
#> # A tibble: 1 x 1
#>     B_1
#>   <int>
#> 1     3
#> 
#> [[3]]
#> # A tibble: 1 x 1
#>     C_3
#>   <int>
#> 1     3

^{Created on 2020-12-08 by the reprex package (v0.3.0)}

Thanks for the help so far. Two brief follow ups: firstly, I had an error message: Error in nest_by(df, country) : could not find function "nest_by". dplyr attached in the same way yours did. Secondly, is this contingent on having a column for country name (as the data doesn't have this currently). — Rebecca, Dec 09 '20 at 02:49
as stated, you need `dplyr` version **>=1.0** to get `nest_by`. and yes, first step is to have a column with the country description, I though you had it. Then i would go with a `pivot_longer` approach — aurelien, Dec 09 '20 at 08:16

LC-datascientist · Accepted Answer · 2020-12-09T08:35:54.447

Based on your description, I assume your data looks something like this:

country_year <- c("Australia_2013", "Australia_2014", "Bangladesh_2013")
health <- matrix(nrow = 3, ncol = 3, data = runif(9))
dataset <- data.frame(rbind(country_year, health), row.names = NULL, stringsAsFactors = FALSE)

dataset
#                 X1                X2                 X3
#1    Australia_2013    Australia_2014    Bangladesh_2013
#2 0.665947273839265 0.677187719382346  0.716064820764586
#3 0.499680359382182 0.514755881391466  0.178317369660363
#4 0.730102791683748 0.666969108628109 0.0719663293566555

First, move your row 1 (e.g., Australia_2013, Australia_2014 etc.) to the column names, and then apply the loop to create country-based data frames.

library(dplyr)

# move header
dataset2 <- dataset %>% 
    `colnames<-`(dataset[1,]) %>%  # uses row 1 as column names
    slice(-1) %>% # removes row 1 from data
    mutate_all(type.convert) # converts data to appropriate type

# apply loop
for(country in unique(gsub("_\\d+", "", colnames(dataset2)))) {
    assign(country, select(dataset2, starts_with(country))) # makes subsets
}

Regarding the loop,

gsub("_\\d+", "", colnames(dataset2)) extracts the country names by replacing "_[year]" with nothing (i.e., removing it), and the unique() function that is applied extracts one of each country name.

assign(country, select(dataset2, starts_with(country))) creates a variable named after the country and this country variable only contains the columns from dataset2 that start with the country name.

Edit: Responding to Comment

The question in the comment was asking how to add row-wise summaries (e.g., rowSums(), rowMeans()) as new columns in the country-based data frames, while using this for-loop.

Here is one solution that requires minimal changes:

for(country in unique(gsub("_\\d+", "", colnames(dataset2)))) {
    assign(country, 
        select(dataset2, starts_with(country)) %>% # makes subsets
            mutate( # creates new columns
                rowSums = rowSums(select(., starts_with(country))),
                rowMeans = rowMeans(select(., starts_with(country)))
            )
    )
}

mutate() adds new columns to a dataset.

select(., starts_with(country)) selects columns that start with the country name from the current object (represented as . in the function).

Thanks so much! I could integrate this with my existing dataset very well. I really appreciate the explanations too. — Rebecca, Dec 09 '20 at 03:20
Can I ask a smaller follow up question - I want to create a summary column for each new data subset - calculating the total and mean across years. Using the Australia example from dataset2, I have applied this code: i <- c(1:2) Australia<-cbind(Australia, rowSums=rowSums(Australia[ , i], na.rm = TRUE),rowMeans=rowMeans(Australia[ , i], na.rm = TRUE)). Is this possible to add into the existing loop? — Rebecca, Dec 09 '20 at 03:22
@Rebecca - Yes, I appended the answer to address your question. — LC-datascientist, Dec 09 '20 at 08:36

Pittoro · Answer 3 · 2020-12-08T15:28:45.597

First of all, the data structure is not optimal, having 1st row as string means, numbers in all other rows (in each column) are also coded by R as strings. But that's not part of the question.

You cannot create a series of data frames, but you can if they are part of a list (that's what lists are in R for!), with one list element holding single country.

Pure base R approach, solution with a working example:

# example dataset df
data("mtcars")
df <- mtcars
df <- rbind(paste0(sample(letters, ncol(df), replace = TRUE), "_2014"), df)
str(df)

# solution
countries <- substr(df[1, ], 1, nchar(df[1, ]) - 5)
unique_countries <- unique(countries)
df <- rbind.data.frame(countries, df, stringsAsFactors = FALSE)

list_df_per_country <- list()

for (i in seq_along(unique_countries)) {
  list_df_per_country[[i]] <- df[which(df[1, ] == unique_countries[i])]
}

To use the code above, just save your dataframe as df, i.e. df <- your_dataframe, and run lines below the # solution, one by one.

Advice on a loop function to subset data according to variables

3 Answers3