split dataframe into subsets based on character value

Question

I want to perform the same regression for 4 different socio-economic levels of the interviewed persons in the survey data.

for example:

educational_level (of subset 1) = ß0 + ß1*educational_level_father + ß2*race + ... +u 

educational_level (of subset 2)= ß0 + ß1*educational_level_father + ß2*race + ... +u

...and so on. How do I divide the data.frame based on the value of one specific variable (column) in it?

List of potential duplicates: [fit model to multiple groupings or subsets and extract original factor columns for data frame output](https://stackoverflow.com/questions/32119184/fit-model-to-multiple-groupings-or-subsets-and-extract-original-factor-columns-f), [Splitting data and fitting distributions efficiently](https://stackoverflow.com/questions/51328631/splitting-data-and-fitting-distributions-efficiently), [Fit a different model for each row of a list-columns data frame](https://stackoverflow.com/questions/41404198/fit-a-different-model-for-each-row-of-a-list-columns-data-frame). — 000andy8484, Aug 17 '18 at 13:04
You should note that Stack Overflow (SO) is not a code-writing service, but a question and answer site. Please take some time to read the help page, especially the sections named ["What topics can I ask about here?"](http://stackoverflow.com/help/on-topic) and ["What types of questions should I avoid asking?"](http://stackoverflow.com/help/dont-ask). And more importantly, please read [the Stack Overflow question checklist](http://meta.stackexchange.com/q/156810/204922). You might also want to learn about [Minimal, Complete, and Verifiable Examples](http://stackoverflow.com/help/mcve). — 000andy8484, Aug 20 '18 at 15:45

score 0 · Answer 1 · edited Aug 17 '18 at 08:38

One approach involves looping over the unique values in your subsetting column. Take a look at for and subset:

> data("iris")  ## A data set
> unique_species <- unique(iris$Species)  ## Get the unique values of the subsetting column
> results <- list()  ## Set up a list to store the regressions you will run within the loop
> for (species in unique_species) {  ## Loop over each unique value
+     data_subset <- subset(iris, iris$Species == species)  ## Subset based on the desired value
+     results[[species]] <- lm(Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width, 
+                                data=data_subset)  ## Run each regression
+ }

This will produce:

> results
$setosa

Call:
lm(formula = Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width, 
    data = data_subset)

Coefficients:
 (Intercept)   Sepal.Width  Petal.Length   Petal.Width  
      2.3519        0.6548        0.2376        0.2521  


$versicolor

Call:
lm(formula = Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width, 
    data = data_subset)

Coefficients:
 (Intercept)   Sepal.Width  Petal.Length   Petal.Width  
      1.8955        0.3869        0.9083       -0.6792  


$virginica

Call:
lm(formula = Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width, 
    data = data_subset)

Coefficients:
 (Intercept)   Sepal.Width  Petal.Length   Petal.Width  
      0.6999        0.3303        0.9455       -0.1698

For just 4 levels, this should be reasonably efficient.

000andy8484 · Answer 2 · 2018-08-20T15:44:41.287

A base-R solution would be:

dat.list <- split(x = YourData, f = as.factor(YourData$YourCharacter))
summary(lm(educ ~ educ_father, data=dat.list[[1]]))
summary(lm(educ ~ educ_father, data=dat.list[[2]]))
summary(lm(educ ~ educ_father, data=dat.list[[3]]))
summary(lm(educ ~ educ_father, data=dat.list[[4]]))

or, you could just assign the regression outcome to a list with a little for loop.
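As a minimal sketch of that idea, `lapply` can stand in for the explicit loop. The example uses the built-in `iris` data, with `Species` as an assumed stand-in for the socio-economic level column and `Sepal.Length ~ Sepal.Width` as a placeholder formula; swap in your own data and variables.

```r
# Split the data into a named list of data frames, one per group.
# iris$Species stands in for YourData$YourCharacter (an assumption for illustration).
dat.list <- split(iris, iris$Species)

# Fit the same regression on every subset; the result is a named list of lm objects.
models <- lapply(dat.list, function(d) lm(Sepal.Length ~ Sepal.Width, data = d))

# Collect the coefficients into a matrix with one row per group.
coefs <- t(sapply(models, coef))
print(coefs)

# A full summary for any single group is still available by name.
summary(models[["setosa"]])
```

Because the list is named by group, individual models can be pulled out with `models[["setosa"]]` instead of by position.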

If you're looking for more efficient solutions (e.g. you have big data), you could implement a nest-map-unnest workflow. My personal preference for this is to rely on the broom, purrr, and dplyr packages, which are part of the tidyverse. You can inspect some code from this vignette. Other solutions are of course possible.
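One possible shape of a nest-map-unnest workflow is sketched below. This is not the code from the vignette mentioned above. It assumes the dplyr, tidyr, purrr, and broom packages are installed, and it again uses `iris` with `Species` as a stand-in grouping column and a placeholder formula.

```r
library(dplyr)
library(tidyr)
library(purrr)
library(broom)

results <- iris %>%
  group_by(Species) %>%   # Species stands in for the socio-economic level column
  nest() %>%              # one row per group, with the subset stored in list-column `data`
  ungroup() %>%
  mutate(
    model  = map(data, ~ lm(Sepal.Length ~ Sepal.Width, data = .x)),  # fit per group
    tidied = map(model, tidy)                                         # coefficients as a data frame
  ) %>%
  select(Species, tidied) %>%
  unnest(tidied)          # back to a flat table: one row per group and term

print(results)
```

The result is a single data frame with the grouping column next to each term's estimate, standard error, and p-value, which is convenient for comparing the groups side by side.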
