Combining/aggregating data in R

Question

I feel like this is a really simple question, and I've looked in a lot of places to try to find an answer to it, but everything I find seems to be doing a lot more than what I want.

I have a dataset that has multiple observations from multiple participants. One of the factors is where they're from (e.g. Buckinghamshire, Sussex, London). I want to combine everything that isn't London so that I have two categories: London and notLondon. How would I do this? I'd then want to be able to run an lm on these two, so how would I edit my dataset so that I could do lm(fom ~ [other factor]) where it would use the combined category?

Also, how would I combine all observations from each respective participant for a category? e.g. I have a category that's birth year, but currently when I do a summary of my data it will say, for example, 1996:265, because there are 265 observations from people born in '96. But I just want it to tell me how many participants were born in 1996.

Thanks!

Welcome to SO! For the second part (and if your data is not too big), I'd suggest checking out [`dplyr`](https://dplyr.tidyverse.org/), especially the `summarise` function — starja, Feb 10 '22 at 17:04
Can you clarify what you are trying to do with the linear regression? What is the `y` variable you're trying to predict and what are the predictor(s) you're hoping to use? It will help if you can share some of your data using `dput()`. See [here](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) for guidance on how to share reproducible examples to get the best help. — Dan Adams, Feb 10 '22 at 17:15
Please provide enough code so others can better understand or reproduce the problem. — Community, Feb 18 '22 at 11:09

score 0 · Answer 1 · answered Feb 10 '22 at 17:11

There are multiple parts to your question so let's take it step by step.

1. For the first part, this is a great use of forcats::fct_collapse() (forcats is loaded as part of the tidyverse). See the example here:

library(tidyverse)

set.seed(1)
d <- sample(letters[1:5], 20, replace = TRUE) %>% factor()

# original distribution
table(d)
#> d
#> a b c d e 
#> 6 4 3 1 6

# lumped distribution
fct_collapse(d, a = "a", other_level = "other") %>% table()
#> .
#>     a other 
#>     6    14

Created on 2022-02-10 by the reprex package (v2.0.1)
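Applied to a data frame column, the same idea might look like the sketch below. The data frame df and its columns participant and region are made up for illustration, and forcats::fct_other() is used here as a close relative of fct_collapse() that keeps the named levels and lumps the rest.

```r
library(forcats)

# Hypothetical data: several observations per participant
df <- data.frame(
  participant = c("p1", "p1", "p2", "p2", "p3", "p4"),
  region = factor(c("London", "London", "Sussex", "Sussex",
                    "Buckinghamshire", "London"))
)

# Keep "London" as its own level; lump every other level into "notLondon"
df$region2 <- fct_other(df$region, keep = "London", other_level = "notLondon")

# Check the new two-level distribution
table(df$region2)
```

If you'd rather avoid the extra package, factor(ifelse(df$region == "London", "London", "notLondon")) should give an equivalent two-level factor in base R.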

2. For the second part, you will have to clarify and share some data to get more help.
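In the meantime, here is a rough sketch of what the model call could look like once you have a two-level factor. Everything in it is an assumption: score is a made-up numeric outcome standing in for whatever you actually want to predict, and the data are simulated.

```r
set.seed(42)

# Hypothetical data: 'score' is a made-up numeric outcome
df <- data.frame(
  region = c("London", "Sussex", "Buckinghamshire", "London",
             "Sussex", "London", "Buckinghamshire", "London"),
  score = rnorm(8)
)

# Two-level factor: London vs. everything else
df$region2 <- factor(
  ifelse(df$region == "London", "London", "notLondon"),
  levels = c("London", "notLondon")
)

# Fit the linear model with the combined category as the predictor
fit <- lm(score ~ region2, data = df)
summary(fit)
```

With London as the first level, the region2notLondon coefficient is the estimated difference between notLondon and London.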

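3. Here you can use dplyr::summarize(), but you need to share some data to get an answer for your specific case.

However, something like:

df %>% group_by(birth_year) %>% summarize(n = n_distinct(participant))

will give you the number of participants with that birth year, assuming your data has a column identifying each participant (called participant here; substitute your own column name). Note that summarize(n = n()) would count observations instead, which is the 265 you are already seeing.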
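To see the difference between counting rows and counting participants, here is a minimal sketch on made-up data. The column names participant and birth_year are assumptions, so swap in whatever your dataset uses.

```r
library(dplyr)

# Hypothetical data: p1 has three observations, p2 has two, p3 has one
df <- data.frame(
  participant = c("p1", "p1", "p1", "p2", "p2", "p3"),
  birth_year = c(1996, 1996, 1996, 1996, 1996, 1997)
)

# Counts rows, i.e. observations per birth year
obs_per_year <- df %>% count(birth_year)

# Counts participants: reduce to one row per participant first
ppl_per_year <- df %>%
  distinct(participant, birth_year) %>%
  count(birth_year)

obs_per_year
ppl_per_year
```

Reducing to one row per participant with distinct() before counting is what stops repeated observations from inflating the totals.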
