
I feel like this is a really simple question, and I've looked in a lot of places to try to find an answer to it, but everything I find seems to do a lot more than what I want.

I have a dataset that has multiple observations from multiple participants. One of the factors is where they're from (e.g. Buckinghamshire, Sussex, London). I want to combine everything that isn't London so I have two categories, London and notLondon. How would I do this? I'd then want to be able to run a lm on these two, so how would I edit my dataset so that I could do lm(fom ~ [other factor]) where it would be the combined category?

Also, how would I combine all observations from each respective participant for a category? e.g. I have a category that's birth year, but currently when I do a summary of my data it will say, for example, 1996:265, because there are 265 observations from people born in '96. But I just want it to tell me how many participants were born in 1996.

Thanks!

  • Welcome to SO! For the second part (and if your data is not too big), I'd suggest checking out [`dplyr`](https://dplyr.tidyverse.org/), especially the `summarise` function – starja Feb 10 '22 at 17:04
  • Can you clarify what you are trying to do with the linear regression? What is the `y` variable you're trying to predict and what are the predictor(s) you're hoping to use? It will help if you can share some of your data using `dput()`. See [here](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) for guidance on how to share reproducible examples to get the best help. – Dan Adams Feb 10 '22 at 17:15
  • Please provide enough code so others can better understand or reproduce the problem. – Community Feb 18 '22 at 11:09

1 Answer


There are multiple parts to your question so let's take it step by step.

1.

For the first part, this is a great use of forcats::fct_collapse() (forcats is loaded as part of the tidyverse). See the example here:

library(tidyverse)

set.seed(1)
d <- sample(letters[1:5], 20, replace = TRUE) %>% factor()

# original distribution
table(d)
#> d
#> a b c d e 
#> 6 4 3 1 6

# lumped distribution
fct_collapse(d, a = "a", other_level = "other") %>% table()
#> .
#>     a other 
#>     6    14

Created on 2022-02-10 by the reprex package (v2.0.1)
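
Applied to a data frame column, the same idea might look like the sketch below. The data and the column names (`participant`, `region`, `region2`) are made up for illustration, since the question does not include any data, so treat this as a starting point and not as a fit for the real dataset.

```r
library(forcats)

# Hypothetical data: three observations for each of six participants.
# The column names `participant`, `region`, and `region2` are placeholders.
df <- data.frame(
  participant = rep(1:6, each = 3),
  region = factor(rep(
    c("London", "Sussex", "Buckinghamshire", "London", "Sussex", "London"),
    each = 3
  ))
)

# Keep "London" as its own level and lump every other level into "notLondon"
df$region2 <- fct_collapse(df$region, London = "London", other_level = "notLondon")

# Check the recoding: cross-tabulate the original levels against the collapsed ones
table(df$region, df$region2)
```

Storing the result in a new column keeps the original levels available, which makes it easy to check the recoding with the cross-tabulation.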


2.

For the second part, you will have to clarify and share some data to get more help.
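
In the meantime, here is a rough sketch of what a model with a two-level factor as predictor could look like. Everything here is assumed: the response `score` is simulated, and `region2` stands in for the collapsed London/notLondon factor. With R's default treatment contrasts, the intercept is the mean of the first level and the remaining coefficient is the difference between the two group means.

```r
# Hypothetical data: `score` and `region2` are assumed names, values are simulated
set.seed(1)
df <- data.frame(
  region2 = factor(rep(c("London", "notLondon"), times = c(8, 12))),
  score   = rnorm(20)
)

# Linear model with the two-level factor as the only predictor
fit <- lm(score ~ region2, data = df)
coef(fit)

# Compare the coefficient with the difference in group means
group_means <- tapply(df$score, df$region2, mean)
diff_means  <- unname(group_means["notLondon"] - group_means["London"])
all.equal(unname(coef(fit)["region2notLondon"]), diff_means)
```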


3.

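Here you can use dplyr::summarize(), but you need to share some data to get an answer for your specific case.

However, something like:

df %>% group_by(birth_year) %>% summarize(n = n_distinct(participant_id))

will give you the number of distinct participants with each birth year, assuming your data has a column that identifies each participant (participant_id is a placeholder name here). Note that summarize(n = n()) would count observations (rows) instead, which is what summary() already reports.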
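
A small sketch contrasting row counts with participant counts may help. The data and the column name `participant_id` are invented for illustration. It assumes the data has one row per observation and some column that identifies the participant.

```r
library(dplyr)

# Hypothetical long-format data: one row per observation.
# `participant_id` is an assumed column name.
df <- data.frame(
  participant_id = c(1, 1, 1, 2, 2, 3, 3, 3, 3),
  birth_year     = c(1996, 1996, 1996, 1996, 1996, 1998, 1998, 1998, 1998)
)

# Observations (rows) per birth year
obs_per_year <- df %>%
  count(birth_year, name = "n_observations")

# Participants per birth year: reduce to one row per participant first
participants_per_year <- df %>%
  distinct(participant_id, birth_year) %>%
  count(birth_year, name = "n_participants")

obs_per_year
participants_per_year
```

Reducing to one row per participant with distinct() before counting is what separates the two results. The same reduced table can be reused for any other per-participant summary.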

Dan Adams