This is my first time asking a question here, so sorry if I'm not clear enough.

Here's my data:

df <- data.frame(Year=c("2018","2018","2019","2019","2018","2018","2019","2019"),Area=c("CF","CF","CF","CF","NY","NY","NY","NY"), Birth=c(1000,1100,1100,1000,2000,2100,2100,2000),Gender= c("F","M","F","M","F","M","F","M"))
df

#    Year Area Birth Gender
# 1  2018  CF  1000    F
# 2  2018  CF  1100    M
# 3  2019  CF  1100    F
# 4  2019  CF  1000    M
# 5  2018  NY  2000    F
# 6  2018  NY  2100    M
# 7  2019  NY  2100    F
# 8  2019  NY  2000    M

where Birth is the number of babies born.

What I want to do is create a classification model that predicts how likely a newborn baby is to be male or female, with Area and Year as predictors.

Yes, I know the usual approach would be a linear regression with Birth as Y and the other columns as X, but this is the situation I find myself in.

With the data as given, I already know the result: 50% of the observations (rows) are male and 50% are female. What I want to know is the probability of a baby being male or female, not whether a given observation (row) is male or female, which I already know.

Is there a way to make each birth an observation, which would give 1000+1100+1100+1000+2000+2100+2100+2000=12400 rows of data? For example, one observation would be a female baby born in 2018 in CF, another would be a male baby born in 2018 in CF, and so on for all 12400.

Or is there any other suggestion for dealing with this?

Zi Tee

3 Answers

We may use uncount() from tidyr, which repeats each row Birth times and drops the Birth column:

library(dplyr)
library(tidyr)
df %>% 
    uncount(Birth) %>%
    as_tibble

Output:

# A tibble: 12,400 x 3
   Year  Area  Gender
   <chr> <chr> <chr> 
 1 2018  CF    F     
 2 2018  CF    F     
 3 2018  CF    F     
 4 2018  CF    F     
 5 2018  CF    F     
 6 2018  CF    F     
 7 2018  CF    F     
 8 2018  CF    F     
 9 2018  CF    F     
10 2018  CF    F     
# … with 12,390 more rows

Or using base R (here Birth is kept and becomes a running index within each original row):

transform(df[rep(seq_len(nrow(df)), df$Birth),], Birth = sequence(df$Birth))
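
As a quick sanity check on the row-repetition idea, here is a minimal base R sketch (not part of the original answer; the names expanded, counts and merged are made up for illustration). It expands the data to one row per baby, confirms the row count matches sum(df$Birth), and then collapses the result back to counts to compare with the original table.

```r
df <- data.frame(
  Year   = c("2018", "2018", "2019", "2019", "2018", "2018", "2019", "2019"),
  Area   = c("CF", "CF", "CF", "CF", "NY", "NY", "NY", "NY"),
  Birth  = c(1000, 1100, 1100, 1000, 2000, 2100, 2100, 2000),
  Gender = c("F", "M", "F", "M", "F", "M", "F", "M")
)

# One row per baby: repeat each row index Birth times, keep only the labels
expanded <- df[rep(seq_len(nrow(df)), df$Birth), c("Year", "Area", "Gender")]
rownames(expanded) <- NULL

# The expanded data should have as many rows as there were births
nrow(expanded) == sum(df$Birth)

# Collapse back to one count per Year/Area/Gender combination
counts <- aggregate(n ~ Year + Area + Gender,
                    data = transform(expanded, n = 1), FUN = sum)

# Join on the shared columns and compare with the original Birth column
merged <- merge(df, counts)
all(merged$Birth == merged$n)
```

If both comparisons return TRUE, the expansion has neither lost nor duplicated any births.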
akrun

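You could use dplyr and summarize (note that from dplyr 1.1.0 onwards, returning more than one row per group from summarize() is deprecated in favour of reframe()):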

library(tidyverse)

df_expanded <- df %>% 
  group_by(Year, Area, Gender) %>% 
  summarize(expanded = 1:Birth)

# A tibble: 12,400 x 4
# Groups:   Year, Area, Gender [8]
   Year  Area  Gender expanded
   <chr> <chr> <chr>     <int>
 1 2018  CF    F             1
 2 2018  CF    F             2
 3 2018  CF    F             3
 4 2018  CF    F             4
 5 2018  CF    F             5
 6 2018  CF    F             6
 7 2018  CF    F             7
 8 2018  CF    F             8
 9 2018  CF    F             9
10 2018  CF    F            10
# … with 12,390 more rows
jdobres

uncount() is without a doubt the best solution for this problem. One alternative to the solutions shown could be:

library(dplyr)
library(tidyr)

df %>% 
  mutate(Birth = lapply(Birth, seq_len)) %>% 
  unnest(Birth)

This returns

# A tibble: 12,400 x 4
   Year  Area  Birth Gender
   <chr> <chr> <int> <chr> 
 1 2018  CF        1 F     
 2 2018  CF        2 F     
 3 2018  CF        3 F     
 4 2018  CF        4 F     
 5 2018  CF        5 F     
 6 2018  CF        6 F     
 7 2018  CF        7 F     
 8 2018  CF        8 F     
 9 2018  CF        9 F     
10 2018  CF       10 F     
# ... with 12,390 more rows
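
Once the data are in one-row-per-baby form, the classification model from the question could be fit along these lines. This is only a sketch under assumptions: the question does not name a model, so logistic regression via glm() is picked here for illustration, and is_male, fit_expanded, fit_weighted, newdata and p_male are made-up names. It also shows that the same coefficients can be obtained without expanding at all, by passing Birth as frequency weights.

```r
df <- data.frame(
  Year   = c("2018", "2018", "2019", "2019", "2018", "2018", "2019", "2019"),
  Area   = c("CF", "CF", "CF", "CF", "NY", "NY", "NY", "NY"),
  Birth  = c(1000, 1100, 1100, 1000, 2000, 2100, 2100, 2000),
  Gender = c("F", "M", "F", "M", "F", "M", "F", "M")
)

# Assumption: code the outcome as 1 = male, 0 = female
df$is_male <- as.integer(df$Gender == "M")

# (a) Logistic regression on the expanded data, one row per baby
expanded <- df[rep(seq_len(nrow(df)), df$Birth), ]
fit_expanded <- glm(is_male ~ Year + Area, family = binomial, data = expanded)

# (b) Same model on the original 8 rows, with Birth as frequency weights
fit_weighted <- glm(is_male ~ Year + Area, family = binomial,
                    data = df, weights = Birth)

# The two fits should agree on the coefficients
all.equal(coef(fit_expanded), coef(fit_weighted), tolerance = 1e-6)

# Predicted probability of a male baby for each Year/Area combination
newdata <- unique(df[, c("Year", "Area")])
newdata$p_male <- predict(fit_weighted, newdata = newdata, type = "response")
newdata
```

The weighted version is cheaper when the counts are large, since it never materialises the 12,400 rows.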
Martin Gal