First time asking a question here, so sorry if I'm not clear enough.
Here's my data:
df <- data.frame(
  Year   = c("2018","2018","2019","2019","2018","2018","2019","2019"),
  Area   = c("CF","CF","CF","CF","NY","NY","NY","NY"),
  Birth  = c(1000,1100,1100,1000,2000,2100,2100,2000),
  Gender = c("F","M","F","M","F","M","F","M")
)
df
# Year Area Birth Gender
# 1 2018 CF 1000 F
# 2 2018 CF 1100 M
# 3 2019 CF 1100 F
# 4 2019 CF 1000 M
# 5 2018 NY 2000 F
# 6 2018 NY 2100 M
# 7 2019 NY 2100 F
# 8 2019 NY 2000 M
where Birth is the number of babies born.
What I want to do is create a classification model that predicts how likely a newborn baby is to be male or female, with Area and Year as predictors.
Yes, I know this data would normally call for a linear regression with Birth as Y and the other columns as X, but I have somehow ended up in this situation.
With the data as given, I already know the result at the row level: 50% of the observations are male and 50% are female. What I want to know is the probability of a baby being male or female, not which observation (row) is male or female, which I already know.
Is there a way to make each birth an observation, which would give 1000+1100+1100+1000+2000+2100+2100+2000 = 12400 rows of data? It would look something like this: one observation is a female baby born in 2018 in CF, another is a male baby born in 2018 in CF, and so on, for 12400 observations in total.
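To make the idea concrete, here is a minimal sketch of the expansion I have in mind, assuming base R's rep() is a reasonable way to repeat each row Birth times. The names idx and long are just placeholders I made up for illustration.

```r
df <- data.frame(
  Year   = c("2018","2018","2019","2019","2018","2018","2019","2019"),
  Area   = c("CF","CF","CF","CF","NY","NY","NY","NY"),
  Birth  = c(1000,1100,1100,1000,2000,2100,2100,2000),
  Gender = c("F","M","F","M","F","M","F","M")
)

# Repeat each row index as many times as its Birth count
idx <- rep(seq_len(nrow(df)), times = df$Birth)

# One row per baby; Birth is dropped because each row now counts as one birth
long <- df[idx, c("Year", "Area", "Gender")]
rownames(long) <- NULL

nrow(long)
table(long$Year, long$Area, long$Gender)
```

The table() call at the end is only there to check that the per-group counts in the expanded data match the original Birth column.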
Or are there any other suggestions for dealing with this?
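One possible alternative, sketched here as an assumption rather than something I have verified, is to keep the 8 rows and pass Birth as frequency weights to a logistic regression via glm(). I would expect this to give the same coefficient estimates as fitting on the 12400 expanded rows. The names fit, newdata and p_male are made up for illustration.

```r
df <- data.frame(
  Year   = c("2018","2018","2019","2019","2018","2018","2019","2019"),
  Area   = c("CF","CF","CF","CF","NY","NY","NY","NY"),
  Birth  = c(1000,1100,1100,1000,2000,2100,2100,2000),
  Gender = c("F","M","F","M","F","M","F","M")
)

# With a factor response, glm() treats the first level ("F") as failure,
# so the model estimates the probability of "M"
df$Gender <- factor(df$Gender, levels = c("F", "M"))

# Each row counts Birth times in the likelihood
fit <- glm(Gender ~ Year + Area, family = binomial, weights = Birth, data = df)

# Predicted probability of a male baby for each Year/Area combination
newdata <- unique(df[, c("Year", "Area")])
newdata$p_male <- predict(fit, newdata = newdata, type = "response")
newdata
```

The probability of a female baby would then be 1 - p_male for the same Year and Area.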