I am tasked with manipulating data obtained from 1258 unique surveys.
In terms of dimensions, the data has 28 million individual observations (including NA) and 8 columns (variables). The object name is dat.
The column/variable I am particularly interested in is education (edu). I want to get the number of NA and non-NA values (for edu) for each study by aggregating (dat$edu ~ id_study).
So far, I have used this code to work out the number of studies which contain at least one entry on edu:
numbers <- aggregate(dat$edu ~ dat$id_study, data=dat, FUN=length)
I have the result I need for quantifying the numbers of unique id_study that have data on edu. This ticks box one.
Now I need to do the same for the unique id_study values that have nothing at all on education. How do I do this?
I've tried many variations of code to count the NAs for studies that do not have anything on edu, for example:
aggregate_2 <- aggregate(dat$edu ~ id_study, data=dat, FUN=length(dat[!is.na(dat)]))
This does not work :(
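For context, here is a minimal sketch of two things that may be going wrong in the attempt above. It assumes a small toy data frame (dat_toy, with made-up values) in place of dat. As far as I understand, the formula interface of aggregate() drops rows where edu is NA before grouping (its default na.action is na.omit), and FUN has to be a function rather than the result of calling one.

```r
# Toy stand-in for dat (hypothetical values); only id_study and edu matter here.
dat_toy <- data.frame(
  id_study = c("ALB_2013", "ALB_2013", "ALB_2014", "ALB_2014", "BRA_2010"),
  edu      = c(1, NA, NA, NA, 3),
  stringsAsFactors = FALSE
)

# Formula interface: rows with NA in edu are removed before grouping,
# so ALB_2014 (all NA) does not appear in the result at all.
with_edu <- aggregate(edu ~ id_study, data = dat_toy, FUN = length)

# Passing the column and the grouping explicitly keeps every study.
# FUN is a function that returns two counts per study.
counts <- aggregate(
  dat_toy["edu"],
  by  = dat_toy["id_study"],
  FUN = function(x) c(n_na = sum(is.na(x)), n_non_na = sum(!is.na(x)))
)

# aggregate() stores a vector-valued FUN result as a matrix column;
# this flattens it into the columns edu.n_na and edu.n_non_na.
counts <- do.call(data.frame, counts)
```

On 28 million rows aggregate() may be slow, so this is only meant to show the behaviour, not a tuned solution.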
Can anyone shed some light on this matter please?
Thank you.
EDIT: Just to clarify, in case my question was not clear: there are 1258 unique surveys/studies (some surveys may be for multiple years, e.g. ALB_2013 and ALB_2014 under id_study).
Using the first aggregate() call above, I worked out that 530 of these 1258 surveys provided at least one individual observation in the edu column.
This must mean that 728 unique surveys did not provide any information at all under edu. I want to get the names of those 728 surveys and, ideally using a function, the number of NAs per survey for the surveys that provided no information at all.
I hope this makes sense.
Column names: id_study (name of the survey) and id (survey id). The column I'm interested in is "edu".
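To make the goal in the edit concrete, here is a rough sketch of the result I am after, written against a toy data frame (dat_toy, with made-up values) rather than the real dat. It uses tapply() and table(), which is only one possible approach; the variable names are placeholders.

```r
# Toy stand-in for dat (hypothetical values).
dat_toy <- data.frame(
  id_study = c("ALB_2013", "ALB_2013", "ALB_2014", "ALB_2014", "BRA_2010"),
  edu      = c(1, NA, NA, NA, 3),
  stringsAsFactors = FALSE
)

# Number of non-missing edu values per study; every id_study is kept.
n_non_na <- tapply(!is.na(dat_toy$edu), dat_toy$id_study, sum)

# Studies with zero non-missing values have no edu information at all.
no_edu_studies <- names(n_non_na)[n_non_na == 0]

# In those studies every row is NA, so the NA count equals the row count.
na_per_study <- table(dat_toy$id_study[dat_toy$id_study %in% no_edu_studies])
```

On the real data, length(no_edu_studies) would be expected to come out as 728 if the count of 530 studies with edu data is right.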