1

I am tasked with manipulating data obtained from 1258 unique surveys.

In terms of dimensions. 28 million individual observations (including NA) -8 columns (variables). object name : dat

The column/variable I am particularly interested in is education (edu). I want to get the length of NA and Non-NA values (for edu) for those studies by aggregating (data$edu ~ id_study)

This are the first five entries for the 8 columns, I want to keep id_study

So far, I have used this code to work out the number of studies which contain at least 1 or more entries on edu.

numbers <- aggregate(dat$edu ~ dat$id_study, data=dat, FUN=length)

result from aggregate

I have the result I need for quantifying the numbers of unique id_study that have data on edu. This ticks box one.

Now I need to do the same for the unique id_study that have nothing at all on education. How do i do this?

I've tried so many codes to work out the length of NAs for studies that do not have anything on edu.

aggregate_2 <- aggregate(dat$edu ~ id_study, data=dat, FUN=length(dat[!is.na(dat)]))

this does not work :(

Can anyone shed some light on this matter please?

thank you

EDIT ****** Just to clarify if i was not clear in my question. There are 1258 unique surveys/studies,(some surveys may be for multiple years, e.g ALB_2013 and ALB_2014 under id_study).

Out of these surveys, using equation 1 code and the code i put in the description, code 1, I worked out that 530 of these 1258 surveys provided >=1 individual observation under the edu column.

This must mean 728 Unique surveys did not provide any information at all under the edu. I want to work out the names of the 728 surveys and using a function, hopefully want to work out the length of NAs per survey which didn't provide any information at all.

I hope this makes sense.

id_study (name of the survey) id (survey id) column i'm interested in is "edu".

Shivy b
  • 13
  • 3
  • 1
    Welcome to StackOverflow! Please read the info about [how to ask a good question](http://stackoverflow.com/help/how-to-ask) and how to give a [reproducible example](http://stackoverflow.com/questions/5963269). This will make it much easier for others to help you. – Sotos May 11 '18 at 09:49
  • I see two things: missing dat$ in front of id_study and the FUN argument is a function, you probably want to instead say `FUN = (function(x) {sum(!is.na(x))})` – Robin Gertenbach May 11 '18 at 10:45
  • @RobinGertenbach You actually don't need (and shouldn't use) any of the `dat$` prefixes inside `aggregate` as long as you provide a `data` argument. So `aggregate(edu ~ id_study, data = dat, FUN = length)` is fine. – Maurits Evers May 11 '18 at 10:59
  • oh yeah, I missed the data argument, thanks for pointing that out – Robin Gertenbach May 11 '18 at 11:06
  • Hey guys thank you both for the response, so which is it? – Shivy b May 11 '18 at 12:31
  • numbers <- aggregate(edu ~ id_study, data = dat, FUN = (function(dat) {sum(!is.na(dat))}) ? – Shivy b May 11 '18 at 12:31
  • @RobinGertenbach should this be the final code? – Shivy b May 11 '18 at 12:33
  • @MauritsEvers ? – Shivy b May 11 '18 at 12:33

1 Answers1

0

First off: Posting a screenshot of your data is bad practice, as it would require SO respondents to manually type in your sample data. Use dput to post (part of) your data. For future questions, follow the advice and links in Sotos' first comment!

That aside, how about the following:

numbers <- aggregate(
    edu ~ id_study, 
    data = dat, 
    FUN = function(x) c(n_nonNA = sum(!is.na(x)), n_NA = sum(is.na(x))))
Maurits Evers
  • 49,617
  • 4
  • 47
  • 68
  • Sorry re screenshotting, this is the second time i've used this forum before and wasn't aware of the normal conventions. Thank you for letting me know. Secondly, the first equation I put in the initial post did the same thing as your equation above. However,,, What I would like to do is do the opposite; aggregate and return the number/length of unique id_study which have nothing at all on the edu variable. Many thanks – Shivy b May 11 '18 at 12:57
  • @Shivyb I'm not sure I understand; my code calculates the number of non-`NA` and `NA` rows. That seems to be consistent with your post *"I want to get the length of NA and Non-NA values (for edu)"*. If not, please edit your question to include sample data (using e.g. `dput`) and expected output. – Maurits Evers May 11 '18 at 13:14
  • Just to clarify if i was not clear in my question. There are 1258 unique surveys,(some surveys may be for multiple years, e.g ALB_2013 and ALB_2014 under id_study) and 28 million individuals who have taken part in this survey. Out of these surveys, using your code and the code i put in the description, code [1], we worked out that 530 of these 1258 surveys provided >=1 individual observation under the edu column. This must mean 728 Unique surveys did not provide any information at all. I want to work out the names of the 728 using aggregate and the length of NA values under education for these – Shivy b May 11 '18 at 14:39
  • We were able to do this for those surveys that provided more than one observation, we should be able to do this for those that provided nothing at all many thanks – Shivy b May 11 '18 at 14:40
  • @Shivyb You *must* edit your question to include sample data and expected output! Don't post critical information in comments. Review the information in Sotos comment! I have no idea what you mean. – Maurits Evers May 11 '18 at 15:02