How to create a new data set based on rows of one variable from an existing dataset while each row has multiple observations

Question

I have a dataset with the following structure:

Variable "Class" = 1, ..., 50. Each class has multiple observations, ranging from 2000 (number of observations in class 1) to 200 (number of observations in class 50). The variables Age, Sex, and HIV are recorded for each individual in each class.

What I have to do is create a new dataset from this original one in which each row corresponds to one value of the variable "Class" (so 50 rows, instead of the roughly 10000 rows in the original dataset), with the variables listed above.

I'm new to R, so I'm not sure how I can squeeze(?!) the data so that, for example, row 1 represents class 1 but carries the Age, Sex, and HIV information for its 2000 individuals!

I need this new dataset because I am writing a function (a glm), and the data source for that function should not be the original data; it should be based on classes! But the predictions of this glm will be at the individual level (on the original data)!

Can anyone kindly give me a hand or a hint on this?

Here is a mini-scale version of what the data looks like:

library(simstudy)

Class <- defData(varname = "Class", dist = "categorical", formula = "0.8;0.2", id="Class1")

Class <- defData(Class, varname = "Classic", dist = "categorical", formula = "0.8;0.2")

Class <- defData(Class, varname = "clustersize",dist = "normal", formula = "5", variance = 0)

d1 <- genData(1, Class)
d1

dF1 <- genCluster(d1, cLevelVar = "Class", numIndsVar = "clustersize", level1ID = "Class1")
dF1

Class2<- defData(varname = "Class", dist = "categorical", formula = "0.3;0.2;0.1;0.3;0.1", id="Class1")

Class2 <- defData(Class2, varname = "Classic", dist = "categorical", formula = "0.3;0.2;0.1;0.3;0.1")

Class2 <- defData(Class2, varname = "clustersize",dist = "noZeroPoisson", formula = "3")


d2 <- genData(3, Class2)
d2

dF2 <- genCluster(d2, cLevelVar = "Class", numIndsVar = "clustersize", level1ID = "Class1")
dF2

d<-rbind(dF1,dF2)

v <- defDataAdd( varname = "Age", dist = "normal", formula = "20", variance = 10)

v <- defDataAdd(v, varname = "Sex", dist = "binary", formula = "0.4", link = "logit")

v <- defDataAdd(v, varname = "HIV", dist = "binary", formula = "0.7", link = "logit")

d <- addColumns(v, d)

Y<- defDataAdd( varname = "Y", dist = "binary", formula = "0.1*Age+0.2*Sex+0.5*HIV", link = "logit")

d <- addColumns(Y, d)

d

Let's put it this way. "d" is the original dataset I have, with 16 rows (individuals) according to the code I gave. Now I want to model Y by Age, Sex, and HIV, but the data this model should use is not "d"; it has to be a new dataset extracted from "d" so that I end up with 3 rows (because I have 3 classes). So my confusion is how to do that (create a new dataset from d) when I have 11 individuals in class 1, 2 individuals in class 2, and 3 individuals in class 3. I will then fit the model on this new dataset and predict on the original dataset "d".

Welcome, hela. Can you add what you have already tried? Try to create an MWE. – mharinga Apr 25 '21 at 17:35
@mharinga Hi! Yes, I'm going to give a mini-scale version of the data I am referring to as a separate answer. However, I haven't tried any code yet because I have no idea how or what I should use! – hela Apr 25 '21 at 17:55
Perfect. You can include your reproducible example by editing your question. – mharinga Apr 25 '21 at 17:58
As @mharinga was saying, please add a reproducible example. Not sure what that is? Read about how to make one here: https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example – Kat Apr 25 '21 at 18:21
@mharinga Done! :) Thanks! I'm new here, so I wasn't sure if there was a specific area to put the code! I just wrote it in the main text. – hela Apr 25 '21 at 18:24
@hela You can include code between three backticks. So start your code with ``` and end your code with ```. – mharinga Apr 25 '21 at 18:32
See the link given by @Kat for how to include a dataset and how to create a reproducible example. – mharinga Apr 25 '21 at 18:36
I get an error for: dF2 <- genCluster(d2, cLevelVar = "Class", numIndsVar = "clustersize", level1ID = "Class1") – mharinga Apr 26 '21 at 06:51
@mharinga Ignore it. It is just an example so that the final d gives you an idea of how the data looks. Also, it goes away if you re-run the code a couple of times: because the values are generated at random, some may end up the same on multiple rows, but that doesn't matter now. I just want you to see how the data looks. – hela Apr 26 '21 at 12:44

score 0 · Answer 1 · answered Apr 25 '21 at 19:19

Thanks for updating the question. However, I can't reproduce your example; the code gives an error. If you would like to estimate a GLM, you can first create factors and then fit the GLM. It is not clear to me what you mean by classes.

Let's say you have the mtcars data and would like to model cyl by vs and gear. You can first convert vs and gear to factors and then use the new data in a glm.

library(dplyr)

# Change vs and gear to factors
mtcars1 <- mtcars %>%
  mutate(across(c(vs,gear), as.factor))

Compare the following two:

glm(cyl ~ vs + gear, data = mtcars1)
glm(cyl ~ vs + gear, data = mtcars)

The first one uses factors and the second one numerical values.
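To make the difference concrete, here is a minimal sketch that compares the coefficient names of the two fits. It uses base R's transform() instead of dplyr so that it runs without extra packages, and the object names fit_factor and fit_numeric are introduced here purely for illustration.

```r
# Convert vs and gear to factors (base R equivalent of the mutate/across call)
mtcars1 <- transform(mtcars, vs = factor(vs), gear = factor(gear))

# Same formula, fitted once on the factor version and once on the numeric one
fit_factor  <- glm(cyl ~ vs + gear, data = mtcars1)
fit_numeric <- glm(cyl ~ vs + gear, data = mtcars)

# Factor predictors get one coefficient per non-reference level
names(coef(fit_factor))   # "(Intercept)" "vs1" "gear4" "gear5"

# Numeric predictors get a single slope each
names(coef(fit_numeric))  # "(Intercept)" "vs" "gear"
```

With factors, glm estimates a separate effect for each level relative to the reference level (vs = 0, gear = 3); with numeric columns it assumes one linear slope per variable.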

mharinga

Thanks! But that is not exactly what I was confused about. I edited the code once more; it should work now. By classes, maybe I could say "clusters". Classes are basically a variable in this code, and we would have 3 classes, each with several individuals. – hela Apr 25 '21 at 20:25
My question is basically this: how can I create a dataset with 3 rows (because we have 3 classes) and the variables Age, Sex, and HIV? My confusion is: how can we do that when each class has several individuals? – hela Apr 25 '21 at 20:30
Can you add the desired output to your question? – mharinga Apr 25 '21 at 20:56
Ok! Let's put it this way. "d" is the original dataset I have, with 16 rows (individuals) according to the code I gave. Now I want to model Y by Age, Sex, and HIV, but the data this model should use is not "d"; it has to be a new dataset extracted from "d" so that I end up with 3 rows (because I have 3 classes). So my confusion is how to do that (create a new dataset from d) when I have 11 individuals in class 1, 2 individuals in class 2, and 3 individuals in class 3. I will then fit the model on this new dataset and predict on the original dataset "d". – hela Apr 25 '21 at 21:09
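
One possible reading of this request is to collapse "d" to one row per class by summarising each variable within the class, fit the glm on that class-level table, and then predict on the individual rows. The sketch below is only an illustration under stated assumptions. The data frame is a hard-coded, hypothetical stand-in for "d" with the 11/2/3 class sizes mentioned above (not the simstudy output). Using the within-class mean as the summary is an assumption, since the thread never states which summary is wanted. The names by_class, n, fit, and pred are introduced here. Only Age is used as a predictor, because 3 rows cannot support an intercept plus three slopes; with 50 classes the full formula Y ~ Age + Sex + HIV could be tried.

```r
# Hypothetical stand-in for the asker's `d`: 16 individuals in 3 classes
d <- data.frame(
  Class = rep(1:3, times = c(11, 2, 3)),
  Age   = c(18, 22, 20, 25, 19, 21, 23, 17, 20, 24, 22, 30, 28, 35, 33, 37),
  Sex   = c(0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1),
  HIV   = c(1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1),
  Y     = c(1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 1, 0)
)

# One row per class: within-class mean of every variable (assumed summary)
by_class <- aggregate(cbind(Age, Sex, HIV, Y) ~ Class, data = d, FUN = mean)

# Class sizes, used as binomial weights because Y is now a proportion
by_class$n <- as.vector(table(d$Class))

# Fit on the 3-row class-level data
fit <- glm(Y ~ Age, family = binomial, weights = n, data = by_class)

# Predict on the original individual-level data
d$pred <- predict(fit, newdata = d, type = "response")

by_class
head(d)
```

Here by_class has 3 rows and d keeps its 16 rows, with a predicted probability attached to each individual. If the intent is instead to keep every individual's values inside the class row, a nested or list-column layout would be a different approach from the one sketched here.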
