R: How to modify a CreateTableOne() call for repeated observations and the wrong summary statistics?

Question

I'm trying to create a table from the following dataset, of which I'm reporting the first fifty observations here.


There are some typos in the age and gender variables, which I suggest fixing as follows:

colnames(d)[8] <- 'COND'
d$gender = ifelse(tolower(substr(d$gender,1,1)) == "k", "F", "M") 
library(libr)
d <- datastep(d, {
  if (is.na(age)) {
    age <- 21
  }}
)

I'm trying to create a summary table by using the following code:

CreateTableOne(
  vars = c('TASK', 'COND', 't1.key', 'T1.response', 'age', 'T1.ACC'), 
  strata = c('ID'),
  factorVars = c('gender'), 
  argsApprox = list(correct = FALSE), 
  smd = TRUE, 
  addOverall = TRUE, 
  test = TRUE) %>% 
  na.omit() %>% 
  kableone()

obtaining this table.

However, as you can see from the output, the number of females and males is incorrect: I have many observations for the same subject, while there are only 54 unique IDs.

length(unique(d$ID)) 
[1] 54

Does anyone know how to fix this? Furthermore, since 'age' and 'T1.ACC' are not normally distributed, does anyone know how I could report the median with Q1 and Q3 instead of the mean and SD, for example?

You need to put in more data because the ones you gave cause `CreateTableOne` to return errors. First check for yourself if the data you put in `CreateTableOne` have correct results. Only then can we try to help you. — Marek Fiołka, Sep 05 '21 at 14:11
Could you suggest code for sharing as many observations as possible? The dataset actually contains more than 40000. — 12666727b9, Sep 05 '21 at 18:50

Marek Fiołka · Accepted Answer · 2021-09-14T21:34:31.120

I would like to help you. However, there are the following problems with the data you provide:

The variable COND is missing
Only one unique value of the TASK variable (the CreateTableOne function does not accept variables with one unique value).
Only one unique value for the variable age.
The variable ID is repeated several times.

However, even without changing your data, you can see what your problem is. If you have data in this form, you cannot use CreateTableOne! This is because it counts every occurrence of the value m and every occurrence of the value k. And since you have multiple entries for one person, the CreateTableOne function will count each occurrence separately.

Please take a look at the solution I proposed in "How to describe unique values of grouped observations for several vars?".
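
To make the counting problem concrete, here is a minimal sketch on made-up data (the `toy` tibble and its values are hypothetical, not taken from your file). It assumes gender and age are constant within each ID and simply compares counting rows with counting subjects.

```r
library(dplyr)

# Hypothetical toy data: 3 subjects with several trials each
toy <- tibble(
  ID     = c("P1", "P1", "P1", "P2", "P2", "P3"),
  gender = c("k",  "k",  "k",  "m",  "m",  "k"),
  age    = c(19,   19,   19,   21,   21,   20)
)

# Counting rows counts trials, not people: k = 4, m = 2
per_row <- toy %>% count(gender)

# Keep one row per subject first, then count: k = 2, m = 1
per_subject <- toy %>% distinct(ID, gender, age) %>% count(gender)

per_row
per_subject
```

Any table built on the row-level data will report the first set of counts, which is why the numbers of females and males look wrong.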

Update 1

OK. Let's tackle your data. You have 54 patients with different IDs.

data_Confidence_in_Action %>% distinct(ID) %>% nrow()
#[1] 54

However, note that one ID appears to be incorrect.

data_Confidence_in_Action %>% distinct(ID) %>%
  mutate(lenID = str_length(ID)) %>% filter(lenID!=5)
#  A tibble: 1 x 2
#  ID         lenID
#  <chr>      <int>
#1 P1419 dots    10

We can leave it as it is; correct it yourself if you need to. However, note that you have as many as 8 different gender values. Be careful because in our country the gender ideology is not well received ;-)

data_Confidence_in_Action %>% distinct(gender)
#  A tibble: 8 x 1
#  gender     
#  <chr>      
#1 k          
#2 kobieta    
#3 M          
#4 K          
#5 m¦Ö+-czyzna
#6 21         
#7 m          
#8 M¦Ö+-czyzna

This, unfortunately, needs to be fixed. Patient P1440 has the age entered in the gender column. So what is the gender of P1440?

data_Confidence_in_Action %>% filter(gender==21) %>% distinct(ID, gender, age)
#  A tibble: 1 x 3
#  ID    gender   age
#  <chr> <chr>  <dbl>
#1 P1440 21        NA

data_Confidence_in_Action %>% distinct(ID, gender) %>% 
  group_by(gender) %>% summarise(n = n())
#  A tibble: 8 x 2
#  gender          n
#  <chr>       <int>
#1 21              1
#2 k              36
#3 K               3
#4 kobieta         9
#5 m               1
#6 M               1
#7 m¦Ö+-czyzna     2
#8 M¦Ö+-czyzna     1

As you can see, most of your subjects are women, so let's assume P1440 is a woman. Is that OK?

Finally, notice that two variables have inconvenient names: `Condition (whether a person responded)` and `Go/Nogo (whether a person should respond)`.

Let's fix it all in one go.

data_Confidence_in_Action = data_Confidence_in_Action %>% 
  mutate(
    gender = ifelse(str_detect(gender, "[k,K,21]"),"k","m"),
    age = ifelse(is.na(age), 21, age)
  ) %>% rename(Condition=`Condition (whether a person responded)`, 
               Go.Nogo = `Go/Nogo (whether a person should respond)`)
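
A note on the pattern: `"[k,K,21]"` is a character class, so it matches any string containing `k`, `K`, a comma, `1` or `2`. That happens to work for the eight values listed above, but a stricter variant can be sketched as follows. The vector `gender_raw` is a made-up stand-in (with `mezczyzna` written in plain ASCII), and treating `"21"` as a woman is the assumption made above for P1440.

```r
library(stringr)

# Hypothetical stand-in for the raw gender values
gender_raw <- c("k", "kobieta", "M", "K", "mezczyzna", "21", "m")

# TRUE if the value starts with k/K, or is exactly "21" (the P1440 entry)
is_female <- str_detect(gender_raw, regex("^k|^21$", ignore_case = TRUE))

gender_clean <- ifelse(is_female, "k", "m")
gender_clean
```

Anchoring the pattern this way avoids accidentally classifying a future value that merely contains a digit or a comma.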

Finally, let's convert some of the variables from chr to factor, taking care to keep sensible level orders. I hope I chose them wisely.

data_Confidence_in_Action = data_Confidence_in_Action %>% 
  mutate(
    ID = ID %>% fct_inorder(),
    gender = gender %>% fct_infreq(),
    t1.key = t1.key %>% fct_infreq(),
    Condition = Condition %>% fct_infreq(),
    CR.key = CR.key %>% fct_infreq(),
    TASK = TASK %>% fct_infreq(),
    Go.Nogo = Go.Nogo %>% fct_infreq(),
    difficulty = difficulty %>% factor(c("easy", "medium", "hard"))
  )
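
In case the three ways of ordering levels used above are unfamiliar, here is a small sketch on a made-up vector `x` (hypothetical values): `fct_inorder()` keeps the order of first appearance, `fct_infreq()` orders by decreasing frequency, and `factor()` with explicit levels imposes the order you give it.

```r
library(forcats)

# Hypothetical values, just to show how the level order is decided
x <- c("medium", "easy", "hard", "easy", "easy", "hard")

lv_inorder  <- levels(fct_inorder(x))   # order of first appearance
lv_infreq   <- levels(fct_infreq(x))    # most frequent level first
lv_explicit <- levels(factor(x, c("easy", "medium", "hard")))  # order given by hand

lv_inorder
lv_infreq
lv_explicit
```

The level order matters later because `CreateTableOne` reports two-level factors by their second level, as in `TASK = right (%)`.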

With the data organized this way, let's get to the heart of the problem: what do you really want to analyze? Note that for variables such as TASK, Condition, and t1.key, both valid values occur for every subject.

data_Confidence_in_Action %>% group_by(ID) %>% summarise(
  nunique.TASK = length(unique(TASK)),
  nunique.Condition = length(unique(Condition)),
  nunique.t1.key = length(unique(t1.key))
) %>% distinct(nunique.TASK, nunique.Condition, nunique.t1.key)
#  A tibble: 1 x 3
#  nunique.TASK nunique.Condition nunique.t1.key
#         <int>             <int>          <int>
#1            2                 2              2

However, if we look at the proportions in which the different values of these variables occur, they differ from patient to patient.

data_Confidence_in_Action %>% group_by(ID) %>% summarise(
  prop.TASK = sum(TASK=="left")/sum(TASK=="right")) %>% 
  distinct()

data_Confidence_in_Action %>% group_by(ID) %>% summarise(
  prop.Condition = sum(Condition=="NR")/sum(Condition=="R"))%>% 
  distinct()

data_Confidence_in_Action %>% group_by(ID) %>% summarise(
  prop.t1.key = sum(t1.key=="None")/sum(t1.key=="space"))%>% 
  distinct()

So please state clearly what you want to summarize and how, because it is not clear to me what result you expect.

Update 2

OK. I can see that you are beginning to understand something. Still, I don't know what you want to summarize. Look below. First, let's collect all the code to prepare the data:

library(tidyverse)
library(readxl)
library(tableone)
data_Confidence_in_Action <- read_excel("data_Confidence in Action.xlsx")

data_Confidence_in_Action = data_Confidence_in_Action %>%
  mutate(
    gender = ifelse(str_detect(gender, "[k,K,21]"),"k","m"),
    age = ifelse(is.na(age), 21, age)
  ) %>% rename(Condition=`Condition (whether a person responded)`,
               Go.Nogo = `Go/Nogo (whether a person should respond)`)

data_Confidence_in_Action = data_Confidence_in_Action %>%
  mutate(
    ID = ID %>% fct_inorder(),
    gender = gender %>% fct_infreq(),
    t1.key = t1.key %>% fct_infreq(),
    Condition = Condition %>% fct_infreq(),
    CR.key = CR.key %>% fct_infreq(),
    TASK = TASK %>% fct_infreq(),
    Go.Nogo = Go.Nogo %>% fct_infreq(),
    difficulty = difficulty %>% factor(c("easy", "medium", "hard"))
  )

And now the summary. If we do this:

CreateTableOne(
  data = data_Confidence_in_Action,
  vars = c('TASK', 'Condition', 't1.key', 'T1.response', 'age', 'T1.ACC'), 
  strata = 'gender',
  factorVars = c('TASK', 'Condition', 't1.key'), 
  argsApprox = list(correct = FALSE), 
  smd = TRUE, 
  addOverall = TRUE, 
  test = TRUE) %>% 
  kableone()

output

|                        |Overall      |k            |m            |p      |test |
|:-----------------------|:------------|:------------|:------------|:------|:----|
|n                       |41713        |37823        |3890         |       |     |
|TASK = right (%)        |20832 (49.9) |18889 (49.9) |1943 (49.9)  |0.992  |     |
|Condition = R (%)       |20033 (48.0) |18130 (47.9) |1903 (48.9)  |0.241  |     |
|t1.key = space (%)      |20033 (48.0) |18130 (47.9) |1903 (48.9)  |0.241  |     |
|T1.response (mean (SD)) |0.48 (0.50)  |0.48 (0.50)  |0.49 (0.50)  |0.241  |     |
|age (mean (SD))         |20.74 (2.67) |20.75 (2.70) |20.60 (2.33) |0.001  |     |
|T1.ACC (mean (SD))      |0.70 (0.46)  |0.70 (0.46)  |0.73 (0.45)  |<0.001 |     |

we get a summary over all observations, that is, n == 41713. Since there are many observations for each patient, such a summary is of little use. At least I think so. However, we can summarize for a few selected patients.

CreateTableOne(
  data = data_Confidence_in_Action %>% 
    filter(ID %in% c('P1323', 'P1403', 'P1404')) %>% 
    mutate(ID = ID %>% fct_drop()),
  vars = c('TASK', 'Condition', 't1.key', 'T1.response', 'age', 'T1.ACC'), 
  strata = c('ID'),
  factorVars = c('TASK', 'Condition', 't1.key'), 
  argsApprox = list(correct = FALSE), 
  smd = TRUE, 
  addOverall = TRUE, 
  test = TRUE) %>% 
  kableone()

output

|                        |Overall      |P1323        |P1403        |P1404        |p      |test |
|:-----------------------|:------------|:------------|:------------|:------------|:------|:----|
|n                       |2323         |775          |776          |772          |       |     |
|TASK = right (%)        |1164 (50.1)  |390 (50.3)   |386 (49.7)   |388 (50.3)   |0.969  |     |
|Condition = R (%)       |1168 (50.3)  |385 (49.7)   |435 (56.1)   |348 (45.1)   |<0.001 |     |
|t1.key = space (%)      |1168 (50.3)  |385 (49.7)   |435 (56.1)   |348 (45.1)   |<0.001 |     |
|T1.response (mean (SD)) |0.50 (0.50)  |0.50 (0.50)  |0.56 (0.50)  |0.45 (0.50)  |<0.001 |     |
|age (mean (SD))         |19.66 (0.94) |19.00 (0.00) |19.00 (0.00) |21.00 (0.00) |<0.001 |     |
|T1.ACC (mean (SD))      |0.70 (0.46)  |0.67 (0.47)  |0.77 (0.42)  |0.65 (0.48)  |<0.001 |     |

This makes more sense now, but is separate for each patient.

Alternatively, you can do this summary without using CreateTableOne, for example like this:

data_Confidence_in_Action %>% group_by(gender, ID) %>% 
  summarise(
    age = min(age)) %>% group_by(gender) %>% 
  summarise(
    n = n(),
    Min = min(age),
    Q1 = quantile(age, 1/4),  # quantile type 7 (the default)
    mean = mean(age),
    median = median(age),
    Q3 = quantile(age, 3/4),  # quantile type 7 (the default)
    Max = max(age),
    IQR = IQR(age),
    Kurt = e1071::kurtosis(age),
    skew = e1071::skewness(age),
    SD = sd(age))

output

# A tibble: 2 x 12
  gender     n   Min    Q1  mean median    Q3   Max   IQR  Kurt  skew    SD
  <fct>  <int> <dbl> <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 k         49    19    19  20.8     20    21    32     2  7.47 2.79   2.73
2 m          5    19    19  20.6     19    21    25     2 -1.29 0.823  2.61
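
If you would rather stay with tableone, the two ideas can be combined: first reduce the data to one row per subject, then let `print()` report the median and quartiles through its `nonnormal` argument. The following is only a sketch on made-up data (the `toy` tibble is hypothetical). It assumes gender and age are constant within each ID, and it averages the trial-level T1.ACC per subject, which is my choice and may not be the summary you want.

```r
library(dplyr)
library(tableone)

# Hypothetical stand-in for the real data: 12 subjects, 5 trials each
set.seed(1)
toy <- tibble(
  ID     = rep(paste0("P", 1:12), each = 5),
  gender = rep(rep(c("k", "m"), times = c(8, 4)), each = 5),
  age    = rep(c(19, 19, 20, 21, 32, 19, 20, 21, 19, 19, 21, 25), each = 5),
  T1.ACC = rbinom(60, 1, 0.7)
)

# One row per subject: constant columns are kept, accuracy is averaged
per_subject <- toy %>%
  group_by(ID, gender, age) %>%
  summarise(T1.ACC = mean(T1.ACC), .groups = "drop")

tab <- CreateTableOne(
  data = per_subject,
  vars = c("age", "T1.ACC"),
  strata = "gender",
  addOverall = TRUE
)

# nonnormal switches the listed variables to median [IQR]
# and to a rank-based test instead of the t-test / ANOVA
out <- print(tab, nonnormal = c("age", "T1.ACC"), printToggle = FALSE)
knitr::kable(out)
```

With your data, `per_subject` would have 54 rows, so the `n` row of the table would show subjects rather than trials.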

Think carefully and write down what you really expect. Provided, of course, that this topic is still of interest to you.

Ok, thanks. I know what the problem is. I can provide the right dataset with these typos fixed — 12666727b9, Sep 06 '21 at 20:30
The ID is repeated several times because different measurements are given for each of the 53 subjects — 12666727b9, Sep 06 '21 at 20:39
I think that with the data you have available it must be hard in any case, maybe because the 'M' gender is not taken into account, and that's why it would be easier to report the data in another way. Just let me know if you are able to handle this data — 12666727b9, Sep 06 '21 at 20:42
Does the solution I showed you (link) by any chance solve your problem? Of course you will have to adjust it to your variables. — Marek Fiołka, Sep 06 '21 at 20:42
Or can you save all the data to a csv file and share it here? — Marek Fiołka, Sep 06 '21 at 20:44
This is actually the best way. Do you know how I can edit and share the .csv file? — 12666727b9, Sep 06 '21 at 21:07
You are the author of the post. You can always edit it and add a link to the file. Place the file on e.g. a dropbox or google drive and share the link to it. However, write exactly what summary you expect. — Marek Fiołka, Sep 06 '21 at 21:16
I've uploaded the dataset. It is a .xls file at the link I mentioned. I'd like to reproduce the table in the picture, but with just a single observation per subject, if possible. — 12666727b9, Sep 07 '21 at 10:06
Great, these are useful, but not for getting the final table I'm looking for, which you can see in the picture. Basically I'm looking to create the same descriptive table with the following features: 1) grouped by gender, but using only the unique single observations (k = 49, m = 5) for the variables reported there, 2) showing the median and Q1, Q3 instead of the mean and SD for 'age', as it doesn't follow a normal distribution. — 12666727b9, Sep 08 '21 at 13:08
A really useful alternative to learn. However, the table I'd like to create is at the very top of my post. It would be the same, stratified by gender BUT JUST WITH SINGLE OBSERVATIONS (like in the tibble you've created), because the dataset contains repeated observations. Furthermore, as a variable like 'age' does not fit a normal distribution, I would like to know how to modify the statistics reported automatically in the table, or HOW TO REPLACE MEAN (SD) with median (Q1, Q3). If this is not possible, it does not matter. — 12666727b9, Sep 15 '21 at 09:49
Please continue the chat discussion https://chat.stackoverflow.com/rooms/info/237134/medical-data-analysis-pl — Marek Fiołka, Sep 15 '21 at 14:26
I know this will probably bother you, but I think you are the one user who will be able to figure this problem out. If you are willing to leave a contribution at https://stackoverflow.com/questions/69573318/how-to-create-a-submultiple-level-list that would be great. — 12666727b9, Oct 14 '21 at 18:37
I'll have a look there tomorrow when I have a moment. By the way, did you finally manage to bring the above problem to a satisfactory end? I waited for chat but you didn't answer. — Marek Fiołka, Oct 14 '21 at 20:49
I'm trying to enter through the link you sent, but it doesn't seem possible to get access. However, that was pretty instructive. Thank you so much. Just a little remark: if you don't mind, please have a look at this https://stackoverflow.com/questions/69528926/how-to-print-on-a-serie-sof-graphs-pairwise-comparisons-bars-and-effect-size-val question first (although it could be related to the one I pointed out in the previous comment). Thank you — 12666727b9, Oct 14 '21 at 21:00
Yes, I thought that this question came from an earlier one. Please give me all the links to related topics so I don't have to guess what you want. — Marek Fiołka, Oct 14 '21 at 21:06
Boxplots with comparisons can be done. I've done something like this before, but I need to refresh my memory. Write exactly what you mean by this 36-item list and what you really want to get. — Marek Fiołka, Oct 14 '21 at 21:09
As for the chat, it was in Polish. I guess no one was active there. Probably that's why it was removed. — Marek Fiołka, Oct 14 '21 at 21:12
Sure. It is what I've explained here https://stackoverflow.com/questions/69528926/how-to-print-on-a-serie-sof-graphs-pairwise-comparisons-bars-and-effect-size-val and if you need any other details we can always communicate in another way. — 12666727b9, Oct 14 '21 at 21:13
It would be a good chance to learn it, since I'm here in Poland now — 12666727b9, Oct 14 '21 at 21:14
Okay, as I wrote, I'll look at it tomorrow. Saturday at the latest. — Marek Fiołka, Oct 14 '21 at 21:21
Before I start solving your problem, I am asking you to answer a few questions. 1. What is the size of your actual data? Is the data included in the questions all of it? I would hate to run into the problem again that the number of observations in groups causes problems with some statistical functions. If there is more of this data, please provide me here (link in a comment) with the remaining data. I will confirm when I have downloaded it, and then you can delete the comment along with the link. — Marek Fiołka, Oct 15 '21 at 13:21
2. Did you perform the Shapiro-Wilk test on your data? This is extremely important because if the data do not follow the normal distribution, you will not be able to use any parametric comparative tests (Student's t, ANOVA, etc.). 3. Do you allow the p-value of the Shapiro-Wilk test to be plotted? I recommend that you include this statistic. 4. If the data do not follow the normal distribution, do you allow the use of the non-parametric Wilcoxon test? — Marek Fiołka, Oct 15 '21 at 13:21
5. If it turns out that all the data come from the normal distribution, as confirmed by the Shapiro-Wilk test, what kind of comparative test do you want to use? Student's t, ANOVA or some other? — Marek Fiołka, Oct 15 '21 at 13:25
6. Do you prefer the Kolmogorov-Smirnov test instead of Shapiro-Wilk? If so, write about it. 7. Have you tried to analyze your data on a QQ plot? This may give additional information about the normality of the distribution. — Marek Fiołka, Oct 15 '21 at 13:28
8. Do you allow an additional violin plot besides the boxplot and points (jitter spread)? This will show what the actual density distribution of your results is. — Marek Fiołka, Oct 15 '21 at 13:30
1. No, actually my dataset contains 75 observations. I don't know how to report it in full on that page. 2, 4 & 7. I've run the Shapiro test among IDs (and it is OK); there are some cases in which normality is not respected, and I thought I would split them from the dataset and run kruskal.test on them separately. 3 & 7. I've tried the function you suggested, but the problem is that I do not know which subset of the data I should use (I just used some boxplots for this purpose; I also tried qqplot but was not able to adjust the plot). 5. Pairwise comparison with Bonferroni adjustment. — 12666727b9, Oct 15 '21 at 13:50
Please join this [chat room](https://chat.stackoverflow.com/rooms/238219/dywagacje-statystyczne-pl). I'm waiting for you there. — Marek Fiołka, Oct 16 '21 at 12:40
