2

I have a large dataset (about 12,000 columns) that looks like this

    > df
    ID Group val1 val2 val3
    1 01     a    3    3    3
    2 02     a    4    4    4
    3 03     b    6    6    7
    4 04     c   10   10   19
    5 05     b    2    2    2
    6 06     b    4    4    4
    7 07     c    8    8    8
    8 08     c   12   12   12
  1. loop through each column and get an IQR for each Group.

  2. Then calculate for each column per group a deltaIQR...

    For example

    delta IQR of B = ( IQR of group B - IQR of Group A) / IQR of Group A

    delta IQR of C = (IQR of group C - IQR of Group A) / IQR of Group A

What is the most efficient way to do this?

I attempted a dplyr summarise by Group solution but the df is too big. And also I need to calculate quantiles first, etc. So it gets more unwieldy...

Using the dplyr solution before brings in some errors

df %>%
  group_by(Group) %>%
  summarise_at(vars(matches('val')), IQR) %>% 
  rename_at(-1, ~ paste0(., "_IQR")) %>% 
  mutate_at(vars(matches('val')), list(delta= ~ (. - .[1])/.[1]))

In my actual dataset

> temp
  v6599_IQR v6599_IQR_delta v1554_IQR v1554_IQR_delta
1   0.00191803       0.000000e+00  0.001794153       0.000000e+00
2   0.62698976       3.258926e+02  1.722508234       9.590677e+02
3   0.00191803       7.235440e-15  0.001794153       4.641005e-14
4   0.00191803      -3.617720e-14  2.155928869       1.200642e+03

Now there seems to be an error, because when I calculate the deltaIQR for 3 and 4... the calculation is off, for the first column, delta IQR for rows 3 and 4 should be 0.

M--
  • 25,431
  • 8
  • 61
  • 93
R-MASHup
  • 365
  • 2
  • 9
  • Can you share output of `dput(head(yourdata,10)`? – M-- Jul 14 '19 at 07:04
  • it's so much data, but I've shown you output for 2 columns, for both IQR and deltaIQR, rows 1-4 are the separate groups – R-MASHup Jul 14 '19 at 07:09
  • the syntax that Ive shown you, just shares 10 rows, so size of data aint matter. – M-- Jul 14 '19 at 07:10
  • do you mean my input data or the output of dplyr (see edited post) – R-MASHup Jul 14 '19 at 07:11
  • i have 4 rows = 4 groups and 13k columns, so its getting messy, let me get you 2 columns – R-MASHup Jul 14 '19 at 07:15
  • here are IQR values > dput(IQR_df[1:4, 1:5]) structure(c(0.0531648540599649, 0.0218245928238288, 0.0146718205657, 0.0154911043773313, 0.176883613595922, 0.120291707243372, 0.0850005011158316, 0.0520534927087794, 0.126075199371304, 0.0154998577364283, 0.110140567441898, 0.0237390701827948, 0.137702828818274, 0.0357213021287587, 0.0273324430238793, 0.0261408040809015, 0.02347761391114, 0.0417399139988738, 0.0314575353325635, 0.0527635476248663), .Dim = 4:5, .Dimnames = list(c("grp1", "grp2", "grp3", "grp4"), c("v12245", "v12244", "v12247", "v12246", "v12241"))) – R-MASHup Jul 14 '19 at 07:19
  • What do you mean by delta for 3 and 4? You want to calculate it for each group? You probably need to ask another question reference this one and elaborate on how many actual steps you want to take. For starters, I think you should transpose you data from wide to long cause 13k cols is not efficient. P.S. the reason I am asking for another question is because this is getting deviated from original post and each thread should ask one question and follow ups should not fall out of the original scope. Ping me here, and I will take a look at your 2nd Q. And don't post data to comments. – M-- Jul 14 '19 at 07:27
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/196438/discussion-between-r-mashup-and-m-m). – R-MASHup Jul 14 '19 at 07:33
  • I've edited the post title, because it was misleading to begin with – R-MASHup Jul 14 '19 at 08:12
  • 1
    Are you wondering why the delta IQR is 7.235440e-15 and -3.617720e-14 instead of 0.000000e+00? That's due to floating point arithmetic which is explained in depth [here: Why are these numbers not equal?](https://stackoverflow.com/a/9508558/3817004) – Uwe Jul 17 '19 at 06:53

1 Answers1

2

Update:

To calculate deltaIQR I am using dplyr.

library(dplyr)

df %>%
  group_by(Group) %>%
  summarise_at(vars(matches('val')), IQR) %>% 
  rename_at(-1, ~ paste0(., "_IQR")) %>% 
  mutate_at(vars(matches('val')), list(delta= ~ (. - .[1])/.[1]))

#> # A tibble: 3 x 7
#>   Group val1_IQR val2_IQR val3_IQR val1_IQR_delta val2_IQR_delta val3_deltaIQR
#>   <fct>    <dbl>    <dbl>    <dbl>          <dbl>          <dbl>         <dbl>
#> 1 a          0.5      0.5      0.5              0              0             0
#> 2 b          2        2        2.5              3              3             4
#> 3 c          2        2        5.5              3              3            10

Thanks to akrun for his comment on dplyr solution


Looping through columns to calculate IQR can be done in base:

sapply(df[,3:5], function(x) tapply(x, df$Group, IQR))

#>   val1 val2 val3
#> a  0.5  0.5  0.5
#> b  2.0  2.0  2.5
#> c  2.0  2.0  5.5

Data:

df <- read.table(text="ID Group val1 val2 val3
                       01     a    3    3    3
                       02     a    4    4    4
                       03     b    6    6    7
                       04     c   10   10   19
                       05     b    2    2    2
                       06     b    4    4    4
                       07     c    8    8    8
                       08     c   12   12   12", header=T)
Community
  • 1
  • 1
M--
  • 25,431
  • 8
  • 61
  • 93