2

This is what I have as data in R:

SQR_1 is linked with Freq1 SQR_2 is linked with Freq2.. and so on.

GUID SESSION_SKEY SEQNUM parent_uid SQR_01 SQR_02 SQR_03 SQR_04 SQR_05 SQR_06 SQR_07 SQR_08 Freq1 Freq2 Freq3 Freq4 Freq5 Freq6 Freq7 Freq8
004ce66617739f9705a73dd001dd5ff7 3.84E+13 56 2216028557 volkite culverin NA NA NA NA NA NA 3 2 NA NA NA NA NA NA
004ce66617739f9705a73dd001dd5ff7 3.84E+13 153 2216028557 contemptor dreadnought volkite NA NA NA NA NA 3 2 3 NA NA NA NA NA
004ce66617739f9705a73dd001dd5ff7 3.84E+13 217 2216028557 land raider prometheus NA NA NA NA NA 9 1 1 NA NA NA NA NA
004ce66617739f9705a73dd001dd5ff7 3.84E+13 12 2216028557 contemptor pattern volkite culverin NA NA NA NA 3 1 3 2 NA NA NA NA
05f7cdbb17a0a45e3fcb79bfffc8817c 3.84E+13 250 1297482930 fake london genius NA NA NA NA NA 2 2 2 NA NA NA NA NA
0827fedf17611f9bede0fab001e6dcad 3.84E+13 62 72778457 teapot for one set NA NA NA NA 1 26 4 21 NA NA NA NA

This is what I want as final output ("Max freq" & "SQR_word" as output):

GUID SESSION_SKEY SEQNUM parent_uid SQR_01 SQR_02 SQR_03 SQR_04 SQR_05 SQR_06 SQR_07 SQR_08 Freq1 Freq2 Freq3 Freq4 Freq5 Freq6 Freq7 Freq8 Max freq SQR_word
004ce66617739f9705a73dd001dd5ff7 3.84E+13 56 2216028557 volkite culverin NA NA NA NA NA NA 3 2 NA NA NA NA NA NA 3 volkite
004ce66617739f9705a73dd001dd5ff7 3.84E+13 153 2216028557 contemptor dreadnought volkite NA NA NA NA NA 3 2 3 NA NA NA NA NA 3 contemptor,volkite
004ce66617739f9705a73dd001dd5ff7 3.84E+13 217 2216028557 land raider prometheus NA NA NA NA NA 9 1 1 NA NA NA NA NA 9 land
004ce66617739f9705a73dd001dd5ff7 3.84E+13 12 2216028557 contemptor pattern volkite culverin NA NA NA NA 3 1 3 2 NA NA NA NA 3 contemptor,volkite
05f7cdbb17a0a45e3fcb79bfffc8817c 3.84E+13 250 1297482930 fake london genius NA NA NA NA NA 2 2 2 NA NA NA NA NA 2 fake,london,genius
0827fedf17611f9bede0fab001e6dcad 3.84E+13 62 72778457 teapot for one set NA NA NA NA 1 26 4 21 NA NA NA NA 26 for
Phil
  • 7,287
  • 3
  • 36
  • 66
rgambhava
  • 47
  • 5
  • Instead of posting pictures, please provide a [reproducible example](https://stackoverflow.com/a/5963610) ("reprex"), by pasting the output of `dput(your_dataset)`. – Greg Jan 04 '22 at 15:03
  • @Greg could you please help? – rgambhava Jan 04 '22 at 15:20
  • Thank you for putting the data in a nice table! But please also paste the output from `dput(your_data)`. – Greg Jan 04 '22 at 15:26

1 Answers1

3

We could use base R to do this - get the index of 'Freq' and 'SQR' columns ('i1', 'i2'), then get the max of the 'Freq' columns with pmax for each row, replace the 'SQR' corresponding columns where the value for 'Freq' columns is not max to NA, loop over the rows with apply (MARGIN = 1), remove the NA and paste the 'SQR' words. Create two new columns in 'df1'

i1 <- startsWith(names(df1), 'Freq')
i2 <- startsWith(names(df1), "SQR")
 
f1 <- function(x) {
        if(all(is.na(x))) {
         NA_character_
        } else paste(na.omit(x), collapse = ",")
}
        
mx <- do.call(pmax, c(df1[i1], na.rm = TRUE))
wrd <- apply(replace(df1[i2], df1[i1] != mx, NA), 1, FUN = f1)
df1[c("MaxFreq", "SQR_word")] <- list(mx, wrd)

-output

> df1
                              GUID SESSION_SKEY SEQNUM parent_uid     SQR_01      SQR_02     SQR_03   SQR_04 SQR_05 SQR_06 SQR_07 SQR_08 Freq1 Freq2 Freq3 Freq4
1 004ce66617739f9705a73dd001dd5ff7     3.84e+13     56 2216028557    volkite    culverin       <NA>     <NA>     NA     NA     NA     NA     3     2    NA    NA
2 004ce66617739f9705a73dd001dd5ff7     3.84e+13    153 2216028557 contemptor dreadnought    volkite     <NA>     NA     NA     NA     NA     3     2     3    NA
3 004ce66617739f9705a73dd001dd5ff7     3.84e+13    217 2216028557       land      raider prometheus     <NA>     NA     NA     NA     NA     9     1     1    NA
4 004ce66617739f9705a73dd001dd5ff7     3.84e+13     12 2216028557 contemptor     pattern    volkite culverin     NA     NA     NA     NA     3     1     3     2
5 05f7cdbb17a0a45e3fcb79bfffc8817c     3.84e+13    250 1297482930       fake      london     genius     <NA>     NA     NA     NA     NA     2     2     2    NA
6 0827fedf17611f9bede0fab001e6dcad     3.84e+13     62   72778457     teapot         for        one      set     NA     NA     NA     NA     1    26     4    21
  Freq5 Freq6 Freq7 Freq8 MaxFreq           SQR_word
1    NA    NA    NA    NA       3            volkite
2    NA    NA    NA    NA       3 contemptor,volkite
3    NA    NA    NA    NA       9               land
4    NA    NA    NA    NA       3 contemptor,volkite
5    NA    NA    NA    NA       2 fake,london,genius
6    NA    NA    NA    NA      26                for

Or may use tidyverse to create the columns - reshape to 'long' format with pivot_longer and do a group by summarise to create the columns and then bind the columns with original data

library(dplyr)
library(tidyr)
library(stringr)
df1 %>% 
   mutate(rn = row_number()) %>% 
   dplyr::select(rn, starts_with("SQR"), starts_with("Freq")) %>% 
   rename_with(~ str_remove(., "_0?")) %>%
   pivot_longer(cols = -rn, names_to = c(".value", "grp"),
       names_sep = "(?<=\\D)(?=\\d)", values_drop_na = TRUE) %>% 
   group_by(rn) %>% 
   summarise(MaxFreq = max(Freq), 
         SQR_word = str_c(SQR[Freq == MaxFreq], collapse=",")) %>% 
   select(-rn) %>%
   bind_cols(df1, .)

-output

                             GUID SESSION_SKEY SEQNUM parent_uid     SQR_01      SQR_02     SQR_03   SQR_04 SQR_05 SQR_06 SQR_07 SQR_08 Freq1 Freq2 Freq3 Freq4
1 004ce66617739f9705a73dd001dd5ff7     3.84e+13     56 2216028557    volkite    culverin       <NA>     <NA>     NA     NA     NA     NA     3     2    NA    NA
2 004ce66617739f9705a73dd001dd5ff7     3.84e+13    153 2216028557 contemptor dreadnought    volkite     <NA>     NA     NA     NA     NA     3     2     3    NA
3 004ce66617739f9705a73dd001dd5ff7     3.84e+13    217 2216028557       land      raider prometheus     <NA>     NA     NA     NA     NA     9     1     1    NA
4 004ce66617739f9705a73dd001dd5ff7     3.84e+13     12 2216028557 contemptor     pattern    volkite culverin     NA     NA     NA     NA     3     1     3     2
5 05f7cdbb17a0a45e3fcb79bfffc8817c     3.84e+13    250 1297482930       fake      london     genius     <NA>     NA     NA     NA     NA     2     2     2    NA
6 0827fedf17611f9bede0fab001e6dcad     3.84e+13     62   72778457     teapot         for        one      set     NA     NA     NA     NA     1    26     4    21
  Freq5 Freq6 Freq7 Freq8 MaxFreq           SQR_word
1    NA    NA    NA    NA       3            volkite
2    NA    NA    NA    NA       3 contemptor,volkite
3    NA    NA    NA    NA       9               land
4    NA    NA    NA    NA       3 contemptor,volkite
5    NA    NA    NA    NA       2 fake,london,genius
6    NA    NA    NA    NA      26                for

data

df1 <- structure(list(GUID = c("004ce66617739f9705a73dd001dd5ff7", "004ce66617739f9705a73dd001dd5ff7", 
"004ce66617739f9705a73dd001dd5ff7", "004ce66617739f9705a73dd001dd5ff7", 
"05f7cdbb17a0a45e3fcb79bfffc8817c", "0827fedf17611f9bede0fab001e6dcad"
), SESSION_SKEY = c(3.84e+13, 3.84e+13, 3.84e+13, 3.84e+13, 3.84e+13, 
3.84e+13), SEQNUM = c(56L, 153L, 217L, 12L, 250L, 62L), parent_uid = c(2216028557, 
2216028557, 2216028557, 2216028557, 1297482930, 72778457), SQR_01 = c("volkite", 
"contemptor", "land", "contemptor", "fake", "teapot"), SQR_02 = c("culverin", 
"dreadnought", "raider", "pattern", "london", "for"), SQR_03 = c(NA, 
"volkite", "prometheus", "volkite", "genius", "one"), SQR_04 = c(NA, 
NA, NA, "culverin", NA, "set"), SQR_05 = c(NA, NA, NA, NA, NA, 
NA), SQR_06 = c(NA, NA, NA, NA, NA, NA), SQR_07 = c(NA, NA, NA, 
NA, NA, NA), SQR_08 = c(NA, NA, NA, NA, NA, NA), Freq1 = c(3L, 
3L, 9L, 3L, 2L, 1L), Freq2 = c(2L, 2L, 1L, 1L, 2L, 26L), Freq3 = c(NA, 
3L, 1L, 3L, 2L, 4L), Freq4 = c(NA, NA, NA, 2L, NA, 21L), Freq5 = c(NA, 
NA, NA, NA, NA, NA), Freq6 = c(NA, NA, NA, NA, NA, NA), Freq7 = c(NA, 
NA, NA, NA, NA, NA), Freq8 = c(NA, NA, NA, NA, NA, NA)), class = "data.frame", row.names = c(NA, 
-6L))
akrun
  • 874,273
  • 37
  • 540
  • 662
  • hi @akrun I have more than 5000 records. I am getting the below error: Error: Can't recycle `..1` (size 5000) to match `..2` (size 2966). Run `rlang::last_error()` to see where the error occurred. In addition: Warning message: Expected 2 pieces. Missing pieces filled with `NA` in 1 rows [1]. – rgambhava Jan 05 '22 at 09:31
  • @rgambhava not sure about the error. Do you have the updated packages? I showed a base R solution as well, which doesn't require any packages. Have you tried that? – akrun Jan 05 '22 at 16:32
  • Yes I have tried that as well. Below is the error I am getting: > wrd <- apply(replace(df6[i2], df6[i1] != mx, NA), 1, + FUN = function(x) paste(na.omit(x), collapse = ",")) Error in `[.data.table`(df6, i2) : i evaluates to a logical vector length 39 but there are 5000 rows. Recycling of logical i is no longer allowed as it hides more bugs than is worth the rare convenience. Explicitly use rep(...,length=.N) if you really need to recycle. – rgambhava Jan 05 '22 at 17:31
  • > df6[c("MaxFreq", "SQR_word")] <- list(mx, wrd) Error in `[.data.table`(x, i, which = TRUE) : When i is a data.table (or character vector), the columns to join by must be specified using 'on=' argument (see ?data.table), by keying x (i.e. sorted, and, marked as sorted, see ?setkey), or by sharing column names between x and i (i.e., a natural join). Keyed joins might have further speed benefits on very large data due to x being sorted in RAM. – rgambhava Jan 05 '22 at 17:31
  • @rgambhava I believe that it is an issue with `na.omit` if there are no non-NA elements. We can create a logical with `FUN = function(x) if(all(is.na(x))) NA_character_ else paste(na.omit(x), collapse = ","))` – akrun Jan 05 '22 at 17:32
  • @rgambhava try the update in the post and see if you have the same issue – akrun Jan 05 '22 at 17:35
  • 1
    @rgambhava your second issue is that you have a `data.table`. whereas I used `data.frame` . Convert to `data.frame` and it should work for the base R i.e. `setDF(df1)` – akrun Jan 05 '22 at 17:37
  • 1
    Thank you so much for your help. You made my day. Updated code is working. – rgambhava Jan 07 '22 at 15:10