Duplicate the rows based on some criteria in SQL or R

Question

I use R to generate a toy set

data.frame(name = c("Tom", "Shane", "Daniel", "Akira", "Jack", "Zoe"), c1 = c(1,2,3,0,5,0), c2 = c(0, 3, 5, 0,4,0), c3 = c(0, 0,1,0,0,3), c4=c(0,0,0,1,0,0))

which is displayed below:

I only care about the columns c1, c2, c3, c4, and if a specific row has more than one value, which is greater than 0. we need to duplicate rows to make sure that there are only one value, which is greater than 0, and then remove the original row.

For instance, the second row has two values are greater than 0 (c1: 2, c2: 3), then we have to duplicate that row to two, which looks like this

Shane 2 0 0 0

Shane 0 3 0 0

I am trying to build a SQL query to capture this. However, I am not sure if there is any SQL function can detect multiple non-zero values in a specific row without looking at the result first. Anyway the final result should look like this, if there any magical SQL functions exist:

I also think about to use R to accomplish it. The only R function I know can duplicate rows is do.call() function, then combine it with rbind() function. However, it is not working for my case. Could you someone give me any hints? Many Thanks :)

Please, when posting data, post *actual data* (e.g., with `dput`) and not an image of it. I cannot highlight this image and paste into my R session to try to play with things in order to help you out. I am not nearly bored enough to transcribe all of your data into a sample data.frame to play. I suggest you read about [reproducible questions](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) for more guidance on how to do this well, making it easier on the rest of us. Thanks! — r2evans, Jan 04 '18 at 17:53
@r2evans: Sorry about that. this is the R code to generate the toy set:` data.frame(name = c("Tom", "Shane", "Daniel", "Akira", "Jack", "Zoe"), c1 = c(1,2,3,0,5,0), c2 = c(0, 3, 5, 0,4,0), c3 = c(0, 0,1,0,0,3), c4=c(0,0,0,1,0,0))` — Akira, Jan 04 '18 at 17:55
Please don't post code in comments. Edit the question, it'll look much better. — r2evans, Jan 04 '18 at 17:56

score 2 · Answer 1 · answered Jan 04 '18 at 18:04

You can do this with a few tidyverse functions. First, we enter your sample data

library(tidyverse)
dd <- tribble(~name, ~c1, ~c2, ~c3, ~c4,
        "Tom", 1, 0, 0, 0,
        "Shane", 2, 3, 0, 0,
        "Daniel", 3, 5, 1, 0,
        "Akira", 0, 0, 0 ,1,
        "Jack", 5, 4, 0, 0,
        "Zoe", 0, 0, 3, 0)

Then we gather, filter, and spread to get the rows you want. By adding in a row id, we keep the different values on different rows.

dd %>% 
  gather("var", "val", -name) %>% 
  rowid_to_column() %>% 
  filter(val>0) %>% 
  spread(var, val, fill=0) %>% 
  select(-rowid)
# A tibble: 10 x 5
#      name    c1    c2    c3    c4
#  *  <chr> <dbl> <dbl> <dbl> <dbl>
#  1    Tom     1     0     0     0
#  2  Shane     2     0     0     0
#  3 Daniel     3     0     0     0
#  4   Jack     5     0     0     0
#  5  Shane     0     3     0     0
#  6 Daniel     0     5     0     0
#  7   Jack     0     4     0     0
#  8 Daniel     0     0     1     0
#  9    Zoe     0     0     3     0
# 10  Akira     0     0     0     1

Thanks so much for elegant and beautiful solution. One question is that what does "val" and "var" mean in the code? — Akira, Jan 04 '18 at 18:21
The gather step adds new columns and those are just the names I choose for those new columns. You could call them something else if you like. — MrFlick, Jan 04 '18 at 18:30

score 2 · Accepted Answer · answered Jan 04 '18 at 18:06

2

Perhaps another option using a CROSS APPLY

Example

Select A.Name
      ,B.*
 From  YourTable A
 Cross Apply ( values (C1,0,0,0)
                     ,(0,C2,0,0)
                     ,(0,0,C3,0)
                     ,(0,0,0,C4)
             ) B (C1,C2,C3,C4)
 Where B.C1+B.C2+B.C3+B.C4<>0

Returns

answered Jan 04 '18 at 18:06

John Cappelletti

79,615
7
44
66

but @John .. you sure the OP is using SQL Server? – Vamsi Prabhala Jan 04 '18 at 18:07
@VamsiPrabhala Answered a SQL Server question for her last night and OP pinged me on this one as well. So am I 100% sure ... no, but I'll go with 98% sure :) – John Cappelletti Jan 04 '18 at 18:11
@JohnCappelletti Thanks a lot. I never heard about "Cross Apply" function before, and it works great. Learned a lot of cool stuff today! – Akira Jan 04 '18 at 20:03
@AkiraKaneshiro Very useful function. I tend to think of it as a sub-routine. Cross Apply will exclude NULL record sets while Outer Apply will include NULL records sets. I learn something everyday on SO. That's the fun part – John Cappelletti Jan 04 '18 at 20:06

score 1 · Answer 3 · answered Jan 04 '18 at 18:12

1

One more option using union all.

select name,c1,0 as c2,0 as c3,0 as c4 from tbl where c1>0
union all
select name,0,c2,0,0 from tbl where c2>0
union all
select name,0,0,c3,0 from tbl where c3>0
union all
select name,0,0,0,c4 from tbl where c4>0

answered Jan 04 '18 at 18:12

Vamsi Prabhala

48,685
4
36
58

score 1 · Answer 4 · answered Jan 04 '18 at 18:26

df1 = data.frame(name = c("Tom", "Shane", "Daniel", "Akira", "Jack", "Zoe"),
                 c1 = c(1,2,3,0,5,0),
                 c2 = c(0, 3, 5, 0,4,0),
                 c3 = c(0, 0,1,0,0,3),
                 c4=c(0,0,0,1,0,0))

df2 = df1[rep(1:NROW(df1), apply(df1, 1, function(x) sum(x[-(1)] > 0))),]
df3 = df2
df3[-1] = df3[-1] * 0
df3[ave(1:NROW(df2), df2$name, FUN = length) == 1,] = df2[ave(1:NROW(df2), df2$name, FUN = length) == 1,]
replace(x = df3,
        list = cbind(1:NROW(df3), 1+ave(1:NROW(df2), df2$name, FUN = seq_along)),
        values = df2[cbind(1:NROW(df3), 1+ave(1:NROW(df2), df2$name, FUN = seq_along))])
#      name c1 c2 c3 c4
#1      Tom  1  0  0  0
#2    Shane  2  0  0  0
#2.1  Shane  0  3  0  0
#3   Daniel  3  0  0  0
#3.1 Daniel  0  5  0  0
#3.2 Daniel  0  0  1  0
#4    Akira  0  0  0  1
#5     Jack  5  0  0  0
#5.1   Jack  0  4  0  0
#6      Zoe  0  0  3  0

score 1 · Answer 5 · answered Jan 04 '18 at 20:04

Consider base R with by that builds a zero padded dataframe for each distinct name then row binds all dataframes into final one, similar to union SQL:

df_list <- by(df, df$name, FUN = function(d){

  tmp <- data.frame(name = d$name[1],
             c1 = c(max(d$c1), rep(0, 3)),
             c2 = c(0, max(d$c2), rep(0, 2)),
             c3 = c(rep(0, 2), max(d$c3), 0),
             c4 = c(rep(0, 3), max(d$c4)))

  tmp <- tmp[rowSums(tmp[-1])!=0,]
  row.names(tmp) <- NULL
  tmp

})

final_df <- do.call(rbind, unname(df_list))
final_df

Duplicate the rows based on some criteria in SQL or R

5 Answers5

Linked