Select columns based on columns sum

Question

Any suggestions on how to select, for each row, the columns where the value is 1 and the column sum is also 1? In other words, I want to select only the unique values, those not shared with the other individuals.

indv. X Y Z W T J
A     1 0 1 0 0 1
B     0 1 1 0 0 0
C     0 0 1 1 0 0
D     0 0 1 0 1 0

A: X, J
B: Y
C: W
D: T

I would go with `indx <- which((colSums(x) < 2)[col(x)] & (x > 0), arr.ind = TRUE) ; data.frame(res = tapply(names(x)[indx[, "col"]], indx[, "row"], toString))` (if `x` is your data set) — David Arenburg, Mar 12 '18 at 10:08
With my data, it shows the non-shared items but does not show the row names for each set; it shows the line number instead. For example: ***res 10 VFG016167, VFG043115*** — F.Lira, Mar 12 '18 at 10:29
I'm not a big fan of rownames, but it is a small fix `data.frame(res = tapply(names(x)[indx[, "col"]], rownames(x)[indx[, "row"]], toString))` (P.S. I'm assuming that `indv` are row names?) — David Arenburg, Mar 12 '18 at 10:35
Next time, make a reproducible example so there's no ambiguity about the structure of your data. https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example/28481250#28481250 — Frank, Mar 12 '18 at 14:27
Following that advice, I uploaded my file to https://github.com/felipelira/files_to_test/blob/master/matrix_renamed.tab so that others can understand the problem. It is also in another question: https://stackoverflow.com/review/suggested-edits/19083066 — F.Lira, Mar 12 '18 at 14:52
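
A minimal sketch that unpacks the one-liner from the comments above into separate steps. It assumes `x` is the example table stored as a data frame with the individuals as row names; the construction of `x` and the helper names `m`, `col_ok` and `hit` are illustrative and not part of the original comments. For 0/1 data, testing `== 1` is equivalent to the `< 2` and `> 0` tests used there.

```r
# Example data from the question; individuals stored as row names (assumption)
x <- data.frame(
  X = c(1, 0, 0, 0), Y = c(0, 1, 0, 0), Z = c(1, 1, 1, 1),
  W = c(0, 0, 1, 0), T = c(0, 0, 0, 1), J = c(1, 0, 0, 0),
  row.names = c("A", "B", "C", "D")
)
m <- as.matrix(x)

# TRUE where a cell is 1 and its column sums to exactly 1
col_ok <- unname(colSums(m) == 1)
hit <- (m == 1) & col_ok[col(m)]

# Row and column positions of the hits
indx <- which(hit, arr.ind = TRUE)

# Column names grouped by row name
res <- tapply(colnames(m)[indx[, "col"]], rownames(m)[indx[, "row"]], toString)
data.frame(res = res)
```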

Sotos · Answer 1 · 2018-03-12T10:00:36.990

3

One idea is to use a row-wise apply to find the columns containing 1, after filtering out the columns whose sum is not equal to 1, i.e.

apply(df[colSums(df) == 1], 1, function(i) names(df[colSums(df) == 1])[i == 1])

$A
[1] "X" "J"

$B
[1] "Y"

$C
[1] "W"

$D
[1] "T"

You can play around with the output to get it into the desired shape, e.g.

apply(df[colSums(df) == 1], 1, function(i) toString(names(df[colSums(df) == 1])[i == 1]))
#     A      B      C      D 
#"X, J"    "Y"    "W"    "T"

Or

data.frame(cols = apply(df[colSums(df) == 1], 1, function(i) toString(names(df[colSums(df) == 1])[i == 1])))

#  cols
#A X, J
#B    Y
#C    W
#D    T
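
The calls above recompute `colSums(df) == 1` inside the function for every row. A minimal sketch of the same idea that filters once and reuses the result; the helper name `uniq` and the way `df` is built here are assumptions for illustration only:

```r
# Example data from the question (row names hold the individuals)
df <- data.frame(
  X = c(1, 0, 0, 0), Y = c(0, 1, 0, 0), Z = c(1, 1, 1, 1),
  W = c(0, 0, 1, 0), T = c(0, 0, 0, 1), J = c(1, 0, 0, 0),
  row.names = c("A", "B", "C", "D")
)

# Filter once: keep only the columns whose sum is exactly 1
uniq <- df[, colSums(df) == 1, drop = FALSE]

# For each row, paste together the names of the kept columns that hold a 1
res <- vapply(
  seq_len(nrow(uniq)),
  function(r) toString(names(uniq)[unlist(uniq[r, ]) == 1]),
  character(1)
)
names(res) <- rownames(uniq)
res
```

A row with no unique column would end up as an empty string in this sketch, so `res[nzchar(res)]` should keep only the rows that meet the condition.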

edited Mar 12 '18 at 10:00

answered Mar 12 '18 at 09:49

If I only want the names and values of the rows that meet the condition, how can I print only those? – F.Lira Mar 12 '18 at 11:56

Lodewic Van Twillert · Accepted Answer · 2018-03-12T10:37:36.760

Here you go! A solution in base R. First we simulate your data, a data.frame with named rows and columns.

You can use sapply() to loop over the column indices. A for-loop over the column indices will achieve the same thing.

Finally, save the results in a data.frame however you want.

# Simulate your example data
df <- data.frame(matrix(c(1, 0, 1, 0, 0, 1,
                          0, 1, 1, 0, 0, 0,
                          0, 0, 1, 1, 0, 0,
                          0, 0, 1, 0, 1, 0), nrow = 4, byrow = TRUE))

# Name rows and columns accordingly
names(df) <- c("X", "Y", "Z", "W", "T", "J")
rownames(df) <- c("A", "B", "C", "D")

> df
  X Y Z W T J
A 1 0 1 0 0 1
B 0 1 1 0 0 0
C 0 0 1 1 0 0
D 0 0 1 0 1 0

Then we select the columns whose sum is 1, i.e. the columns with unique values. For each of these columns, we find the row that holds the 1.

# Select columns with unique values (if sum of column == 1)
unique.cols <- which(colSums(df) == 1)
# For every one of these columns, select the row where row-value==1
unique.rows <- sapply(unique.cols, function(x) which(df[, x] == 1))

> unique.cols
X Y W T J 
1 2 4 5 6 

> unique.rows
X Y W T J 
1 2 3 4 1
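
As mentioned earlier, a for-loop over the column indices achieves the same thing as the `sapply()` call. A minimal sketch of that loop, rebuilding the example data so it runs on its own; the loop variable `k` is just an illustrative name:

```r
# Same example data as above
df <- data.frame(matrix(c(1, 0, 1, 0, 0, 1,
                          0, 1, 1, 0, 0, 0,
                          0, 0, 1, 1, 0, 0,
                          0, 0, 1, 0, 1, 0), nrow = 4, byrow = TRUE))
names(df) <- c("X", "Y", "Z", "W", "T", "J")
rownames(df) <- c("A", "B", "C", "D")

# Columns whose sum is exactly 1
unique.cols <- which(colSums(df) == 1)

# Pre-allocate the result, then fill in the row index for each column
unique.rows <- integer(length(unique.cols))
names(unique.rows) <- names(unique.cols)
for (k in seq_along(unique.cols)) {
  unique.rows[k] <- which(df[, unique.cols[k]] == 1)
}
unique.rows
```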

The rows are not named correctly yet (the elements still carry the names of unique.cols). So we index the rownames of df to get the row names.

# Data.frame of unique values,
# with rows and columns in separate columns
df.unique <- data.frame(Cols = unique.cols,
                        Rows = unique.rows,
                        Colnames = names(unique.cols),
                        Rownames = rownames(df)[unique.rows],
                        row.names = NULL)

The result:

df.unique
  Cols Rows Colnames Rownames
1    1    1        X        A
2    2    2        Y        B
3    4    3        W        C
4    5    4        T        D
5    6    1        J        A

Edit:

This is how you could summarise the values per row using dplyr.

library(dplyr)

df.unique %>% group_by(Rownames) %>%
  summarise(paste(Colnames, collapse=", "))

# A tibble: 4 x 2
  Rownames `paste(Colnames, collapse = ", ")`
  <fct>    <chr>                             
1 A        X, J                              
2 B        Y                                 
3 C        W                                 
4 D        T
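
If you would rather stay in base R for this last step, here is a minimal sketch using `aggregate()`. It rebuilds only the two relevant columns of `df.unique` by hand so that it runs on its own; the name `grouped` is illustrative:

```r
# The two relevant columns of df.unique from above, typed in by hand
df.unique <- data.frame(
  Colnames = c("X", "Y", "W", "T", "J"),
  Rownames = c("A", "B", "C", "D", "A"),
  stringsAsFactors = FALSE
)

# Collapse the column names within each row name
grouped <- aggregate(Colnames ~ Rownames, data = df.unique, FUN = toString)
grouped
```

This also gives the summary column a plain name (`Colnames`) instead of the full `paste(...)` expression.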

Nice workflow. No way to group the values? for example, A = X and J? — F.Lira, Mar 12 '18 at 10:05
Rownames "`"paste(Colnames, collapse = ", ")"`" Error: unexpected symbol in "Rownames `paste(Colnames, collapse = ", ")`" — F.Lira, Mar 12 '18 at 10:43
Sorry, the bottom table is only the output. The only command you need is: `df.unique %>% group_by(Rownames) %>% summarise(paste(Colnames, collapse=", "))` Or did the error come from there? It works for me. — Lodewic Van Twillert, Mar 12 '18 at 11:20
No no, the error was because I included the "Rownames `paste(Colnames, collapse = ", ")`"in the script. — F.Lira, Mar 12 '18 at 15:10

akrun · Answer 3 · 2018-03-12T10:03:31.000

Here is an option with tidyverse. We gather the dataset into 'long' format, group by 'key', filter the rows where 'val' is 1 and the sum of 'val' is 1, then group by 'indv.' and summarise 'key' by pasting the elements together.

library(dplyr)
library(tidyr)
gather(df1, key, val, -indv.) %>%         
     group_by(key) %>% 
     filter(sum(val) == 1, val == 1) %>%
     group_by(indv.) %>% 
     summarise(key = toString(key))
# A tibble: 4 x 2
#   indv. key  
#   <chr> <chr>
#1 A     X, J 
#2 B     Y    
#3 C     W    
#4 D     T
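
The code above does not show how `df1` is built. A minimal sketch, assuming `df1` holds the individuals in a regular column called `indv.` rather than in the row names. It uses `pivot_longer()`, which the tidyr documentation recommends over the superseded `gather()`; the name `res` is illustrative:

```r
library(dplyr)
library(tidyr)

# Assumed structure of df1: individuals in a column named `indv.`
df1 <- data.frame(
  indv. = c("A", "B", "C", "D"),
  X = c(1, 0, 0, 0), Y = c(0, 1, 0, 0), Z = c(1, 1, 1, 1),
  W = c(0, 0, 1, 0), T = c(0, 0, 0, 1), J = c(1, 0, 0, 0),
  stringsAsFactors = FALSE
)

res <- df1 %>%
  # Long format: one row per (individual, column) pair
  pivot_longer(cols = -indv., names_to = "key", values_to = "val") %>%
  group_by(key) %>%
  # Keep cells equal to 1 in columns whose total is exactly 1
  filter(sum(val) == 1, val == 1) %>%
  group_by(indv.) %>%
  summarise(key = toString(key))
res
```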
