How to remove redundant rows in a data.frame (by columns [1, 2] and vice versa)?

Question

I obtained a "dist"-class table in which samples were compared against each other to calculate an index. As a result, each value appears twice and self-comparisons occur. See the example table below:

	Sample1	Sample2	Sample3
Sample1	0	0.5	1
Sample2	0.5	0	0.8
Sample3	1	0.8	0

I already removed the self-comparisons (Sample1 vs. Sample1, etc.), but I do not know how to remove the redundant values (i.e. the upper half of the table). The desired output is a table like the one below, which I can then melt into a data.frame for plotting. Each sample also has a specific type, which I want to use in the plots.

	Sample1	Sample2
Sample1
Sample2	0.5
Sample3	1	0.8

Var1	Var2	Type1	Type2	Value
Sample1	Sample2	a	b	0.5
Sample1	Sample3	a	a	1
Sample2	Sample3	b	a	0.8

Can you share the step before your first result? You can probably modify your code to go directly to your desired output. — ktiu, Jun 22 '21 at 15:36
Hi @ktiu This is the code I used: `library(phyloseq)` `library(reshape2)` `library(dplyr)` `# calculate distances and coerce into matrix` `# this generates the first table in my starting post` `wu = phyloseq::distance(physeqObject, DistanceMeasure)` `wu.m = melt(as.matrix(wu)) %>%` `mutate_if(is.factor,as.character)` `# remove self-comparison` `wu.m = wu.m %>%` `filter(as.character(Var1) != as.character(Var2)) %>%` `mutate_if(is.factor, as.character)` — plicht, Jun 23 '21 at 07:35
`# get sample data from phyloseq object and combine with the distance matrix` `# this generates the last table from my starting post` `sd = as.matrix(physeqObject@sam_data) %>% as.data.frame() %>% select(sample, sampleType) %>% mutate_if(is.factor,as.character)` `colnames(sd) = c("Var1", "Type1")` `wu.sd = left_join(wu.m, sd, by = "Var1")` `colnames(sd) = c("Var2", "Type2")` `wu.sd = left_join(wu.sd, sd, by = "Var2")` — plicht, Jun 23 '21 at 07:38
This is all very hard to reproduce. My tip is to look into `usedist::dist_make()`, but in order to tailor it to your code, it would be necessary that you provide us with a [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example#5963610) that we can copy and paste to better understand the issue and test possible solutions. — ktiu, Jun 23 '21 at 08:21

score 0 · Answer 1 · answered Jun 25 '21 at 13:21

Thanks a lot; the pointer to usedist::dist_make() led me to the usedist package, and with usedist::dist_groups() I was able to produce the intended result.

After generating the "dist"-class matrix by calling phyloseq::distance(), I extracted the grouping variables from the phyloseq object with:

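library(phyloseq)

# Map each group level to the names of the samples that belong to it.
# `group` is assumed to hold the name of the grouping column in sample_data(physeq).
group2samp <- list()
group_list <- get_variable(sample_data(physeq), group)
for (groups in levels(group_list)) { # loop over the group levels
    target_group <- which(group_list == groups)
    group2samp[[groups]] <- sample_names(physeq)[target_group]
}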
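
As a side note, the loop above can probably be collapsed into a single base R `split()` call. The sketch below is a minimal illustration only: `sample_ids` and `group_list` are hypothetical stand-ins for `sample_names(physeq)` and `get_variable(sample_data(physeq), group)`, filled with the sample names and groups that appear in the output further down.

```r
# Hypothetical stand-ins for sample_names(physeq) and
# get_variable(sample_data(physeq), group); not taken from a real phyloseq object.
sample_ids <- c("sample8", "sample9", "sample69", "sample70")
group_list <- factor(c("Plaque", "Patch", "Patch", "nonlesional"))

# Loop version, mirroring the code above
group2samp <- list()
for (g in levels(group_list)) {
    group2samp[[g]] <- sample_ids[which(group_list == g)]
}

# split() builds the same named list: one element per factor level
group2samp_split <- split(sample_ids, group_list)
```

Both objects are named lists with one character vector of sample names per group level, so either can be passed on to `melt()`.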

Then I melted the resulting "group2samp" list and reordered the rows so that the samples are in the same order as in my distance matrix:

library(reshape2)
library(dplyr)
library(usedist)

# melt() on a named list yields the columns `value` (sample name) and `L1` (group name)
item_groups = melt(group2samp)
item_groups = arrange(item_groups, value)
# reverse the row order so that it matches the sample order of my distance matrix
item_groups = item_groups[nrow(item_groups):1, ]
item_groups = item_groups$L1 # keep only the grouping variable

# distance_matrix is the "dist" object returned by phyloseq::distance()
distances = dist_groups(distance_matrix, item_groups)

distances
     Item1    Item2      Group1      Group2                          Label   Distance
1    sample9  sample8       Patch      Plaque       Between Patch and Plaque 0.94344640
2    sample9 sample70       Patch nonlesional  Between nonlesional and Patch 0.60253312
3    sample9 sample69       Patch       Patch                   Within Patch 0.62086228
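
For the small example in the question, the redundant half can also be dropped with base R alone. This is only a sketch under stated assumptions: the matrix `m` is typed in by hand from the question's first table, and the `types` vector is a hypothetical lookup inferred from the desired output (Sample1 = a, Sample2 = b, Sample3 = a). `lower.tri()` selects the cells below the diagonal, so self-comparisons and mirrored pairs are excluded in one step.

```r
# Example matrix from the question (symmetric, zero diagonal)
m <- matrix(c(0,   0.5, 1,
              0.5, 0,   0.8,
              1,   0.8, 0),
            nrow = 3, byrow = TRUE,
            dimnames = list(paste0("Sample", 1:3), paste0("Sample", 1:3)))

# Hypothetical sample types, inferred from the desired output in the question
types <- c(Sample1 = "a", Sample2 = "b", Sample3 = "a")

# Row/column indices of the cells strictly below the diagonal
idx <- which(lower.tri(m), arr.ind = TRUE)

# One row per unique pair of samples
pairs <- data.frame(
    Var1  = colnames(m)[idx[, "col"]],
    Var2  = rownames(m)[idx[, "row"]],
    Value = m[idx],
    stringsAsFactors = FALSE
)

# Attach the type of each sample and restore the column order from the question
pairs$Type1 <- unname(types[pairs$Var1])
pairs$Type2 <- unname(types[pairs$Var2])
pairs <- pairs[, c("Var1", "Var2", "Type1", "Type2", "Value")]
```

With a real "dist" object, `as.matrix()` would first be needed to obtain `m`, and the types would come from the sample data rather than a hand-written vector.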
