
I have a file like this in R.

**0   1**
0   2
**0   3**
0   4
0   5
0   6
0   7
0   8
0   9
0   10
**1   0**
1   11
1   12
1   13
1   14
1   15
1   16
1   17
1   18
1   19
**3   0**

As we can see, this file contains repeated unordered pairs (the rows marked above), such as

 1  0 

and

 0  1

I wish to collapse these duplicate pairs into a single row, count how many times each unordered pair occurs, and append that count as a third column next to the row that is repeated. If a pair is not repeated, then 1 should be written in the third column.

For example (a sample of the output file):

0   1    2
0   2    1
0   3    2
0   4    1 
0   5    1
0   6    1
0   7    1
0   8    1
0   9    1
0   10   1
1   11   1
1   12   1
1   13   1
1   14   1
1   15   1
1   16   1
1   17   1
1   18   1
1   19   1

How can I achieve it in R?

user3797829

2 Answers


Here is a way using `transform` with `pmin` and `pmax` to put each pair into a consistent order, and then `aggregate` to provide a count:

# data
x <- data.frame(a = c(rep(0, 10), rep(1, 10), 3), b = c(1:10, 0, 11:19, 0))

# logic: pmin/pmax put each pair into min-max order (transform evaluates both
# expressions against the original columns), and count = 1 gives aggregate a column to sum
aggregate(count ~ a + b, transform(x, a = pmin(a, b), b = pmax(a, b), count = 1), sum)
   a  b count
1  0  1     2
2  0  2     1
3  0  3     2
4  0  4     1
5  0  5     1
6  0  6     1
7  0  7     1
8  0  8     1
9  0  9     1
10 0 10     1
11 1 11     1
12 1 12     1
13 1 13     1
14 1 14     1
15 1 15     1
16 1 16     1
17 1 17     1
18 1 18     1
19 1 19     1
James
  • Give it some "data.table" power with a slight rewrite: `DT[, list(a = pmin(a, b), b = pmax(a, b), count = 1)][, .N, by = list(a, b)]` :-) – A5C1D2H2I1M1N2O1R2T1 Jul 14 '14 at 19:02
  • @AnandaMahto no need for that intermediate step: `DT[, .N, by = list(a = pmin(a, b), b = pmax(a, b))]` – eddi Jul 14 '14 at 19:06
  • @eddi, even nicer :-) – A5C1D2H2I1M1N2O1R2T1 Jul 14 '14 at 19:07
  • @AnandaMahto Nice. I struggled with a `dplyr` solution, as it seems there is a conceptual difference with it I didn't understand. I think I have it now. – James Jul 14 '14 at 19:07
  • This is a cool solution, but it's very special for the two variables case. On the other hand, Ananda's answer does it more generally but does so at the cost of converting to a matrix, which is not great. – eddi Jul 14 '14 at 19:09
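
To actually run the `data.table` one-liner from the comments, here is a minimal sketch; `DT` is simply the sample data from the answer rebuilt as a data.table:

library(data.table)

# the same sample data as above, built directly as a data.table
DT <- data.table(a = c(rep(0, 10), rep(1, 10), 3),
                 b = c(1:10, 0, 11:19, 0))

# group by the row-wise min/max of each pair; .N is the number of rows in each group
DT[, .N, by = list(a = pmin(a, b), b = pmax(a, b))]

Note that the count column comes out named `N` rather than `count` here.

On the `dplyr` point James mentions: in `mutate()`/`transmute()` each expression sees the results of the earlier ones, so `b = pmax(a, b)` would use the already-reordered `a`, unlike `transform()`. One way around it, sketched here with the same `x` as above (and not necessarily the solution James had in mind), is to give the reordered pair fresh names:

library(dplyr)

x %>%
  transmute(lo = pmin(a, b), hi = pmax(a, b)) %>%  # new names, so both expressions still see the original a and b
  group_by(lo, hi) %>%
  summarise(count = n())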

Here's one approach:

First, create a vector in which each row's values are sorted and then pasted together (`mydf` here is the two-column data from the question, e.g. as read in with `read.table`, which gives the default `V1`/`V2` column names seen in the output below).

x <- apply(mydf, 1, function(x) paste(sort(x), collapse = " "))

Then, use `ave` to create the counts you are looking for.

mydf$count <- ave(x, x, FUN = length)

Finally, you can use the "x" vector again, this time to detect and remove duplicated values.

mydf[!duplicated(x), ]
#    V1 V2 count
# 1   0  1     2
# 2   0  2     1
# 3   0  3     2
# 4   0  4     1
# 5   0  5     1
# 6   0  6     1
# 7   0  7     1
# 8   0  8     1
# 9   0  9     1
# 10  0 10     1
# 12  1 11     1
# 13  1 12     1
# 14  1 13     1
# 15  1 14     1
# 16  1 15     1
# 17  1 16     1
# 18  1 17     1
# 19  1 18     1
# 20  1 19     1
A5C1D2H2I1M1N2O1R2T1
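
Whichever approach is used, the result can be written back out as a plain whitespace-separated file with base R's `write.table`. A minimal sketch, where `res` holds the aggregated result from the first answer and `pairs_counted.txt` is a placeholder file name:

res <- aggregate(count ~ a + b, transform(x, a = pmin(a, b), b = pmax(a, b), count = 1), sum)

# quote, row.names and col.names switched off so the file mirrors the input layout plus the count column
write.table(res, "pairs_counted.txt", quote = FALSE, row.names = FALSE, col.names = FALSE)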