3

I have two tables one with more rows than the other. I would like to filter the rows out that both tables share. I tried the solutions proposed here.

The problem, however, is that it is a large data-set and computation takes quite a while. Is there any simple solution? I know how to extract the shared rows of both tables using:

rownames(x1)->k
rownames(x)->l
which(rownames(x1)%in%l)->o

Here x1 and x are my data frames. But this only provides me with the shared rows. How can I get the unique rows of each table to then exclude them respectively? So that I can just cbind both tables together?

Community
  • 1
  • 1
Tim Heinert
  • 179
  • 3
  • 6
  • 11
  • 4
    Have you already tried to `merge()` these data frames? Also, if the data frames are large, you could use package `data.table` to perform very quick join operations. – Andrie Jul 26 '12 at 09:45
  • You can try a shell script e.g: put one on top of the other, `sort`, and then `uniq -d`. – DiscreteCircle Jul 26 '12 at 09:51
  • you already have the rownames of the shared rows, are you asking how to extract those rows? – RJ- Jul 26 '12 at 09:52
  • so i do have the row names and want to extract the row names whcih are shared between both tables. They have a unqual number of rows and columns – Tim Heinert Jul 27 '12 at 08:33

2 Answers2

2

(I edit the whole answer) You can merge both df with merge() (from Andrie's comment). Also check ?merge to know all the options you can put in as by parameter, 0 = row.names.

The code below shows an example with what could be your data frames (different number of rows and columns)

x = data.frame(a1 = c(1,1,1,1,1), a2 = c(0,1,1,0,0), a3 = c(1,0,2,0,0), row.names = c('y1','y2','y3','y4','y5'))
x1 = data.frame(a4 = c(1,1,1,1), a5 = c(0,1,0,0), row.names = c('y1','y3','y4','y5'))

Provided that row names can be used as identifier then we put them as a new column to merge by columns:

x$id <- row.names(x)
x1$id <- row.names(x1)

# merge by column names
merge(x, x1, by = intersect(names(x), names(x1)))

# result
#   id a1 a2 a3 a4 a5
# 1 y1  1  0  1  1  0
# 2 y3  1  1  2  1  1
# 3 y4  1  0  0  1  0
# 4 y5  1  0  0  1  0

I hope this solves the problem.

EDIT: Ok, now I feel silly. If ALL columns have different names in both data frames then you don't need to put the row name as another column. Just use:

merge(x,x1, by=0)
julia
  • 152
  • 1
  • 7
  • hey, thansk for the reply. The solution sounds really simple but...what do you exactly mean with the EDIT post? The two data frames have differing number of columns. So why are they multiplied? So do you mean that if x1 hat 5 and x had 4 columns the new merged table would not haev 9 but 18? – Tim Heinert Jul 27 '12 at 08:41
  • is there any way to preserve the initial identifier column when using merge? So my first column in the initial table is the one with the rownames but as soon as i merged the two table only numbers from 1 to n are in the first colum and the identifiers are in the second column – Tim Heinert Jul 27 '12 at 11:57
  • I guess if instead of merging by `row.names` using some kind of id as column instead of row name (see edit2) – julia Jul 30 '12 at 12:14
  • Ok, I also understood the columns were shared, now it's fixed with an example of data frames that share some rows but have different column names (it would work also for shared columns, but make sure same column name and same row name have the same entry for both df) – julia Jul 30 '12 at 12:49
  • This code does not produce the correct output. Reason: dataframe **'x' has 5 rows** while **'x1' has only 4 rows** - resulting in the **FINAL** dataframe only having **4 rows**. This means that the two dataframes were not merged correctly. The correct answer is: `merge(x, x1, by=intersect(names(x), names(x1)), all=T)` as this combines all the rows from both dataframes into a single dataframe. Then you only have to decide what to do with all the 'NA' values - e.g. convert them to zeros etc. – SilSur May 26 '17 at 10:10
0

If you only want the rows which are not repeated from each data set:

rownames(x1)->k
rownames(x)->l
which(k%in%l) -> o
x1.uniq <- x1[k[k != o],];
x.uniq <- x[l[l != o],];

And then you can join them with rbind:

x2 <- rbind(x1.uniq,x.uniq);

If you also wanted the repeated rows you can add them:

x.repeated <- x1[o];
x2 <- rbind(x2,x.repeated);
DiscreteCircle
  • 684
  • 5
  • 7
  • hey..your answer is really nice. There is one problem, however, when I enter x1.uniq <- x1[k[k != o],]; x.uniq <- x[l[l != o],]; there is a warning "in k!=0 : longer object length is not a multiple of shorter object length. And when I enter the x2 <- rbind(x1.uniq,x.uniq); it puts out an error (number of columns of matrices must match) – Tim Heinert Jul 27 '12 at 08:30
  • I was assuming that you had two data sets with the same column attributes but with overlapping row populations. Can you clarify what your data sets look like? – DiscreteCircle Jul 27 '12 at 12:38
  • so the row number of the data sets is different but the contain some overlaps - say data set 1 has 50 rows and 2 has 40 rows and 30 of them are shared. Data set one has 4 colums and data set 2 has 10 columns – Tim Heinert Jul 27 '12 at 20:37