How to remove rows from a table that exist in another table?

Question

I have a table of the best tennis players by ranking over the last 30 years. An example is below (height is not important for my question):

> head(mostSuccesfulPlayers)
    player_id    player_name player_ht
1      105453  Kei Nishikori       178
31     104925 Novak Djokovic       188
59     104731 Kevin Anderson       203
152    104745   Rafael Nadal       185
164    105227    Marin Cilic       198
172    103819  Roger Federer       185

I have another table which consists of basically all of the players in the data set, for example:

> head(allPlayers)
  player_id     player_name player_ht
1    100284   Jimmy Connors       178
2    100431 Mansour Bahrami       178
3    100529    Kevin Curren       185
4    100532     Johan Kriek       175
5    100553    Nduka Odizor       183
6    100581    John McEnroe       180

I want to create a table notSoSuccesfulPlayers containing only the players from allPlayers that are not listed in mostSuccesfulPlayers. In set-theory notation, that would be allPlayers \ mostSuccesfulPlayers.

How can I do that? Huge thanks for any help in advance!

That sounds like an anti-join. Your two example tables don't have any overlapping players, though, so any suggestions people give based on those won't actually capture the task — camille, Dec 29 '21 at 16:12
@camille I just now realised that the heads of tables I displayed here contain all different players, but there were 110 players which were included in the `mostSuccessfulPlayers` table, and also in the `allPlayers` table. Thank you for pointing that out! :) — Iva, Dec 29 '21 at 16:49

score 1 · Answer 1 · answered Dec 29 '21 at 16:10

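The easiest way is the function setdiff. Note that base R's setdiff does not compare data frames row by row; the row-wise data frame method comes from dplyr, so load that package first:

library(dplyr)

df = iris
df2 = df[1:100,]
setdiff(df, df2)  # rows of df that do not appear in df2 (duplicate rows are dropped)

In your case:

notSoSuccesfulPlayers = setdiff(allPlayers, mostSuccesfulPlayers)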
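
For readers who want to see a whole-row difference on data that actually overlaps, here is a minimal base R sketch. The toy tables are made up for illustration (only a few ids, names and heights are borrowed from the question's output), and `row_key` is a hypothetical helper, not part of any package. It assumes no column value contains the separator character.

```r
# Toy data: four players, two of whom also appear in the "successful" table
allPlayers <- data.frame(
  player_id   = c(100284, 100581, 104925, 103819),
  player_name = c("Jimmy Connors", "John McEnroe",
                  "Novak Djokovic", "Roger Federer"),
  player_ht   = c(178, 180, 188, 185)
)
mostSuccesfulPlayers <- allPlayers[3:4, ]

# Hypothetical helper: collapse every row into a single string key
# so that whole rows can be compared with %in%
row_key <- function(d) do.call(paste, c(unname(as.list(d)), sep = "\r"))

# Keep only the rows of allPlayers whose key is absent from the other table
notSoSuccesfulPlayers <-
  allPlayers[!row_key(allPlayers) %in% row_key(mostSuccesfulPlayers), ]
```

Because every column takes part in the comparison, a player is removed only if the entire row matches.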

answered Dec 29 '21 at 16:10

Marco_CH


score 1 · Accepted Answer · answered Dec 29 '21 at 16:14

An alternative to the canonical setdiff:

subset(allPlayers, !player_id %in% mostSuccesfulPlayers$player_id)

Note: this works well when a single column uniquely identifies each row; for keys that span multiple columns, use dplyr::setdiff or dplyr::anti_join instead.
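
As a rough illustration of the key-based approach, the sketch below uses made-up toy tables (not the asker's real data) in which one height is deliberately recorded differently in the two tables. Matching on `player_id` alone still removes that player, whereas a whole-row comparison would keep him.

```r
# Toy data, made up for illustration
allPlayers <- data.frame(
  player_id   = c(100284, 100581, 104925, 103819),
  player_name = c("Jimmy Connors", "John McEnroe",
                  "Novak Djokovic", "Roger Federer"),
  player_ht   = c(178, 180, 188, 185)
)

# Same two top players, but one height differs on purpose (186 vs 185)
mostSuccesfulPlayers <- data.frame(
  player_id   = c(104925, 103819),
  player_name = c("Novak Djokovic", "Roger Federer"),
  player_ht   = c(188, 186)
)

# Drop every player whose id appears in the other table
by_id <- subset(allPlayers, !player_id %in% mostSuccesfulPlayers$player_id)

# With dplyr installed, the equivalent call would be:
# dplyr::anti_join(allPlayers, mostSuccesfulPlayers, by = "player_id")
```

Matching on the key is usually the safer choice when the non-key columns may have been cleaned or recorded differently in the two tables.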

answered Dec 29 '21 at 16:14

r2evans


Thank you! This worked as it was possible to distinguish rows by `player_id`. – Iva Dec 29 '21 at 16:47
