1

I have a dataframe that consists of 3 cols:

tweet_id | response_id | time      
1           2            22:10:47
2           NA           22:10:13
3           1            22:08:27 
4           3            21:54:49
5           4            21:49:35
6           5            21:46:23
6           7            21:46:23
8           9            21:30:45
8           6            21:30:45
8           10           21:30:45

I want to compare tweet_id to response_id - but I want to iterate through all of tweet_id as a reference and see if response_id value is present in tweet_id. IF NOT - remove the row.

You'll notice duplicates in tweet_id and time - ideally they would be unique based on this logic.

Tried filter(tweet_id != response_tweet_id) but did not work.

Desired output:

tweet_id | response_id | time      
1           2            22:10:47        
3           1            22:08:27 
4           3            21:54:49
5           4            21:49:35
6           5            21:46:23
8           6            21:30:45
Dinho
  • 704
  • 4
  • 15
  • @RonakShah The logic I want to implement is because 6 in response_id is present in tweet_id - we selected that row - instead of response_id '9' and '10' – Dinho Sep 18 '21 at 03:42
  • @RonakShah Im sorry - I was missing a row - updated question – Dinho Sep 18 '21 at 03:45

3 Answers3

1

Appears that you want only the rows within groupings of tweet_id that have the lowest values of response_id. (I was reasoning off the desired output rather than your description which I couldn't follow.)

dat=dat[ order(dat$tweet_id, dat$response_id), ]  $sort by both id cols
dat[ !duplicated(dat[1]),  ]   # pick the first tweet_id in each group of tweet_ids because its the NOT-duplicated one.
  tweet_id response_id     time
1        1           2 22:10:47
2        3           1 22:08:27
3        4           3 21:54:49
4        5           4 21:49:35
5        6           5 21:46:23
8        8           6 21:30:45
IRTFM
  • 258,963
  • 21
  • 364
  • 487
1

You may use %in% to select rows where response_id is present in tweet_id.

subset(df, response_id %in% unique(tweet_id))

#  tweet_id response_id     time
#1        1           2 22:10:47
#2        3           1 22:08:27
#3        4           3 21:54:49
#4        5           4 21:49:35
#5        6           5 21:46:23
#6        8           6 21:30:45

If you want to use dplyr

library(dplyr)
df %>% filter(response_id %in% unique(tweet_id))
Ronak Shah
  • 377,200
  • 20
  • 156
  • 213
1

Using data.table.

library(data.table)

data <- 
fread('tweet_id response_id time      
1           2            22:10:47
2           NA           22:10:13
3           1            22:08:27 
4           3            21:54:49
5           4            21:49:35
6           5            21:46:23
6           7            21:46:23
8           9            21:30:45
8           6            21:30:45
8           10           21:30:45') 

data[response_id %in% unique(tweet_id)] 
#>    tweet_id response_id     time
#> 1:        1           2 22:10:47
#> 2:        3           1 22:08:27
#> 3:        4           3 21:54:49
#> 4:        5           4 21:49:35
#> 5:        6           5 21:46:23
#> 6:        8           6 21:30:45

Created on 2021-09-18 by the reprex package (v2.0.0)

jpdugo17
  • 6,816
  • 2
  • 11
  • 23