Identify duplicated rows based on two columns and reassign

Question

So I have the following data frame with columns "Longitude", "Latitude" and "Prob", for probability. Prob is always either 0, 1 or 2 (note you can't see a row with Prob=2 in the example below).

    Longitude  Latitude Prob
1    28.00403 -26.50431    1
2    28.01236 -26.50431    1
3    28.02069 -26.50431    1
4    28.02903 -26.50431    0
5    28.03736 -26.50431    0
6    28.04569 -26.50431    0
7    28.05403 -26.50431    0
8    28.06236 -26.50431    1
9    28.07069 -26.50431    1

What I want to do is identify rows that have the same Longitude and Latitude values. If, for example, rows 5 and 7 show the same Longitude and Latitude values, and row 5 has Prob 1 and row 7 has probability 2, I want to remove one of those rows and make Prob equal 3 for the row that remains.

I.e., I want to select duplicated rows based on two columns, Longitude and Latitude. If two rows have the same longitude and latitude values, and if they show Prob 1 and 2, I want to "merge" these rows and make Prob 3.

I hope this makes sense. Thanks.

Welcome to SO. Please paste data into the questions using the output of `dput(your_dataframe)` this makes it easier to help you by making the data available with a simple copy. Unlike the current format. [Link for guidance about asking questions] (https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) — Peter, Mar 25 '22 at 11:44

score 2 · Answer 1 · answered Mar 25 '22 at 12:05

2

If I understood you correctly, you just want to group by Long, Lat and take the sum of Prob.

df %>%
  group_by(Longitude, Latitude) %>%
  summarise(Prob = sum(Prob))

answered Mar 25 '22 at 12:05

Merijn van Tilborg

5,452
1
7
22

Claudio Paladini · Answer 2 · 2022-03-25T11:52:53.510

You should try to solve the problem and post your challenges here, rather than just the starting question. This page might also help.

Anyway, something like this might work:

#fake data
df = data.frame(
  x = c(1,2,3,3,5),
  y = c(2,3,3,3,4),
  prob = rep('a',5)
)

# find duplicates
index_both = duplicated(df) | duplicated(df, fromLast=TRUE)
index_drop = duplicated(df)

# assign new value
df[index_both,'prob'] = 'b'

# drop duplicate
df = df[!index_drop,]

score 0 · Answer 3 · answered Mar 26 '22 at 04:29

Interesting use case!! (difficult to understand in first reading)

If you want to see only "duplicated" rows , dplyr's n() in summarise gives count of every grouped variable(s) and you can filter any with more than 1 value:

df %>%
  group_by(Longitude, Latitude) %>%
  summarise(N=n())%>%filter(N>1)

And if you want the summation of Prob column , then as suggested by Merijn van Tilborg simply sum(Prob).

But if u want what's mentioned in last line of your post, then slightly more work needed->

your question has more than just appears (slightly ambiguous too). With "AND of only to merge where Prob is 1 and 2" "If two rows have the same longitude and latitude values, and if they show Prob 1 and 2, I want to "merge" these rows and make Prob 3."

then we need to combine the two: some data similar to yours:

dx <- data.table(
  long = c('a','b','d1','c','a','d1','e','f','a')
  ,lat = c('a','b1','d1','c','a','d1','e','f','g')
  ,Prob = c(1,0,2,1,0,1,0)

9 records

  long lat Prob
1:    a   a    1
2:    b  b1    0
3:   d1  d1    2
4:    c   c    1
5:    a   a    0
6:   d1  d1    1
7:    e   e    0
8:    f   f    1
9:    a   g    0

Now in 9 records 1st and 5th are same but different Probs; Not 1 AND 2 but 3rd and 6th have 1 AND 2, which can be filtered as !=0; So you want to merger only 3rd and 6th BUT not 1st and 5th and hence output shud have 8 records ! 1st approach : with error (if you merge commons-> then it would be 7 with simple sum of Prob!=0 like this)

dx%>%
  group_by(long,lat)%>%
  summarise(N=n()
            ,Probs = sum(Prob[Prob!=0])
            )

yields 7 records merging 1st and 5th (a,a):

  long  lat       N Probs
  <chr> <chr> <int> <dbl>
1 a     a         2     1
2 a     g         1     0
3 b     b1        1     0
4 c     c         1     1
5 d1    d1        2     3
6 e     e         1     0
7 f     f         1     1

But if you dont want to do that: then grouping may not be required. (Generally that's first thing comes to mind for dplyr users like me)

2nd idea (it works) :

get the data with Prob==3 first and then add it to original data after removing the eligible rows.

p3 <- dx%>%
  group_by(long,lat)%>%
  summarise(Prob = sum(Prob[Prob!=0])
            )%>%filter(Prob==3)

yields

  long  lat    Prob
  <chr> <chr> <dbl>
1 d1    d1        3

now your desired output :

dx%>%
  arrange(long)%>%
  filter(!(long %in% p3$long & lat %in% p3$lat))%>%
  bind_rows(p3%>%select(long,lat,Prob))

as

   long lat Prob
1:    a   a    1
2:    a   a    0
3:    a   g    0
4:    b  b1    0
5:    c   c    1
6:    e   e    0
7:    f   f    1
8:   d1  d1    3

please try and let me know if it works.

Identify duplicated rows based on two columns and reassign

3 Answers3

But if u want what's mentioned in last line of your post, then slightly more work needed->

2nd idea (it works) :