Consolidate duplicate rows into one by applying a formula

Question

In R, I want to consolidate rows so that data points with the same x,y coordinates are merged by a formula into a single row representing the combined area values. (These are multi-stemmed trees: several stems of the same plant, to be represented by one combined diameter or cross-sectional area.) So in this simple example of a data frame:

library(tibble)
library(ggplot2)

x <- c(6, 6, 6, 2, 2, 3, 4, 4, 7, 8)
y <- c(6, 6, 6, 4, 3, 7, 4, 6, 6, 10)
diam <- c(12, 9, 7, 16, 19, 4, 7, 8, 9, 3)
forest <- tibble(x, y, diam)
ggplot(data = forest) +
  geom_point(mapping = aes(x = x, y = y, size = diam))

What I want to do is isolate the rows with duplicate x,y values and reduce each set to a single row representing the combined diameters, using something like the mean but a bit more complicated (I can fill that in later).

I have read and studied all the posts here about removing duplicates, but I don't want to do that; I want to consolidate them, leaving a single row with a representative diameter or circular area for combined stems of the same plant.

It's easier to help you if you include a simple reproducible example with sample input and desired output that can be used to test and verify possible solutions. – TarJae, Sep 19 '21 at 11:08
Code should be minimal, to the point, and reproducible, which implies in particular that all library statements should be included. Also, it is not clear what you are asking. The text of the question refers to consolidating rows, but the code includes a plot, which seems irrelevant. Or did you mean that the only reason you want this is to create a plot with one point for each unique pair of x and y coordinates? – G. Grothendieck, Sep 19 '21 at 11:47

Allan Cameron · Accepted Answer · 2021-09-19T13:15:33.250

If I am interpreting your question correctly, each row in your current data frame represents a measurement of diam at a particular location. Each unique location is defined by its x, y values, but some locations have multiple rows in your data frame, representing multiple measurements at the same site. You would like to summarize the diam values at each unique location by taking that location's vector of diam measurements and applying some function that returns a single value (such as sum or mean).

You can do this very easily with the dplyr package. You can group_by each unique location then summarize all the values of diam at each x, y location.

In the following example, I have used a simple sum of all the diameters, but you could change this to any function that takes a numeric vector as input and gives a single numeric output (such as max, mean, median, etc.):

library(dplyr)
library(ggplot2)

forest %>% 
  group_by(x, y) %>% 
  summarize(diam = sum(diam)) %>%
  ggplot() +
  geom_point(aes(x, y, size = diam))
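
To inspect the consolidated rows before plotting, the same grouping can be run on its own. This is a minimal sketch, assuming the example data from the question; the n_stems column and the choice of mean() are illustrative and not part of the answer's code:

```r
library(dplyr)

# Example data from the question
forest <- data.frame(
  x    = c(6, 6, 6, 2, 2, 3, 4, 4, 7, 8),
  y    = c(6, 6, 6, 4, 3, 7, 4, 6, 6, 10),
  diam = c(12, 9, 7, 16, 19, 4, 7, 8, 9, 3)
)

# One row per unique (x, y); n_stems is a hypothetical helper column
# that counts how many rows were merged at each location
consolidated <- forest %>%
  group_by(x, y) %>%
  summarize(n_stems = n(), diam = mean(diam), .groups = "drop")

print(consolidated)
```

The ten input rows collapse to eight, and only the location x = 6, y = 6 has n_stems greater than 1.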

EDIT

The function for finding a single equivalent diameter from several individual diameters would be:

sum_diams <- function(x) 2 * sqrt(sum((x / 2)^2))
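
This follows from adding the stem cross-sections: each stem has area pi * (d / 2)^2, and a single stem with the same total area has diameter 2 * sqrt(sum((d / 2)^2)), because the factor pi cancels. A short check on the three stems at x = 6, y = 6, written as a sketch (the intermediate variable names are illustrative):

```r
sum_diams <- function(x) 2 * sqrt(sum((x / 2)^2))

stems <- c(12, 9, 7)                     # diameters at x = 6, y = 6

areas      <- pi * (stems / 2)^2         # cross-sectional area of each stem
total_area <- sum(areas)                 # combined basal area
equiv_diam <- 2 * sqrt(total_area / pi)  # diameter of one stem with that area

equiv_diam
sum_diams(stems)                         # same value, about 16.55
```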

So your code would become:

library(dplyr)
library(ggplot2)

sum_diams <- function(x) 2 * sqrt(sum((x / 2)^2))

forest %>% 
  group_by(x, y) %>% 
  summarize(diam = sum_diams(diam)) %>%
  ggplot() +
  geom_point(aes(x, y, size = diam))

FURTHER EDIT

To store the modified data frame, you can do:

new_forest <- forest %>% 
  group_by(x, y) %>% 
  summarize(diam = sum_diams(diam), .groups = "drop")  # drop the leftover grouping by x

If you want to plot it, you can do:

ggplot(new_forest) +
  geom_point(aes(x, y, size = diam))

And if you want to analyze it further, your data frame new_forest is still in memory.

edited Sep 19 '21 at 13:15

answered Sep 19 '21 at 11:31

Allan Cameron


This is very helpful indeed, Allan, thank you. Your assumptions are correct. So what I really really want to do is take the duplicates and apply a formula that essentially takes all the diameter values at each x,y point and turns them into a single combined (basal) area that would represent a hypothetical single-stem tree at each point. Hope that makes sense. So: diameter/2 = r. Then pi * r^2. Then sum the areas. Then reverse back to a single representative diameter. These values would then sit nicely as single rows with the non-duplicated x,y rows. Sorry if I'm asking too much! – David Cracknell Sep 19 '21 at 12:44
@Newshound68 see my update. – Allan Cameron Sep 19 '21 at 12:54
Legend, thanks so much Allan – David Cracknell Sep 19 '21 at 12:56
Final ask @Allan Cameron: as well as creating the plot, can you add a line to show me how to save this data frame with the new unique rows with the calculated representative diameters for further statistical analysis. Thanks in advance – David Cracknell Sep 19 '21 at 13:09
@Newshound68 sure, just store the data frame before feeding it into ggplot as per my newest update. – Allan Cameron Sep 19 '21 at 13:16
Thanks Allan. Hope it's not raining too much in Glasgow today. I will mark this answered. I really hope this question is not deleted once I do that like last time. There are lots of "newbies" out there who will find this highly useful. Thanks again – David Cracknell Sep 19 '21 at 13:25
@Newshound68 It will be more helpful to future visitors if you edit your new requirements into your question in case these comments are lost. – Ian Campbell Sep 19 '21 at 15:30
@AllanCameron This example works great for three columns. What do I do if I have up to 15 other columns in the data frame? Once I've consolidated the x,y duplicates I obviously have fewer rows and now I cannot cbind() with the rest of the data. Is there a way of carrying *all* the columns with the process of consolidating duplicate x,y as above? – David Cracknell Sep 25 '21 at 13:10
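
For the follow-up about extra columns, one possible approach (a sketch, not taken from the answer above) is to summarize the remaining columns inside the same summarize() call. It assumes the other columns hold the same value for every stem of a plant, so keeping the first value per group loses nothing. The columns species and plot_id are hypothetical stand-ins for the extra columns:

```r
library(dplyr)

sum_diams <- function(x) 2 * sqrt(sum((x / 2)^2))

# Hypothetical data: species and plot_id stand in for the extra columns
forest_wide <- data.frame(
  x       = c(6, 6, 6, 2, 2),
  y       = c(6, 6, 6, 4, 3),
  diam    = c(12, 9, 7, 16, 19),
  species = c("oak", "oak", "oak", "ash", "birch"),
  plot_id = c(1, 1, 1, 2, 2)
)

new_forest_wide <- forest_wide %>%
  group_by(x, y) %>%
  summarize(
    across(-diam, first),    # keep the first value of every other column
    diam = sum_diams(diam),  # consolidate the diameters
    .groups = "drop"
  )

print(new_forest_wide)
```

If a column can differ between stems of the same plant, it needs its own summary function in place of first().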

jay.sf · Answer 2 · 2021-09-19T14:06:55.967

Here is a way using base pipes. sum_diams() borrowed with thanks from @Allan Cameron's answer.

For the legend I use a small helper function mk().

mk <- \(x, f=5) {o <- unique(round(min(x):max(x)/f))*f;o[o > 0]}
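
Read step by step, mk() takes the integers from min(x) to max(x), rounds each to the nearest multiple of f, and keeps the unique positive values, which then serve as legend entries. A quick check with the example diameters from the question (the \(x) shorthand needs R 4.1 or later; the name breaks is illustrative):

```r
mk <- \(x, f = 5) {o <- unique(round(min(x):max(x) / f)) * f; o[o > 0]}

diam <- c(12, 9, 7, 16, 19, 4, 7, 8, 9, 3)

breaks <- mk(diam)  # multiples of 5 covering the range 3 to 19
breaks
```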

forest |>
  with(aggregate(list(diam=diam), list(x=x, y=y), FUN=sum_diams)) |>
  {\(x) new_forest <<- x}() |>
  with(plot(x, y, pch=20, cex=diam/6, main="Forest")) |>
  with(legend('topleft', legend=mk(diam), title='diam', pch=20, pt.cex=mk(diam)/6))

new_forest is stored midway through the pipeline.

new_forest
#   x  y     diam
# 1 2  3 19.00000
# 2 2  4 16.00000
# 3 4  4  7.00000
# 4 4  6  8.00000
# 5 6  6 16.55295
# 6 7  6  9.00000
# 7 3  7  4.00000
# 8 8 10  3.00000

Note: new_forest will be overwritten if it already exists.
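
The aggregation step can also be run on its own, without the plot or the <<- assignment. A minimal sketch using the formula interface of aggregate(), assuming the example data from the question:

```r
sum_diams <- function(x) 2 * sqrt(sum((x / 2)^2))

# Example data from the question
forest <- data.frame(
  x    = c(6, 6, 6, 2, 2, 3, 4, 4, 7, 8),
  y    = c(6, 6, 6, 4, 3, 7, 4, 6, 6, 10),
  diam = c(12, 9, 7, 16, 19, 4, 7, 8, 9, 3)
)

# One row per unique (x, y), with diam consolidated by sum_diams()
new_forest <- aggregate(diam ~ x + y, data = forest, FUN = sum_diams)
new_forest
```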
