How to find max value from a column based on a clustered output column in R

Question

I have a data frame as shown below

           X       Y      Z fit.cluster
245 256882.0 4110945 426.50          20
246 256882.7 4110945 426.42          57
247 256883.9 4110945 429.30         114
248 256884.6 4110945 428.93         114
249 256885.4 4110945 429.50          98
250 256886.1 4110945 429.67          33

The data frame has four columns: X, Y, Z and the cluster assignment. X and Y are coordinates and Z is the corresponding height. I clustered all the data points into 176 clusters using k-means. Now I want the maximum Z value from each cluster. For example, for cluster 1, I need to identify the maximum Z value and also retrieve the corresponding X and Y values. How can I do that?

Please don't post images of data, they are beyond useless for copying and pasting and answering your question. – thelatemail, Mar 31 '16 at 04:36
Sorry about that. What should I do? Should I upload the dataset? – bibinwilson, Mar 31 '16 at 04:41
You could include `head(data)`, i.e. a small sample of the data. :) – Therkel, Mar 31 '16 at 04:45
Just copy and paste a few rows you have shown in your screenshot as text, or even better, just do `dput(head(datasetname))` and paste the result here – thelatemail, Mar 31 '16 at 04:46
structure(list(X = c(256882.03, 256882.74, 256883.91, 256884.57, 256885.37, 256886.11), Y = c(4110944.98, 4110944.96, 4110944.88, 4110944.87, 4110944.83, 4110944.81), Z = c(426.5, 426.42, 429.3, 428.93, 429.5, 429.67), fit.cluster = c(20L, 57L, 114L, 114L, 98L, 33L)), .Names = c("X", "Y", "Z", "fit.cluster"), row.names = 245:250, class = "data.frame") – bibinwilson, Mar 31 '16 at 04:48
Beware that X, Y, Z in your data have very different *scale*. k-means does not work well on such data. – Has QUIT--Anony-Mousse, Apr 02 '16 at 14:44
@Anony-Mousse What should I do? I'm trying to cluster trees. The data is LiDAR data. I classified the whole LiDAR point cloud and kept only the points for the required species. Which algorithm should I use for the clustering? – bibinwilson, Apr 05 '16 at 05:00
It's not so much a question of choosing an algorithm, but of choosing the right **preprocessing**. – Has QUIT--Anony-Mousse, Apr 05 '16 at 06:01
Not just that. *Much* more than that. These are not random numbers - you need to know what they are, and how to make them comparable. Maybe they aren't X, Y, Z in a 3D space, but pitch, yaw, distance. Then you must not treat them as Euclidean coordinates. That's why your clusters probably are all over the place. – Has QUIT--Anony-Mousse, Apr 05 '16 at 08:03

vitor · Accepted Answer · 2016-03-31T05:25:35.133

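You can use dplyr. One option is to compute the maximum Z per cluster and join the result back to the original data to recover X and Y:

library(dplyr)

# Max Z per cluster, then join back to recover the matching X and Y
data %>%
  group_by(fit.cluster) %>%
  summarise(Z = max(Z)) %>%
  inner_join(data, by = c("fit.cluster", "Z"))

Alternatively, filter within each group, which avoids the join:

# Keep only the rows whose Z equals the maximum of their cluster
data %>%
  group_by(fit.cluster) %>%
  filter(Z == max(Z))

Both versions return every tied row if several points in a cluster share the maximum Z.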

edited Mar 31 '16 at 05:25

answered Mar 31 '16 at 05:02

I was going to answer with something far less simple, this is a great way to handle this problem. But given you know the max Z within each cluster, how do you recover the X and Y associated with the Z? – Brian Albert Monroe Mar 31 '16 at 05:15
`group_by(fit.cluster) %>% slice(which.max(Z))` maybe? I don't use dplyr often but I think that might work to prevent the need to join back again. – thelatemail Mar 31 '16 at 05:25
I edited the code offering a solution that avoids joining back data. – vitor Mar 31 '16 at 05:27
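
To illustrate the `slice(which.max(Z))` suggestion from the comments, here is a minimal sketch run on the six sample rows from the `dput()` output posted under the question. The name `top_per_cluster` is introduced here for illustration only. Note that `which.max()` returns the first position of the maximum, so this variant keeps a single row per cluster even when several points tie, unlike `filter(Z == max(Z))`.

```r
library(dplyr)

# Sample rows copied from the dput() output in the question comments
data <- data.frame(
  X = c(256882.03, 256882.74, 256883.91, 256884.57, 256885.37, 256886.11),
  Y = c(4110944.98, 4110944.96, 4110944.88, 4110944.87, 4110944.83, 4110944.81),
  Z = c(426.5, 426.42, 429.3, 428.93, 429.5, 429.67),
  fit.cluster = c(20L, 57L, 114L, 114L, 98L, 33L)
)

# One row per cluster: the row holding the largest Z, with its X and Y
# (top_per_cluster is a hypothetical name, not from the original answer)
top_per_cluster <- data %>%
  group_by(fit.cluster) %>%
  slice(which.max(Z)) %>%
  ungroup()

print(top_per_cluster)
```

In this sample only cluster 114 contains more than one point, so it is the only cluster where a row is actually dropped.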
