
I have a data frame of almost 200,000 observations. The important columns look like this (note: the Min Value column is not filled by default):

Key        Value      Min Value
1xA         1           1
2xA         2           2
3xB         3           2 
1xB         1           1
1xA         5           1 
2xB         2           2
3xB         2           2
2xA         4           2

How can I achieve this, i.e. fill the Min Value column, in R? I tried looping over the data frame and, for each observation, taking the minimum of a subset built for its key. That worked fine, but it takes far too long to iterate over 200,000 observations.

My thoughts: is there any way to filter with which() for every observation in one go, so that I can find the minimum then and there?
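For reference, the slow per-row approach was roughly along these lines (a minimal sketch, assuming the data frame is called df; not the exact code):

# Hypothetical reconstruction of the slow row-by-row approach described above
df$Min.Value <- NA
for (i in seq_len(nrow(df))) {
  idx <- which(df$Key == df$Key[i])      # all rows sharing this row's key
  df$Min.Value[i] <- min(df$Value[idx])  # minimum Value within that subset
}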

d1r3w0lF

2 Answers


Using dplyr, this is not too complicated:

df
#    Key Value
# 1 1xA     1
# 2 2xA     2
# 3 3xB     3
# 4 1xB     1
# 5 1xA     5
# 6 2xB     2
# 7 3xB     2
# 8 2xA     4

df %>% 
    dplyr::group_by(Key) %>% 
    dplyr::mutate(Min.Value = min(Value))
# A tibble: 8 x 3
# Groups:   Key [5]
#   Key   Value Min.Value
#   <fct> <int>     <dbl>
# 1 1xA       1         1
# 2 2xA       2         2
# 3 3xB       3         2
# 4 1xB       1         1
# 5 1xA       5         1
# 6 2xB       2         2
# 7 3xB       2         2
# 8 2xA       4         2
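Note that the result is still grouped by Key. If you want a plain, ungrouped data frame back (for example, to overwrite the original), one way, sketched under the same setup, is to pipe into dplyr::ungroup():

df <- df %>% 
    dplyr::group_by(Key) %>% 
    dplyr::mutate(Min.Value = min(Value)) %>% 
    dplyr::ungroup()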
C. Braun

With base R functions only:

merge(df, aggregate(list(Min.Value=df$Value), by=list(Key=df$Key), FUN=min), by="Key")

The aggregate() function takes the Value column (the first argument), groups it by the values supplied via the by argument (which must be a list of vectors of the same length as the first argument), and applies the FUN function (here min) to each group. The first argument is wrapped in a list just to rename the resulting column (if you pass plain df$Value instead, the resulting column will be named x).

The outer merge() joins rows of the two supplied data frames that have the same value in the Key column (much like a database join operation).
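To make the intermediate step concrete, the aggregate() call alone produces a small per-key lookup table on the sample data (a sketch of the expected output):

aggregate(list(Min.Value=df$Value), by=list(Key=df$Key), FUN=min)
#   Key Min.Value
# 1 1xA         1
# 2 1xB         1
# 3 2xA         2
# 4 2xB         2
# 5 3xB         2

merge() then joins this lookup table back onto every row of df by Key.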

Jaromír Adamec
  • The basic problem with using merge is that it works in an iterative way, hence it is quite slow; a 200,000+ row data set will take me ages. – d1r3w0lF Jun 01 '18 at 19:09