looping based on same values in one column and storing the results in a new dataframe

Question

To explain my problem, I have created the following df:

hh_01 <- c(rep(1:4, each = 3), rep(5:10, each = 5))
vill <- c(rep(100, 12), rep(101, 30))
hh_02 <- c(2:4, 1, 3, 4, 1:2, 4, 1:3, 6:10, 5, 7:10, 5:6, 8:10, 5:7, 9:10, 5:8, 10, 5:9)
set.seed(1); dist <- abs(rnorm(42, mean = 0, sd = 1000))
df <- matrix(c(hh_01, vill, hh_02, dist), nrow = 42, ncol = 4)
colnames(df) <- c("hh_01", "vill", "hh_02", "dist")
df <- as.data.frame(df)
df
   hh_01 vill hh_02       dist
1      1  100     2 1728.39791
2      1  100     3  979.05280
3      1  100     4  972.09301
4      2  100     1  461.72457
5      2  100     3  384.84236
6      2  100     4  523.10665
7      3  100     1  482.88891
8      3  100     2  218.27501
9      3  100     4  878.32424
10     4  100     1   41.75679
11     4  100     2  967.72103
12     4  100     3  661.80881
13     5  101     6  851.74364
14     5  101     7  852.48595
15     5  101     8  471.51824
16     5  101     9  862.90742
17     5  101    10  750.57410
18     6  101     5 1714.03797
19     6  101     7   93.43975
20     6  101     8  640.15912
21     6  101     9  601.66437
22     6  101    10  969.44271
23     7  101     5   77.95871
24     7  101     6  604.71114
25     7  101     8  169.18386
26     7  101     9  435.42663
27     7  101    10  604.22278
28     8  101     5  475.18935
29     8  101     6   13.09895
30     8  101     7 2873.04565
31     8  101     9 1019.03810
32     8  101    10   41.51445
33     9  101     5  914.63453
34     9  101     6   67.62432
35     9  101     7   85.45653
36     9  101     8  971.21044
37     9  101    10 2074.87280
38    10  101     5   98.43913
39    10  101     6  437.63773
40    10  101     7  620.47573
41    10  101     8  376.56226
42    10  101     9 1013.93106

My task: for all hh_01 with the same value calculate the mean of dist and save the result in a new df with the following structure:

hh_01  vill  mean_dist
1      100   1226.515
2      100   .......

I know I have to use a for loop (or maybe sapply/lapply as an alternative), but I don't know how to finish this command:

for (i in seq(along=df[,df$hh_01])){
  ifelse(df$hh_01[i] == df$hh_01[i+1])
}

I know these are basics in programming (not just in R), but I'm not a programmer and am pretty new to this area. I would appreciate any help. The simpler the code the better for me (please include a short explanation). I would like to understand this kind of looping (or looping in general) because I will have to work on this type of question very often in the future. Thank you very much.

Good that you provided sample data, but it is based on random generation of values, so it changes on every run and is not reproducible. Use something like `set.seed(1234)` to get a constant result. - Andrew Lavers, May 27 '17 at 21:17

score 1 · Answer 1 · answered May 27 '17 at 21:07


You can also use aggregate:

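# Average each selected column within every hh_01 group;
# [-1] drops the extra grouping column (Group.1) that aggregate adds.
dfnew <- aggregate(df[c("hh_01", "vill", "dist")], by = list(df$hh_01), FUN = mean)[-1]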
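As a variation on the call above, aggregate also has a formula interface, which makes the grouping columns explicit and leaves only dist to be averaged. This is a minimal sketch: the data frame is rebuilt here so the block runs on its own, and the rename to mean_dist is an assumption based on the column name requested in the question.

```r
# Rebuild the sample data (hh_02 is left out because it is not needed here).
set.seed(1)
df <- data.frame(
  hh_01 = c(rep(1:4, each = 3), rep(5:10, each = 5)),
  vill  = c(rep(100, 12), rep(101, 30)),
  dist  = abs(rnorm(42, mean = 0, sd = 1000))
)

# Mean of dist for every hh_01/vill combination that occurs in the data.
dfnew <- aggregate(dist ~ hh_01 + vill, data = df, FUN = mean)

# Rename the averaged column to match the structure asked for in the question.
names(dfnew)[names(dfnew) == "dist"] <- "mean_dist"
dfnew
```

The result has one row per household, with columns hh_01, vill and mean_dist.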


Bea


Andrew Lavers · Answer 2 · 2017-05-27T21:23:26.773

Here is a version using the dplyr package, although I get different numbers from yours. One important characteristic of R is that many functions are vectorized, which loosely means they can operate on a whole structure without an explicit for or apply construct (the for or apply is hidden inside the function). Note also the simplified way to create a data frame.

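# set.seed() must be called as a function; "set.seed = 123" would only create
# a variable named set.seed. The output printed further down came from a run
# without a working seed, so the exact values will differ.
set.seed(123)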
df <- data.frame(
  hh_01 = c(rep(1:4, each = 3), rep(5:10, each = 5)),
  vill = c(rep(100, 12), rep(101, 30)),
  hh_02 = c(2:4, 1, 3, 4, 1:2, 4, 1:3, 6:10, 5, 7:10, 5:6, 8:10, 5:7, 9:10, 5:8, 10, 5:9),
  dist = abs(rnorm(42, mean = 0, sd = 1000))
)



library(dplyr)
df2 <- df %>%
  group_by(hh_01, vill) %>%
  summarize(mean_dist = mean(dist))
df2

#    hh_01  vill mean_dist
#    <int> <dbl>     <dbl>
# 1      1   100 1265.9534
# 2      2   100  855.2477
# 3      3   100  840.0750
# 4      4   100  876.0722
# 5      5   101  574.8193
# 6      6   101  559.2385
# 7      7   101 1177.1751
# 8      8   101  765.6921
# 9      9   101  438.8936
# 10    10   101  331.3354
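To make the "hidden apply" mentioned above visible, here is a rough base R sketch of the same grouping done with split and sapply. It is only an illustration, not what dplyr does internally. The names mean_by_hh, vill_by_hh and result are made up for this example, and the code assumes each household belongs to exactly one village.

```r
# Rebuild the sample data so the block runs on its own.
set.seed(123)
df <- data.frame(
  hh_01 = c(rep(1:4, each = 3), rep(5:10, each = 5)),
  vill  = c(rep(100, 12), rep(101, 30)),
  dist  = abs(rnorm(42, mean = 0, sd = 1000))
)

# split() cuts dist into one numeric vector per hh_01 value;
# sapply() then applies mean() to each of those vectors.
mean_by_hh <- sapply(split(df$dist, df$hh_01), mean)

# Take the first vill value of each household (assumes one village per household).
vill_by_hh <- sapply(split(df$vill, df$hh_01), function(v) v[1])

# Assemble the pieces into a new data frame.
result <- data.frame(
  hh_01     = as.integer(names(mean_by_hh)),
  vill      = unname(vill_by_hh),
  mean_dist = unname(mean_by_hh)
)
result
```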

score 0 · Answer 3 · answered May 27 '17 at 21:00

The dplyr package is a great help here.

library(dplyr)

new_df <- group_by(df, hh_01, vill)
new_df <- summarize(new_df, mean_dist=mean(dist))

Example output:

   hh_01  vill mean_dist
   <dbl> <dbl>     <dbl>
1      1   100  666.0538
2      2   100  720.5532

A great dplyr cheatsheet is found here: http://nbviewer.jupyter.org/github/rstudio/cheatsheets/blob/master/source/pdfs/data-transformation-cheatsheet.pdf

No explicit loop is needed: summarize computes mean(dist) once for each group created by group_by, so dplyr takes care of the looping for you.
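Since the question asks how such a loop would look, here is a minimal sketch of an explicit for loop that produces the same kind of table. It is meant for understanding only; the grouped summary is usually the shorter route. The names ids, rows and result are illustrative, and the code assumes each hh_01 value has a single vill.

```r
# Rebuild the sample data so the block runs on its own.
set.seed(1)
df <- data.frame(
  hh_01 = c(rep(1:4, each = 3), rep(5:10, each = 5)),
  vill  = c(rep(100, 12), rep(101, 30)),
  dist  = abs(rnorm(42, mean = 0, sd = 1000))
)

# One loop iteration per distinct household, not per row.
ids <- unique(df$hh_01)

# Pre-allocate the result with one row per household.
result <- data.frame(hh_01 = ids, vill = NA_real_, mean_dist = NA_real_)

for (i in seq_along(ids)) {
  rows <- df$hh_01 == ids[i]                   # TRUE for the rows of this household
  result$vill[i]      <- df$vill[rows][1]      # village of this household
  result$mean_dist[i] <- mean(df$dist[rows])   # mean distance within the household
}
result
```

The key difference from the attempt in the question is that the loop runs over the distinct hh_01 values and selects all matching rows at once, instead of comparing each row with the next one.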
