To explain my problem, I have created the following df:

hh_01 <- c(rep(1:4, each = 3), rep(5:10, each = 5))
vill <- c(rep(100, 12), rep(101, 30))
hh_02 <- c(2:4, 1, 3, 4, 1:2, 4, 1:3, 6:10, 5, 7:10, 5:6, 8:10, 5:7, 9:10, 5:8, 10, 5:9)
set.seed(1); dist <- abs(rnorm(42, mean = 0, sd = 1000))
df <- matrix(c(hh_01, vill, hh_02, dist), nrow = 42, ncol = 4)
colnames(df) <- c("hh_01", "vill", "hh_02", "dist")
df <- as.data.frame(df)
df
   hh_01 vill hh_02       dist
1      1  100     2 1728.39791
2      1  100     3  979.05280
3      1  100     4  972.09301
4      2  100     1  461.72457
5      2  100     3  384.84236
6      2  100     4  523.10665
7      3  100     1  482.88891
8      3  100     2  218.27501
9      3  100     4  878.32424
10     4  100     1   41.75679
11     4  100     2  967.72103
12     4  100     3  661.80881
13     5  101     6  851.74364
14     5  101     7  852.48595
15     5  101     8  471.51824
16     5  101     9  862.90742
17     5  101    10  750.57410
18     6  101     5 1714.03797
19     6  101     7   93.43975
20     6  101     8  640.15912
21     6  101     9  601.66437
22     6  101    10  969.44271
23     7  101     5   77.95871
24     7  101     6  604.71114
25     7  101     8  169.18386
26     7  101     9  435.42663
27     7  101    10  604.22278
28     8  101     5  475.18935
29     8  101     6   13.09895
30     8  101     7 2873.04565
31     8  101     9 1019.03810
32     8  101    10   41.51445
33     9  101     5  914.63453
34     9  101     6   67.62432
35     9  101     7   85.45653
36     9  101     8  971.21044
37     9  101    10 2074.87280
38    10  101     5   98.43913
39    10  101     6  437.63773
40    10  101     7  620.47573
41    10  101     8  376.56226
42    10  101     9 1013.93106

My task: for each value of hh_01, calculate the mean of dist over all rows with that value and save the result in a new df with the following structure:

hh_01  vill  mean_dist
1      100   1226.515
2      100   .......

I know I have to use a for loop (or maybe sapply/lapply as an alternative), but I don't know how to finish this command:

for (i in seq(along=df[,df$hh_01])){
  ifelse(df$hh_01[i] == df$hh_01[i+1])
}

I know these are programming basics (not just in R), but I'm not a programmer and am pretty new to this area. I would appreciate any help. The simpler the code the better for me (with a short explanation, please). I would like to understand this kind of looping (or looping in general) because I will have to deal with this type of question very often in the future. Thank you very much.

Jaap
Mapos
  • Good that you provided sample data, but it is based on randomly generated values, so it changes on every run and is not reproducible. Use something like `set.seed(1234)` to get a constant result. – Andrew Lavers May 27 '17 at 21:17
  • Changed. Thanks for your note. – Mapos May 27 '17 at 21:33

3 Answers

You can also use aggregate:

dfnew <- aggregate(df[c("hh_01", "vill", "dist")], by = list(df$hh_01), mean)[-1]
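
As a possible variant, aggregate also has a formula interface, which groups by both hh_01 and vill and leaves only a rename to get the mean_dist column asked for in the question. This is a minimal sketch, not part of the original answer; df_small is a hypothetical data frame built from the first six rows printed in the question.

```r
# Hypothetical small data frame: the first six rows printed in the question.
df_small <- data.frame(
  hh_01 = c(1, 1, 1, 2, 2, 2),
  vill  = c(100, 100, 100, 100, 100, 100),
  dist  = c(1728.39791, 979.05280, 972.09301,
            461.72457, 384.84236, 523.10665)
)

# "dist ~ hh_01 + vill" reads: summarise dist for each hh_01/vill combination.
res <- aggregate(dist ~ hh_01 + vill, data = df_small, FUN = mean)

# aggregate keeps the name "dist" for the result column, so rename it.
names(res)[names(res) == "dist"] <- "mean_dist"
res
```

For hh_01 = 1 this gives a mean_dist of about 1226.515, the value expected in the question.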
Bea

Here is a version using the dplyr package, although I get different numbers from yours because the random values are not the same. One important characteristic of R is that many functions are vectorized, which loosely means they can operate on a whole structure without an explicit for or apply construct (the loop is hidden inside the function). Note also the simpler way to create a data frame.

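# set.seed() must be called as a function; `set.seed = 123` only creates a variable and seeds nothing.
# The output printed further down came from the original unseeded run, so exact values will differ.
set.seed(123)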
df <- data.frame(
  hh_01 = c(rep(1:4, each = 3), rep(5:10, each = 5)),
  vill = c(rep(100, 12), rep(101, 30)),
  hh_02 = c(2:4, 1, 3, 4, 1:2, 4, 1:3, 6:10, 5, 7:10, 5:6, 8:10, 5:7, 9:10, 5:8, 10, 5:9),
  dist = abs(rnorm(42, mean = 0, sd = 1000))
)



library(dplyr)
df2 <- df %>%
  group_by(hh_01, vill) %>%
  summarize(mean_dist = mean(dist))
df2

#    hh_01  vill mean_dist
#    <int> <dbl>     <dbl>
# 1      1   100 1265.9534
# 2      2   100  855.2477
# 3      3   100  840.0750
# 4      4   100  876.0722
# 5      5   101  574.8193
# 6      6   101  559.2385
# 7      7   101 1177.1751
# 8      8   101  765.6921
# 9      9   101  438.8936
# 10    10   101  331.3354
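
To illustrate the point about hidden loops without any extra package, here is a minimal base R sketch using tapply, which applies a function to each group of a vector in a single call. It is not part of the original answer; df_small is a hypothetical data frame built from the first six rows printed in the question.

```r
# Hypothetical small data frame: the first six rows printed in the question.
df_small <- data.frame(
  hh_01 = c(1, 1, 1, 2, 2, 2),
  vill  = c(100, 100, 100, 100, 100, 100),
  dist  = c(1728.39791, 979.05280, 972.09301,
            461.72457, 384.84236, 523.10665)
)

# tapply(values, groups, function): one mean of dist per distinct hh_01.
# The loop over the groups happens inside tapply.
means <- tapply(df_small$dist, df_small$hh_01, mean)
means
```

The result is a named vector with one entry per hh_01 value, so means["1"] is the mean distance for household 1.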
Andrew Lavers

The dplyr package is a great help here.

library(dplyr)

new_df <- group_by(df, hh_01, vill)
new_df <- summarize(new_df, mean_dist=mean(dist))

Example output:

   hh_01  vill mean_dist
   <dbl> <dbl>     <dbl>
1      1   100  666.0538
2      2   100  720.5532

A great dplyr cheatsheet is found here: http://nbviewer.jupyter.org/github/rstudio/cheatsheets/blob/master/source/pdfs/data-transformation-cheatsheet.pdf

summarize applies mean to each group created by group_by, so it takes care of the looping for you and no explicit for loop is needed.
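
Since the question asks how the loop itself would look, here is a rough sketch of the explicit for loop that the grouped summary replaces. It is an illustration only, not how dplyr is implemented; df_small (the first six rows printed in the question), groups and in_group are hypothetical names.

```r
# Hypothetical small data frame: the first six rows printed in the question.
df_small <- data.frame(
  hh_01 = c(1, 1, 1, 2, 2, 2),
  vill  = c(100, 100, 100, 100, 100, 100),
  dist  = c(1728.39791, 979.05280, 972.09301,
            461.72457, 384.84236, 523.10665)
)

# One row per distinct hh_01/vill combination, with an empty result column.
groups <- unique(df_small[c("hh_01", "vill")])
groups$mean_dist <- NA_real_

# Loop over the groups, not over the rows of the original data.
for (i in seq_len(nrow(groups))) {
  # TRUE for every row of df_small that belongs to group i
  in_group <- df_small$hh_01 == groups$hh_01[i] &
    df_small$vill == groups$vill[i]
  groups$mean_dist[i] <- mean(df_small$dist[in_group])
}
groups
```

The key difference from the attempt in the question is that each step compares all rows against one group at once, instead of comparing row i with row i+1.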