Is there an easy way to do this in R? I have 3 columns (child, father, mother) and about 5000 rows. I want to set 25%, 50% or 75% of the entries in the father column to zero (0), i.e. to treat that proportion of fathers as unknown. The number of entries set to zero should be worked out per father, based on the number of rows (children) each father has. In the data below, for example, I would expect the script to replace 25% of the g, k, u and x entries with 0. Thanks.


child   father mother
1          g      m1
2          g      m2
3          g      m1
4          g      m2
5          g      m1
6          g      m2
7          k      m1
8          k      m2
9          k      m1
10          k      m2
11          u      m1
12          u      m2
13          u      m1
14          u      m2
15          u      m1
16          x      m2
17          x      m1
18          x      m2
19          x      m1
20          x      m2
nolyugo

1 Answer

This works within each group of father and returns a vector with 25% of the cases in that group set to 0. Assigning the result back to the original column should give you what you want. First, the example data:

test <- read.table(textConnection("child father mother
1 g  m1
2 g  m2
3 g  m1
4 g  m2
5 g  m1
6 g  m2
7 k  m1
8 k  m2
9 k  m1
10 k  m2
11 u  m1
12 u  m2
13 u  m1
14 u  m2
15 u  m1
16 x  m2
17 x  m1
18 x  m2
19 x  m1
20 x  m2"),
header=TRUE,stringsAsFactors=FALSE)

I round the 25% down to be conservative. floor could be replaced with round or ceiling if appropriate.

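test$father <- ave(test$father, test$father,
    FUN = function(x) {
        # Number of rows to set to 0 in this father's group, rounded down
        n_zero <- floor(length(x) * 0.25)
        # seq_len() gives an empty index when n_zero is 0;
        # 1:0 would still overwrite the first row of the group
        x[seq_len(n_zero)] <- 0
        x
    }
)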

The result:

> test
   child father mother
1      1      0     m1
2      2      g     m2
3      3      g     m1
4      4      g     m2
5      5      g     m1
6      6      g     m2
7      7      0     m1
8      8      k     m2
9      9      k     m1
10    10      k     m2
11    11      0     m1
12    12      u     m2
13    13      u     m1
14    14      u     m2
15    15      u     m1
16    16      0     m2
17    17      x     m1
18    18      x     m2
19    19      x     m1
20    20      x     m2
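
The question also asks about 50% and 75%, and the code above always zeroes the first rows of each group. A minimal sketch of one way to generalize it follows. The helper name `mask_fathers` and its `prop` and `seed` arguments are hypothetical, not part of the answer above, and it assumes `father` is a character vector (as with `stringsAsFactors=FALSE`) and that picking the rows at random is acceptable.

```r
# Hypothetical helper: set a proportion `prop` of each father's rows to "0",
# choosing the rows at random within each father group.
mask_fathers <- function(father, prop, seed = NULL) {
  if (!is.null(seed)) set.seed(seed)  # optional, for reproducible masking
  ave(father, father, FUN = function(x) {
    n_zero <- floor(length(x) * prop)          # rows to mask, rounded down
    x[sample.int(length(x), n_zero)] <- "0"    # no-op when n_zero is 0
    x
  })
}

# Same group sizes as the example data: g = 6, k = 4, u = 5, x = 5
father <- rep(c("g", "k", "u", "x"), times = c(6, 4, 5, 5))

masked <- lapply(c(p25 = 0.25, p50 = 0.50, p75 = 0.75),
                 function(p) mask_fathers(father, p, seed = 1))

# Count of masked rows per original father, one column per proportion
zero_counts <- sapply(masked, function(m) tapply(m == "0", father, sum))
print(zero_counts)
```

Swapping `floor` for `round` or `ceiling` inside the helper changes how fractional counts are handled, in the same way as in the original code.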
thelatemail
  • It behaves strangely with a CSV file: it reshuffles the father column. I don't know why this is happening. – nolyugo Sep 11 '12 at 18:14
  • @nolyugo - it's a sorting issue as `tapply` will return the groups of `father` in ascending order - I have made an edit to fix this by using `ave` instead. – thelatemail Sep 11 '12 at 20:19
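
The ordering difference described in the comments can be sketched on a small, deliberately unsorted vector. The data here is made up for illustration. `tapply` returns one result per group in sorted group order, so flattening it regroups the rows, while `ave` writes each group's result back into the original row positions.

```r
# Made-up, deliberately unsorted father column
father <- c("k", "g", "k", "g", "u", "g", "k", "g")

# tapply: results come back grouped by father in ascending order
by_group <- tapply(father, father, function(x) x)
flattened <- unlist(by_group, use.names = FALSE)
print(flattened)   # all g rows first, then k, then u

# ave: results are placed back at the original row positions
in_place <- ave(father, father, FUN = function(x) x)
print(in_place)    # same order as the input
```

This is why assigning a flattened `tapply` result back to the column can appear to reshuffle it when the file is not already sorted by father.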