1

What's the right way to do this in R:

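for (row in 1:10)
{
 # nrow() counts the matching rows; length() on a data.frame would return the number of columns
 counts[row] <- nrow(otherData[otherData[["some property"]] == otherList[row], , drop = FALSE])
}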

In other words, put into each row of a new anything (matrix, data.frame, whatever) the count of those rows in another anything (matrix, data.frame, whatever) that equal the corresponding entry in some other list (again abstractly speaking, not literally list object)?

E.g. say x = otherData is

   a   b   c
d  1   2   3
e  1   3   4
f  2   5   6
g  1   5   3

And say the "otherList" is the first column of x, so I want to count how many of x's rows have each of 1, 2, 3, etc. as their first entry.

So I want counts to be

3,
1,
0,
(0s as long as this counts list goes)

Note that being able to select out that data subset matters more than getting its length; I need to use the subset for other computations as well, though again I want to select it out row-by-row and have the output of whatever computations I do stored in the corresponding row of the results (in this case counts) matrix.

I can obviously do this with a for loop, but what's the clever way to skip the loop?

Apologies if this duplicates another question. This seems like a very basic question, but I'm not sure what terms to search for. This question seems similar and would work for getting lengths, though I'm not clear on how to apply it in the general case.

EDIT

Here's an example. We select certain rows of x (here x is like otherData in my description above) that satisfy some row-dependent condition, in this case having a first-column entry equal to row, but the point is that "== row" could be replaced with any condition on row, e.g. "<= otherlist[row]-2" etc.

> x
   condition value
1          2    25
2          9    72
3         41    60
4         41    61
5         25    38
6         41    10
7         41    43
8         41    26
9         41    46
10        12   263
11        26   136
12        24   107
13         9    70
14        12    62
15        12   136
16        34    44
17        12    53
18        32    14
19        32   148
20         4    34

> results = 0*1:20
> results
 [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
> for(row in 1:20) {
+ results[row] = length(x[x[["condition"]]==row,2]) }
> results
 [1] 0 1 0 1 0 0 0 0 2 0 0 4 0 0 0 0 0 0 0 0
Philip
  • This question is not very clear at all. The best way to ask a question here is to build a small toy example that illustrates _exactly_ what you're trying to accomplish and that is [reproducible](http://stackoverflow.com/q/5963269/324364). – joran Jun 25 '12 at 15:22
  • @joran Seeing the answers below, I agree with your assessment and was just about to do that. – Philip Jun 25 '12 at 16:24
  • In your edit, why is it `row in 1:20` and not `row in 1:max(x$condition)`? – Aaron left Stack Overflow Jun 25 '12 at 17:08
  • @Aaron it shouldn't really be either; it's supposed to be arbitrary in the sense of it should be for row in 1:length(results) where it may be that results is 1000 long and no rows of x have any data relevant to most of those rows of results. The point is that I start with an idea in my head of how I want to populate a new list row-by-row, and can thus very naturally write a for-each loop that iterates over rows populating this new list. But can I skip the for-each loop by somehow vectorizing something yet still select elements from another list "by corresponding row." Apologies for confusion. – Philip Jun 25 '12 at 17:40
  • I think I finally understood what you wanted and, if so, it is a very straightforward application of the "[" function. – IRTFM Jun 25 '12 at 18:02

3 Answers

2

Edited:

sapply( 1:20, function(z) sum(x[["condition"]] == z) )
#[1] 0 1 0 1 0 0 0 0 2 0 0 4 0 0 0 0 0 0 0 0

You would be able to substitute a different logical test and the sum would be the number of qualifying rows. (I was never able to figure out why you were using column number 2.) If you were hoping to select out a subset of rows that met a condition (which your example was not illustrating) then you could use this:

x[ x[,1] == test , ]  # e.g.

> x[ x$condition == 9, ]
   condition value
2          9    72
13         9    70

Or if you only wanted the column 'value' that corresponded to the tested 'condition' column , then use:

>  x[ x[['condition']] == 9, "value" ]
[1] 72 70

If you want to apply functions to selected (disjoint) subsets of x and you can create a factor variable as long as the dataframe then you can use aggregate or by to process the split up lists. If you want to use the sapply formalism above, here's an example that computes the separate means for subsets of "values" for rows having rownames that are in "condition":

> sapply( rownames(x), function(z) mean( x[x[["condition"]] == z , "value"]) )
 [1]   NaN  25.0   NaN  34.0   NaN   NaN   NaN   NaN  71.0   NaN   NaN 128.5   NaN   NaN   NaN   NaN
[17]   NaN   NaN   NaN   NaN
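
The same sapply pattern can return several statistics per key at once, and the aggregate route mentioned above can be tried alongside it. The following is only a sketch under assumptions: x is rebuilt from the question's EDIT, and the helper name summarise_key is hypothetical (it does not appear in the original answer).

```r
# Rebuild the example data from the question's EDIT.
x <- data.frame(
  condition = c(2, 9, 41, 41, 25, 41, 41, 41, 41, 12,
                26, 24, 9, 12, 12, 34, 12, 32, 32, 4),
  value     = c(25, 72, 60, 61, 38, 10, 43, 26, 46, 263,
                136, 107, 70, 62, 136, 44, 53, 14, 148, 34)
)

# Hypothetical helper: several statistics for the rows whose
# 'condition' equals the given key.
summarise_key <- function(key) {
  v <- x[x[["condition"]] == key, "value"]
  c(n     = length(v),
    total = sum(v),
    mean  = if (length(v) > 0) mean(v) else NA_real_)
}

# sapply() returns one column per key; t() turns that into one row per key.
# Keys with no matching rows get n = 0, total = 0 and mean = NA.
stats <- t(sapply(1:20, summarise_key))

# aggregate() gives per-group means, but only for keys that occur in the data.
means_by_key <- aggregate(value ~ condition, data = x, FUN = mean)
```

Note the difference in coverage: stats has a row for every key in 1:20, whereas means_by_key only has rows for condition values actually present in x, so the sapply form is the closer match to "fill every row of results".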
IRTFM
  • I know, if you look that's what I'm doing in my example above. The issue is that "9" isn't constant; it varies with the row into which this particular value will be put. So I'm looking for a way to say "for each row in some arbitrary list, build the selected dataset such that TEST" (your term) "equals the value of the list at that row, and store some function of that data subset as the row-th entry in some other list" – Philip Jun 25 '12 at 18:22
  • OK. I gave you the mean of "value" within each category. – IRTFM Jun 25 '12 at 18:26
  • OK This I understand, the form is sapply( list, function(z) action on data (data such that condition = z ) ). This is great; thanks! – Philip Jun 25 '12 at 18:32
1

What about table?

table(factor(x[, 1], x[1, ]))
# 
# 1 2 3 
# 3 1 0

Update

Using the second x table in your question, same solution:

table(factor(x$condition, rownames(x)))
# 
# 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 
# 0  1  0  1  0  0  0  0  2  0  0  4  0  0  0  0  0  0  0  0

Also, try match:

match(x$condition, rownames(x))
# [1]  2  9 NA NA NA NA NA NA NA 12 NA NA  9 12 12 NA 12 NA NA  4
table(match(x$condition, rownames(x)))
# 
# 2  4  9 12 
# 1  1  2  4
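
One possible way to carry the factor idea over from counts to other summaries such as sum() is tapply(). This is a sketch rather than part of the original answer; it assumes x is the data frame from the question's EDIT (rebuilt here), and the names keys, sums and counts are illustrative.

```r
# Rebuild the example data from the question's EDIT.
x <- data.frame(
  condition = c(2, 9, 41, 41, 25, 41, 41, 41, 41, 12,
                26, 24, 9, 12, 12, 34, 12, 32, 32, 4),
  value     = c(25, 72, 60, 61, 38, 10, 43, 26, 46, 263,
                136, 107, 70, 62, 136, 44, 53, 14, 148, 34)
)

# Fix the set of keys up front; condition values outside 1:20 become NA
# in the factor and are dropped from the grouping.
keys <- factor(x$condition, levels = 1:20)

# One entry per key; keys with no matching rows come back as NA.
sums <- tapply(x$value, keys, sum)

# The counts from table() line up with the same keys.
counts <- as.vector(table(keys))
```

Swapping sum for mean, median or another function of a numeric vector follows the same shape; a weighted average needs more than one column per group, so the sapply approach in the other answer may be the easier fit there.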
A5C1D2H2I1M1N2O1R2T1
  • this looks interesting but I'm not sure how this generalizes. Can it do e.g. if instead of length() in my example, I had sum() etc.? – Philip Jun 25 '12 at 17:41
  • @Philip, can you describe what you're trying to achieve or what problem you're trying to solve? Right now, it seems like you keep adding subsequent layers of abstractness to your question. – A5C1D2H2I1M1N2O1R2T1 Jun 25 '12 at 17:51
  • The abstraction is intrinsic to the problem. I'm trying to do a sequence of operations that involve manipulating subsets of data based on a common key (in this case, the row) and then recording the results of those manipulations in a list by row. So for example, I want: a count of how many data items match that key, then I want their average, then their weighted average, standard errors, medians, etc. I'm not even positive that that list is exhaustive, so I need a general solution. I'm using a for loop now and it works fine, but as an R novice I wasn't sure if there was a better way. Thx. – Philip Jun 25 '12 at 17:59
0
> a <- c(seq(1,10))
> a
 [1]  1  2  3  4  5  6  7  8  9 10
> d <- cbind(a,a)
> d
       a  a
 [1,]  1  1
 [2,]  2  2
 [3,]  3  3
 [4,]  4  4
 [5,]  5  5
 [6,]  6  6
 [7,]  7  7
 [8,]  8  8
 [9,]  9  9
[10,] 10 10
> d[,2]
 [1]  1  2  3  4  5  6  7  8  9 10
> d[,2] <- d[,1]*2
> d
       a  a
 [1,]  1  2
 [2,]  2  4
 [3,]  3  6
 [4,]  4  8
 [5,]  5 10
 [6,]  6 12
 [7,]  7 14
 [8,]  8 16
 [9,]  9 18
[10,] 10 20
> 
LanceH
  • That much I understand, but that's not actually what I'm talking about. I'm trying--and I apologize for failing--to describe selecting out only particular rows on which to operate, based on a row condition, and storing the result in the row. So e.g. if, for each i, d[i,2] would be "the sum of all entries of d for which d[n,1] = i". – Philip Jun 25 '12 at 14:57
  • @Philip: Puzzled. In your description of what was needed you said to work on columns. – IRTFM Jun 25 '12 at 15:30