This question is similar to this one asked earlier but not quite. I would like to iterate through a large dataset (~500,000 rows) and for each unique value in one column, I would like to do some processing of all the values in another column.

Here is code that I have confirmed to work:

df = matrix(nrow=783,ncol=2)
counts = table(csvdata$value)
p = (as.vector(counts))/length(csvdata$value)
D = 1 - sum(p**2)

The only problem with it is that it returns the value D for the entire dataset, rather than returning a separate D value for each set of rows where ID is the same.

Say I had data like this:
[Screenshot of sample data with ID and value columns; image not reproduced]

How would I be able to do the same thing as the code above, but return a D value for each group of rows where ID is the same, rather than for the entire dataset? I imagine this requires a loop, and creating a matrix to store all the D values in with ID in one column and the value of D in the other, but not sure.

InterLinked
  • Please edit your code to make your example reproducible by including sample data. To start, `df <- matrix(nrow = 783, ncol = 2)` creates a matrix with `NA` entries in 2 unnamed columns; so there are no columns `ID` and `value`. Secondly, `df` is a matrix so you cannot use `$` for indexing columns. – Maurits Evers Jul 16 '18 at 23:12
  • You still haven't provided sample data. Please review how to provide a [minimal reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example), including sample data. We cannot help if you don't provide those details. – Maurits Evers Jul 16 '18 at 23:24
  • *"The data doesn't matter"* Sample data *always* matters! Keep in mind that you are asking people for help that have *no idea* what you have been doing, what you're trying to do, and what data you're working with. That's why MCVEs are a critical component when asking for debugging help. Your code example is still not reproducible. For example, `i` inside your loop is not defined anywhere. – Maurits Evers Jul 16 '18 at 23:31
  • *"Isn't it automatically defined as "for each" (i.e. "for i in 1:9") will run the loop 9 times"* What makes you think `i` gets "automatically" defined? There is no such thing as a variable being "automatically" defined. In `for (i in 1:9)` you define `i`; in your case you don't. – Maurits Evers Jul 16 '18 at 23:34
  • I think we're missing each other's points. There seems to be a misunderstanding about what is meant by providing a reprex/MCVE. I'm happy to help, but I have no clue what you're trying to do: I can't work with your code because your example/issue is not reproducible. I don't understand the logic because you don't provide any details. I don't know what you're trying to achieve because you don't provide your expected output. The combination of these three shortcomings makes it very difficult (impossible) to help. – Maurits Evers Jul 16 '18 at 23:43
  • Right, I've added a simple example below. Please take a look. – Maurits Evers Jul 16 '18 at 23:58

2 Answers

Ok, let's work with "In short, I would like whatever is in the for loop to be executed for each block of data with a unique value of "ID"".

In general, you can group rows by the values in one column (e.g. "ID") and then apply a transformation to the values in other columns, one group at a time. In the tidyverse this looks like:

library(tidyverse)
df %>%
    group_by(ID) %>%
    mutate(value.mean = mean(value))
## A tibble: 8 x 3
## Groups:   ID [3]
#  ID    value value.mean
#  <fct> <int>      <dbl>
#1 a        13       12.6
#2 a        14       12.6
#3 a        12       12.6
#4 a        13       12.6
#5 a        11       12.6
#6 b        12       15.5
#7 b        19       15.5
#8 cc4      10       10.0

Here we calculate the mean of value per group, and add these values to every row. If instead you wanted to summarise values, i.e. keep only the summarised value(s) per group, you would use summarise instead of mutate.

library(tidyverse)
df %>%
    group_by(ID) %>%
    summarise(value.mean = mean(value))
## A tibble: 3 x 2
#  ID    value.mean
#  <fct>      <dbl>
#1 a           12.6
#2 b           15.5
#3 cc4         10.0

The same can be achieved in base R using one of tapply, ave, by. As far as I understand your problem statement there is no need for a for loop. Just apply a function (per group).
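
As a rough sketch of those base R alternatives, the snippet below computes the same per-group mean three ways. It rebuilds the sample data inline so that it runs on its own; the names `group_means` and `summary_df` are only illustrative and do not come from the original posts.

```r
# Sample data with the same ID and value columns as the example in this answer
df <- data.frame(
    ID = c("a", "a", "a", "a", "a", "b", "b", "cc4"),
    value = c(13, 14, 12, 13, 11, 12, 19, 10))

# tapply: one summarised value per group, returned as a named array
group_means <- tapply(df$value, df$ID, mean)

# ave: the group result repeated on every row (comparable to mutate);
# note that FUN must be passed by name
df$value.mean <- ave(df$value, df$ID, FUN = mean)

# aggregate: one row per group as a data.frame (comparable to summarise)
summary_df <- aggregate(value ~ ID, data = df, FUN = mean)

print(group_means)
print(summary_df)
```

`tapply` and `aggregate` collapse each group to a single value, whereas `ave` keeps the original number of rows, so the choice mostly depends on whether a per-row or a per-group result is wanted.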


Sample data

df <- read.table(text =
    "ID value
a 13
a 14
a 12
a 13
a 11
b 12
b 19
cc4 10", header = TRUE)

Update

To summarise the comments and chat, this should be what you're after.

# Sample data
set.seed(2017)
csvdata <- data.frame(
    microsat = rep(c("A", "B", "C"), each = 8),
    allele = sample(20, 3 * 8, replace = TRUE))

csvdata %>%
    group_by(microsat) %>%
    summarise(D = 1 - sum(prop.table(table(allele))^2))
## A tibble: 3 x 2
#  microsat     D
#  <fct>    <dbl>
#1 A        0.844
#2 B        0.812
#3 C        0.812

Note that prop.table returns fractions and is shorter than your (as.vector(counts))/length(csvdata$value). Note also that you can reproduce your results for all values (irrespective of ID) if you omit the group_by line.
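
To make the equivalence concrete, here is a minimal base R sketch on a small made-up data set (the allele values are invented for illustration, and the helper name `diversity_d` is hypothetical). It checks that `prop.table(table(x))` gives the same D as the manual division, then computes one D per group with `tapply` instead of a loop.

```r
# Hypothetical helper: 1 minus the sum of squared proportions
diversity_d <- function(x) 1 - sum(prop.table(table(x))^2)

# Made-up data using the same column names as the example above
csvdata <- data.frame(
    microsat = rep(c("A", "B"), each = 4),
    allele = c(1, 1, 2, 2, 3, 3, 3, 4))

# Formulation from the question, applied to the whole allele column
counts <- table(csvdata$allele)
p <- as.vector(counts) / length(csvdata$allele)
D_manual <- 1 - sum(p^2)

# Same calculation via prop.table
D_all <- diversity_d(csvdata$allele)

# One D per microsat, with no explicit for loop
D_by_group <- tapply(csvdata$allele, csvdata$microsat, diversity_d)

print(c(manual = D_manual, prop_table = D_all))  # both 0.71875
print(D_by_group)                                # A = 0.5, B = 0.375
```

Wrapping the calculation in a function keeps the grouping step independent of the statistic, so the same `tapply` call (or a `summarise` call) should work for any other per-group measure.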

Maurits Evers
  • @InterLinked What's `a` and `b`? They are undefined. There is *definitely* no need for an explicit `for` loop. – Maurits Evers Jul 17 '18 at 00:04
  • @InterLinked ??? If you `df %>% group_by(ID) %>% mutate(new_val = some_function(value))` you *"grab all the "value" entries for the rows with that ID"*. That's exactly what `group_by` + `mutate` is there for! Or in base R `ave(data$value, data$ID, FUN = some_function)`! I think you need to read up on some basic R data transformation concepts: For example, the `dplyr` methods, or in base R `ave`, `tapply`, `by` and so on. – Maurits Evers Jul 17 '18 at 01:23
  • @InterLinked `group_by` is part of `dplyr`; so you need to load the library. Or better yet, do `library(tidyverse)` (see my code above). Either way. I don't think we're getting anywhere here, and I've run out of ideas and time how to phrase what it is that you need to supply in order for others to help. Perhaps somebody else is able to pick up. Good luck. – Maurits Evers Jul 17 '18 at 01:28
  • @InterLinked Look at my example code! It's fully reproducible and should get you started. – Maurits Evers Jul 17 '18 at 01:30
  • @InterLinked `df` has to be a `data.frame`, not a `matrix`. If necessary convert to `data.frame` with `as.data.frame`. See my example with reproducible data. That's why an MCVE includes sample data (not a screenshot but provided through `dput`; it's all in the links I gave you earlier;-) – Maurits Evers Jul 17 '18 at 03:01
  • @InterLinked Sigh. Again: **No reproducible data -> Not able to help**. I don't know what `csvdata` is. My code works with the sample data I give (which is based on your screenshot). You should've provided representative sample data. You didn't. – Maurits Evers Jul 17 '18 at 03:10
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/176098/discussion-between-maurits-evers-and-interlinked). – Maurits Evers Jul 17 '18 at 03:23

A base R option would be

df1$value.mean <- with(df1, ave(value, ID))
akrun