
I have a small database that I'm working with in RStudio, and I need to classify some students based on which classes they flunked.

for (i in 1:(nrow(BDbruno2)-1))
{
  # guard against running past the last row; I tested without it but the problem persists
  if(i >= nrow(BDbruno2))
  {
    break 
  }else 
  {
    # check whether the codes are still the same
    if((BDbruno2$Code[i] == BDbruno2$Code[i+1]) && (i < nrow(BDbruno2)))
    {
      auxIndice <- BDbruno2$nLinha[i]
      auxTurmas <- BDbruno2$tempo[i]

      for(j in (i+1):nrow(BDbruno2))
      {
        # check whether the codes are the same; if not, collapse all collected classes into one string and store it on every row of that code
        if(BDbruno2$Code[j-1] != BDbruno2$Code[j]){
          BDbruno2$turmasCalc1[auxIndice] <- paste0(auxTurmas, collapse = " ")
          # skip the rows with the same code that I already checked
          i <- j
          # I tested without this break; it only makes the code take longer to finish
          break
        } else
        {
          #saving all rows where the code is the same
          auxIndice <- c(auxIndice, BDbruno2$nLinha[j])
          # the line below is where I get my problem:
          # it collects the classes for the same code, but further along in the loop this variable gets messed up
          auxTurmas <- c(auxTurmas, BDbruno2$tempo[j])
        }
      }
    }
  }
}

The auxTurmas variable doesn't end up with the expected result inside the loop, but it does when I run the code line by line.

There are 4 rows that belong to the same student. I collect all of his classes, which are 1 4 7 8, into a new variable (turmasCalc1), and every one of his rows should get these numbers. Inside the for loop, however, only the first row is correct: the second gets 4 7 8, and the other two get 7 8.

Weirdly, if I run it for i = nLinha in 1:46, it works as it should, but I need it for all my cases (and this happens a lot). I'm not sure, but it doesn't seem to be an issue with the break I used, and I can't see what makes this happen. Can somebody shed some light on this?

Edit: Sorry for the lack of information. Here is a sample of the data frame as it currently comes out. It's supposed to show 1 4 7 8 on the rows where nLinha is in 46:49; a similar problem occurs elsewhere in the same data frame.

Code             Calculo1_Turma          tempo   turmasCalc1  nLinha
1635340632       2014/1 - MAT154-B          11         11     45
1638717605       2009/1 - MAT154-E          1     1 4 7 8     46
1638717605       2010/3 - MAT154-I          4       4 7 8     47
1638717605       2012/1 - MAT154-A          7         7 8     48
1638717605       2012/3 - MAT154-D          8         7 8     49
1643222643       2011/1 - MAT154-D          5         5 6     50
1643222643       2011/3 - MAT154-B          6         5 6     51
1645485641       2009/1 - MAT154-B          1           1     52

This is what I'm trying to get:

Code             Calculo1_Turma          tempo   turmasCalc1  nLinha
1635340632       2014/1 - MAT154-B          11         11     45
1638717605       2009/1 - MAT154-E          1     1 4 7 8     46
1638717605       2010/3 - MAT154-I          4     1 4 7 8     47
1638717605       2012/1 - MAT154-A          7     1 4 7 8     48
1638717605       2012/3 - MAT154-D          8     1 4 7 8     49
1643222643       2011/1 - MAT154-D          5         5 6     50
1643222643       2011/3 - MAT154-B          6         5 6     51
1645485641       2009/1 - MAT154-B          1           1     52

Edit2: Sorry again for the confusion. Here is the dput()-generated code:

BDbruno2 <- structure(list(
  Code = c("1634171640", "1634171640", "1634171640", "1635340632",
           "1638717605", "1638717605", "1638717605", "1638717605",
           "1643222643", "1643222643", "1645485641"),
  Calculo1_Turma = c("2009/1 - MAT154-D", "2009/3 - MAT154-A", "2010/3 - MAT154-I",
                     "2014/1 - MAT154-B", "2009/1 - MAT154-E", "2010/3 - MAT154-I",
                     "2012/1 - MAT154-A", "2012/3 - MAT154-D", "2011/1 - MAT154-D",
                     "2011/3 - MAT154-B", "2009/1 - MAT154-B"),
  tempo = c(1, 2, 4, 11, 1, 4, 7, 8, 5, 6, 1),
  turmasCalc1 = c("1", "2", "4", "11", "1 4 7 8", "4 7 8", "7 8", "7 8", "5 6", "5 6", "1"),
  nLinha = 42:52),
  .Names = c("Code", "Calculo1_Turma", "tempo", "turmasCalc1", "nLinha"),
  row.names = c(162L, 305L, 714L, 3880L, 210L, 715L, 887L, 924L, 2157L, 2446L, 60L),
  class = "data.frame")

This one includes a few more rows than the sample above, and those extra rows are working as they should. Just a recap: I'm supposed to get 1 4 7 8 in turmasCalc1 where nLinha is in 46:49, but there seems to be an issue with the i index. The issue happens whenever the same code appears 3 or more times, not only specifically 4.

  • Can you also share sample data so that people can test the code? [How to make a great R reproducible example?](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) – Tung Apr 17 '18 at 00:32
  • This looks like a terrible way to approach this problem. Can you describe in words what you are trying to accomplish, and share a little sample input and desired output? Probably it can be done in 5-10 lines of `dplyr` or `data.table` with no loops. – Gregor Thomas Apr 17 '18 at 00:54
  • I don't know if I put the data in the right way. I'm new here, so feel free to correct me – Bruno H. Rodrigues Apr 17 '18 at 02:08
  • @BrunoH.Rodrigues: please use `dput()`. Please read the link that I posted – Tung Apr 17 '18 at 02:35
  • Where is the `Code` variable in the example data? Did you rename it to English, or is it something else? – Melissa Key Apr 17 '18 at 02:44
  • I messed up before: I "translated" the `code` var because it was a bunch of letters before. I think I got it right this time with the data frame code too. Thanks for the help – Bruno H. Rodrigues Apr 17 '18 at 11:28

1 Answer

I'm going to take a stab at starting this. I'm probably not going to get your output right (because I really have no idea what you want) - but I'm hoping this gets you started in the right direction.

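library(dplyr)
BDbruno2 %>%
  # one group per student code
  group_by(CODIGO) %>%
  # collapse each student's tempo values into a single string
  summarize(turmasCalc1 = paste(tempo, collapse = " ")) %>%
  # attach the summary to every original row; the dot is the piped-in summary
  left_join(select(BDbruno2, -turmasCalc1), ., by = "CODIGO")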

This is an R-style solution to the problem. R is an interpreted language - which is great for changing things frequently on the fly, but fairly slow for heavy computation. The solution is what is known in R as vectorized functions - which means that the heavy computation stuff and loops are implemented in a compiled language, and R just passes the data on. This means that writing loops IN R is almost always a bad idea - we want to call the vectorized version of the function which is going to do the looping for us. Here's a longer version of what I just said with some examples: http://alyssafrazee.com/2014/01/29/vectorization.html
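To make the loop-versus-vectorized contrast concrete, here is a minimal base R sketch. The toy data frame `df` and the column name `turmas` are made up for illustration and are not the asker's data; `ave()` is a base R function that applies a function within groups and returns a vector as long as its input.

```r
# Toy data, invented for illustration (not the asker's data)
df <- data.frame(
  Code  = c("a", "a", "a", "b"),
  tempo = c(1, 4, 7, 5),
  stringsAsFactors = FALSE
)

# Grouped paste without an explicit loop: ave() splits tempo by Code,
# applies FUN to each piece, and recycles the result back onto every row
df$turmas <- ave(
  as.character(df$tempo), df$Code,
  FUN = function(x) paste(x, collapse = " ")
)

df$turmas
```

Every row of group "a" should end up with the same combined string, which is the behaviour the nested loops in the question were trying to build by hand.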

More recently, another revolution has happened in R - which is manifested in a suite of packages called tidyverse, one of which I used above - dplyr. The entire suite of packages is amazing, but the big one is the use of pipes (%>%). All they do is take the result of the previous function, and set it as the first argument of the next function - but they allow us to linearize the function calls, and see what is going on.
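A small sketch of what the pipe does, assuming `dplyr` is installed (it re-exports `%>%` from `magrittr`). The vector `x` and the names `nested` and `piped` are illustrative only.

```r
library(dplyr)  # re-exports the %>% pipe from magrittr

x <- c(1, 4, 7, 8)

# These two calls are equivalent: the pipe passes x as the first argument
nested <- paste(x, collapse = " ")
piped  <- x %>% paste(collapse = " ")

identical(nested, piped)
```

With a single call the pipe buys little; the benefit shows up when several steps are chained and would otherwise have to be read inside-out.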

Take the code above - I'm first grouping by CODIGO (which I'm assuming is the same as Code in the for loop you provided). No more looking for whether the code is the same or not, we are looking at the data in chunks, and the code is the same for everything in the chunk. The next function is summarize - which says that we want to generate a single summary for each code, and we're going to get it by pasting together the elements of tempo.

Finally, we're going to merge this back in with the original data set using left_join. Here, I want to make turmasCalc1 the last variable instead of the first, so I wanted the original data set (BDbruno2) to be the first argument. This is why it's called with a single dot after the comma - I'm overriding the default behavior of entering the result as the first argument, and making it the second instead.
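The whole chain can be tried on a few of the sample rows from the question. This is a sketch under some assumptions: the grouping column is named `Code` as in the posted sample (rather than `CODIGO`), `sample_df` and `result` are names chosen here, and the join key is spelled out with `by = "Code"`.

```r
library(dplyr)

# Subset of the sample rows from the question (the grouping column is
# named Code there)
sample_df <- data.frame(
  Code   = c("1635340632", "1638717605", "1638717605", "1638717605",
             "1638717605", "1643222643", "1643222643"),
  tempo  = c(11, 1, 4, 7, 8, 5, 6),
  nLinha = 45:51,
  stringsAsFactors = FALSE
)

result <- sample_df %>%
  group_by(Code) %>%
  summarize(turmasCalc1 = paste(tempo, collapse = " ")) %>%
  # the dot stands for the piped-in summary, placed as the second argument
  left_join(sample_df, ., by = "Code")

result$turmasCalc1
```

Because `sample_df` is the left-hand table, the row order of the original data is kept and `turmasCalc1` is appended as the last column.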

Melissa Key
  • Thanks for the help. I ran your code but it's not clear to me what it is supposed to return; I also read the link you sent, but it's still not clear. Could you tell me more about it? – Bruno H. Rodrigues Apr 17 '18 at 11:54
  • Let's back up a second - what is your input, and what is the desired output? Is `turmasCalc1` in the original data frame, or are you trying to create it? – Melissa Key Apr 17 '18 at 12:18
  • I'm trying to define/create `turmasCalc1` based on the `tempo` var of each row (when I have the codes) – Bruno H. Rodrigues Apr 17 '18 at 12:21
  • ok. that is what I thought. does the code I sent you give you the right result? (if not, what is wrong about it?) – Melissa Key Apr 17 '18 at 12:24
  • Well, I ran the code and I get the same result. I'm checking the `turmasCalc1` var but it didn't change. What should I expect as a return? – Bruno H. Rodrigues Apr 17 '18 at 17:27
  • As I said before, I'm not 100% sure what output you want. The code I wrote should set `turmasCalc1 = "1 4 7 8"` for all rows with `Code = 1638717605`. What is your desired output? – Melissa Key Apr 17 '18 at 17:30
  • That's exactly what I want, `turmasCalc1 = "1 4 7 8"` for all these 4 cases, but only the first row of the repetition is getting it this way, the next one is missing `1`, and the other two are missing `1 4`, which doesn't make any sense, because the `turmasCalc1` var is defined and closed when the for-loop ends. What can make my values get lost? – Bruno H. Rodrigues Apr 17 '18 at 18:25
  • I found a few typos which I've corrected. (Sorry - this was posted before you posted the data). It should give you what you want now. Note - if `turmasCalc1` is not yet defined and you get an error, replace the `left_join` command with `left_join(BDbruno2, .)` – Melissa Key Apr 17 '18 at 18:33
  • Worked like a charm, thank you very much, but I'm still confused about why my code won't work. I'll try to get into this `dplyr` from now on, it seems to be very useful. – Bruno H. Rodrigues Apr 17 '18 at 19:03
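
On the open question of why the original loop misbehaves, one likely explanation (not confirmed anywhere in the thread) is that assigning to the index inside an R `for` loop does not affect the next iteration. `i <- j` therefore does not skip the rows already handled: the outer loop starts again at the second row of the group, collects only the remaining classes, and overwrites the rows that were filled in correctly. The sketch below isolates that behaviour; `visited` is a name made up for this example.

```r
visited <- integer(0)

for (i in 1:5) {
  visited <- c(visited, i)
  # Attempt to skip ahead; this changes i only for the rest of this iteration
  i <- 5
}

visited
```

All five values are still visited. A `while` loop that advances its own counter would honor the skip, although the grouped approach in the answer avoids the index bookkeeping altogether.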