-1

Not sure how to word this issue concisely, so may be a bad title and the answer may be out there already. I'm essentially having an issue with a nested for loop to insert values into a new dataframe.

I have a large datafile with repeated names and associated investments and bonuses that need rearranging to create a new Table/dataframe with totals for each of the associated columns. Here is a simple for of the data table that I have:

Test Data

I have written the following code with the Test Data a csv file written in Excel and saved as Book1:

Name <- c("Name 1","Name 1","Name 1","Name 1","Name 1","Name 2","Name 2","Name 2",
          "Name 3","Name 3","Name 3","Name 3","Name 3","Name 4","Name 4","Name 4")

Initial.Value <- c(142, 847, 982, 867, 898, 437, 169, 478,260, 789, 216, 373, 820, 
                   985, 943, 325)

Bonus.1 <- c(4, 2, 5, 0, 9, 6, 6, 7, 5, 8, 5, 5, 5, 8, 8, 8)

Bonus.2 <- c(4, 4, 0, 10, 5, 10, 2, 10, 8, 3, 9, 6, 3, 3, 2, 1)

Bonus.3 <- c(3, 0, 2, 7, 5, 0, 3, 6, 9, 5, 1, 2, 1, 5, 3, 2)

Bonus.4 <- c(1, 10, 2, 3, 2, 5, 7, 5, 3, 1, 6, 10, 3, 4, 7, 9)

data_file <- data.frame(Name, Initial.Value, Bonus.1, Bonus.2, Bonus.3, Bonus.4)    

Rows <- unique(data_file$Name)

Output_file <- data.frame(matrix(0, ncol = length(Rows), nrow = 5))

colnames(Output_file) <- Rows
rownames(Output_file) <- colnames(data_file)[c(2,3,4,5,6)]



for(i in length(Rows)){ # Looks at each name in turn
  
  Indices_Of_Interest <- which(lapply(data_file$Name, 
                                      function(x) any(match(x, Rows[i]))) == TRUE)
  
  for(k in length(Output_file[, 1])){ # Goes down the Output_File
    
    row_header <- rownames(Output_file)[k]
    col_header <- Rows[i]
    
    Output_file[row_header, col_header] <- sum(data_file[row_header][Indices_Of_Interest, ])
    
  }
  
}

When I run this code line by line it works by updating each cell in turn, but when I run the for loop it seems to only act on the final cell, leaving the other cells as zero, as follows:

Output file, correct structure but not filled correctly

James.H
  • 25
  • 6
  • 2
    As a general rule (there are exceptions), if you're using a loop in R, there's probably a better (meaning more concise and more efficient) way of doing things. Unfortunately, it's hard to help you without easy access to your input data. Please edit your post and include the output from `dput(data_file)` or `dput(head(data_file))`. [This post](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example/5963610#5963610) may be helpful. Oh yes: and welcome to SO! – Limey Jun 20 '20 at 08:57
  • Thanks for the link buddy, I've edited it so that the data can be created from vectors. Is the original output file not okay the way it is though as it is created within the source code? – James.H Jun 20 '20 at 09:22
  • What is your expected output? Do you need `aggregate(.~Name, data_file, sum)` ? – Ronak Shah Jun 20 '20 at 09:26
  • As @Ronak asks, what is your desired output? When replying to a comment, use The at sign followed by the user name. That way the user gets notified. "Buddy" is a bit too generic for SO to handle! – Limey Jun 20 '20 at 09:30
  • The output file has the colnames as the original Names in the first column of data_file, but without any repetitions. The row names are the original column names and in each row entry they have the sum of all the entries (Initial.Value, Bonus.1, etc) that are associated with each name. There's a link at the bottom of the question that shows what the output file should look like. And the final entry shows that the summation part of the code works. – James.H Jun 20 '20 at 09:34
  • Actually James, the output file doesn't tell us what the output should look like: you tell us that it is wrong. What value should appear in each cell? If each entry is the sum of all bonuses associated with each name, why is there more than one cell for each name? – Limey Jun 20 '20 at 09:39
  • As an aside, I seriously suggest you read this post about [tidy](https://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html) data and adopt the practices it recommends. – Limey Jun 20 '20 at 09:40
  • @Limey, thanks for your suggestions. However I thought I explained in the question that what I want is to merge criteria by summation of the values in the repeated criteria. For example if you look at the picture, you can see that Name 4 has a value of 20 for Bonus.4. When you look at the original data, there are three entries for Bonus.4 for Name 4. These are 4, 7 and 9, the summation of these three equal to 20, the desired output. As I said, when manually doing it line by line, selecting values for i and k, i can fill the Output table. Hope this makes it clearer. Cheers – James.H Jun 20 '20 at 09:56
  • I don't know what your code does but in the `for` loop you probably meant `for(i in 1:length(Rows))` and not `for(i in length(Rows))` .Same for inner loop with `k`. – Ronak Shah Jun 20 '20 at 10:27
  • @Ronak Shah that's sorted it; I missed out the seq() command in the for loops. Should have been `for(i in seq(length(Rows))` How do I credit you with answering it? That was embarrassingly stupid of me. – James.H Jun 20 '20 at 10:35
  • You can add an answer yourself with the code that worked for you. – Ronak Shah Jun 20 '20 at 10:40

2 Answers2

1

I'm not sure where "Initial.Value" comes from, but this gets you everything else.

library(tidyverse)

data_file %>% 
  group_by(Name) %>% 
  summarise(across(starts_with("Bonus"), sum), .groups="drop") %>% 
  pivot_longer(names_to="Index", cols=starts_with("Bonus")) %>% 
  pivot_wider(values_from="value", names_from="Name")
# A tibble: 4 x 5
  Index   `Name 1` `Name 2` `Name 3` `Name 4`
  <chr>      <dbl>    <dbl>    <dbl>    <dbl>
1 Bonus.1       20       19       28       24
2 Bonus.2       23       22       29        6
3 Bonus.3       17        9       18       10
4 Bonus.4       18       17       23       20

In response to OP's comment: with respect, I believe that the majority of the "complexity" to which they refer is caused by the fact that their data format is not "tidy". (See my earlier comment and link.) Working with tidy data, much of the complexity will disappear. The reason I claim that OP's data is not tidy is that there is relevant information in the column names: the type of payment ("Initial.Value" vs "Bonus") and the bonus index. This makes life more difficult than it needs to be. So, here's a possible solution, starting from OP's revised test data (including Initial.Payment) based on a potentially tidy dataset.

# Make the data tidy
tidyData <- data_file %>%
              pivot_longer(
                cols=c(starts_with("Bonus"), "Initial.Value"), 
                values_to="Value", 
                names_to="Source") 
tidyData %>% head(5)
# A tibble: 5 x 3
  Name   Source        Value
  <fct>  <chr>         <dbl>
1 Name 1 Bonus.1           4
2 Name 1 Bonus.2           4
3 Name 1 Bonus.3           3
4 Name 1 Bonus.4           1
5 Name 1 Initial.Value   142

Why do I claim this format is better than the original? Simply because it makes the code that follows completely independent of the number of bonuses, the types of payment ("Initial.Value", "Bonus.x", "Some other payment type", etc, etc) and the number of different names. I believe it is tidy in the context of the OP's example data, but it is not necessarily tidy in every context. It might, for example, be useful to separate Source into two or more columns, PaymentType and Index, for example. 'PaymentTypecould countainInitial.PaymentorBonusandIndexcould define theBonussuffix (and0, 1orNAforInitial.Payment` records). This would allow, for example, overall bonus to be easily calculated (again, independent of the number of bonus types).

So, now I have a tidy dataset, collating the required information is straightforward:

totalBonus <- tidyData %>% 
                group_by(Name, Source) %>% 
                summarise(Value=sum(Value), .groups="drop")

This dataset is still tidy, so it is optimal for further manipulation, but it is not necessarily optimal for presentation. But that's easy to fix. To provide OP's desired output:

totalBonus %>% 
  pivot_wider(names_from=Name, values_from=Value) %>% 
  arrange(desc(Source))
A tibble: 5 x 5
  Source        `Name 1` `Name 2` `Name 3` `Name 4`
  <chr>            <dbl>    <dbl>    <dbl>    <dbl>
1 Initial.Value     3736     1084     2458     2253
2 Bonus.4             18       17       23       20
3 Bonus.3             17        9       18       10
4 Bonus.2             23       22       29        6
5 Bonus.1             20       19       28       24
Limey
  • 10,234
  • 2
  • 12
  • 32
  • thanks for your response. I've edited the question again, I missed out the Initial values in the first data.frame() function. This will unfortunately not work for my actual dataset. What I have provided here is a simplified dataset, with simplified column names and row names; this is not how the actual data looks which was why I was trying it using "for loops" and indices. Because this dataset is so large and complex, it'll be easiest to use base R and write the code as I have done it. – James.H Jun 20 '20 at 10:21
  • @James.H - If you're not already familiar with the concept of tidy data, I encourage you to read Hadley Wickham's paper [Tidy Data](http://vita.had.co.nz/papers/tidy-data.pdf). As Limey pointed out in his/her answer, generally speaking it is a good idea to remove relevant data from column names in a data set. – Len Greski Jun 20 '20 at 11:12
  • @Len Greski That is exactly what my code is doing to my actual dataset; it is making it tidier. The code identifies the desired information from the column that has identifiers in it (Names in my example), identifies the indices associated with desired information and then uses those to extract the relevant information that I am interested in. The nested for loop then compiles it into a cleaner dataset. – James.H Jun 20 '20 at 11:21
0

The error in the code was syntax in the for loop.

for(i in length(Rows)){

Will initiate the for loop to work with an integer of length 1 and in the above case the function `length(Rows)' returns an integer of 4 (value 4, length 1). Therefore the loop only has one iteration and so only fills the final cell in the output table.

The loop should have been initiated as follows:

for(i in seq(length(Rows)){

The same goes for the nested loop, it should have been:

for(k in seq(length(Output_file[, 1]))){
James.H
  • 25
  • 6