1

I am trying to create a new variable in a dataset (+/- 10K rows) by iterating over another data frame based on two variables in the dataset. Somehow I can't figure it out without using merge (just starting to learn R) maybe someone can guide me in the right direction.

Snapshot from the dataset - dat

ContractNumber | PDRating | PD_month 
---
1 | 7+ | 1      
---
1 | 7+ | 2
---
1 | 7+ | 3
---
2 | 6+ | 1
---
2 | 6+ | 2
---
and so on.....

So each contract has a PDRating (which stands for Probability of Default rating) and a PDMonth which represents the month the contract is in. So if a contract has a length of 60 months there are 60 records for the contract. (btw this is my first question on StackOverflow and I haven't figured out yet how I can format a nice looking table)

The data frame (PD_Table) I want to iterate over consists of 180 rows, representing months and in de header the PDRatings as column names. Which looks like this:

PD_month | 5- | 6+ | 6 | 6- | 7+ | ...
---
1 | 0.0001 | 0.0002 | 0.0004 | 0.0005 | ... 
---
2 | 0.0001 | 0.0002 | 0.0004 | 0.0005 | ... 
---
3 | 0.0001 | 0.0002 | 0.0004 | 0.0005 | ... 
---
4 | 0.0001 | 0.0002 | 0.0004 | 0.0005 | ... 
---
5 | 0.0001 | 0.0002 | 0.0004 | 0.0005 | ...  
---
and so on.....

The new variable I am trying to create is dat$PD and the new "dat" should look like this:

ContractNumber | PDRating | PD_month | PD
---
1 | 7+ | 1 | 0.0005
---
1 | 7+ | 2 | 0.0005
---
1 | 7+ | 3 | 0.0005
---
2 | 6+ | 1 | 0.0002
---
2 | 6+ | 2 | 0.0002
---
and so on.....

Right now I have done it by:

 # convert PD_Table to wide format

 melt(PD_Table, id.vars = "PD_month") 
 %>% rename(c("variable" = "PDRating", "value" = "PD"))

 # Merge datasets and sort

 arrange(merge(dat, PD_Table, by = c("PDRating", "PD_month"), 
              all.x = TRUE, sort = FALSE), ContractNumber)

Well this works but feels a bit clumsy and I have to sort the rows and reorder the columns again. So basically I am looking for a smarter solution

Amongst the things I tried, which didn't work, were the following:

dat$PD <- PD_Table[dat$PD_month, dat$PDRating]
# Returns a vector with all values for a rating as PD value]
# Note PD_Table was still in wide format

dat$PD <- for (i in nrow(dat)) PD_Table[dat$PD_month[i], dat$PDRating[i]]
# Does not returns anything

dat$PD <- for (i in nrow(dat3)) PD_Table[dat$PD_month[i],  which(dat3$PDRating[i] == colnames(PD_Table))]
# Does not returns anything

Any help much appreciated!

Kind regards, Matthies

  • could you provide a sample of your data by using `dput()` – loki Sep 15 '16 at 12:26
  • Melting and merging is definitely "smarter" than iterating in a for loop! – m-dz Sep 15 '16 at 12:27
  • This question looks pretty similar to [this one](http://stackoverflow.com/questions/39235882/how-can-i-reference-a-list-based-on-a-variable-within-a-data-frame/39236060). I'd recommend using matrix indexing as in my answer there. – lmo Sep 15 '16 at 12:32

1 Answers1

0

Does this work for you?

I have loaded data similar to the two tables you have shown from csv.

library(data.table)
PDR <- as.data.table(read.csv("PDRatings.csv"));
PDT <- as.data.table(read.csv("PDTable.csv"));

Melt the PD_table (here, PDT) to long form.

PDTm <- melt.data.table(PDT, id.vars=c("PD_month"), 
                        variable.name = "PDRating",
                        value.name = "PD")

Some ghastly clean-up of nomenclature.

PDTm[, PDRating := sub("X", "", PDRating)]      # X6    becomes 6
PDTm[, PDRating := sub("5.", "5+", PDRating)]   # 5.    becomes 5+
PDTm[, PDRating := sub("6.$", "6-", PDRating)]  # 6.    becomes 6-
PDTm[, PDRating := sub("7.$", "7-", PDRating)]  # 7.    becomes 7-
PDTm[, PDRating := sub("6..1", "6+", PDRating)] # 6..1  becomes 6+
PDTm

Lastly, merge data tables.

setkey(PDR, PDRating, PD_month)
setkey(PDTm, PDRating, PD_month)

merge(PDR, PDTm, all.x=TRUE)

What do you think? I got output as expected with the data, which is:

   PDRating PD_month ContractNumber    PD
1:       6+        1              2 5e-04
2:       6+        2              2 5e-04
3:       7+        1              1    NA
4:       7+        2              1    NA
5:       7+        3              1    NA
Sun Bee
  • 1,595
  • 15
  • 22
  • Hi Sun Bee, many thanks for the effort but this still seems to me a bit to much code for a relatively (at least conceptually :)) simple task. I was looking for a vectorized solution but for now I will just stick with the merging of datasets. Cheers, Matthies – Matthies Haentjens Sep 16 '16 at 12:18