0

I have built a data.table dataframe dt1 like:

State Bounced Opened Delivered
FL    2       1      0
NY    0       1      1

and I have another data.table dt2 with the total values:

State Total
NY    2
FL    3   

I'd like to find the percent frequency for each column in the first data table:

State Bounced Opened Delivered
FL    66.66   33.33  0
NY    0       50     50

If I simply * 100 and divide all the first data table by the Total column of the second data table, I have the result but the first column with State names became NA, which is not good.

If I do like:

dt1[, percen:=Bounced/dt2$Total]

it works but I have to repeat the process for all the columns. Is there a way to replicate to all the columns but the first one?

Second question: what is the best way to check that the Total value used is the right one - i.e. belongs to the right State? What I have done so far is to just order the State column alphabetically, but I am wondering if data.table offers a way to control this while doing the percentages...

smci
  • 32,567
  • 20
  • 113
  • 146
user299791
  • 2,021
  • 3
  • 31
  • 57
  • This is called a `join` operation between df1, df2. In this case it doesn't matter if we do left or right join and inner or outer join, since both dfs State is fully populated. Your second question is redundant, since if you do a `join` it will get the order right across tables. – smci Oct 21 '16 at 10:29
  • thanks for the pointer, you suggest to use `merge` ? – user299791 Oct 21 '16 at 10:45
  • Well if they're plain dataframes, yes join with [`merge`](http://stackoverflow.com/questions/1299871/how-to-join-merge-data-frames-inner-outer-left-right). If they're data-tables, use [DT's join syntax](https://rawgit.com/wiki/Rdatatable/data.table/vignettes/datatable-intro.html) – smci Oct 21 '16 at 10:51

2 Answers2

2

This is a join operation, so use data.table join syntax, it's a one-liner:

dt1 <- data.table(State=c('FL','NY'), Bounced=c(2,0),
                  Opened=c(1,1), Delivered=c(0,1), key='State')
dt2 <- data.table(State=c('NY','FL'), Total=c(2,3), key='State')

colsToDivide = c('Bounced','Opened','Delivered')

dt1[dt2, (colsToDivide) :=
  lapply(mget(colsToDivide), function(x) {100*x/Total} ) ]

#   State  Bounced   Opened Delivered
#1:    FL 66.66667 33.33333         0
#2:    NY  0.00000 50.00000        50
smci
  • 32,567
  • 20
  • 113
  • 146
  • thanks, albeit not using join, I have explicitly stated in the question a solution to achieve this but asked for a solution that can manage all the columns without naming each of them explicitly (I have many more)... is there a data.table syntax I can refer to? – user299791 Oct 21 '16 at 12:08
  • something along the line of `df1[df2, names(df1)[-1] := WHATHERE? ]` I guess – user299791 Oct 21 '16 at 12:11
  • 1
    Yes, it can be done, with some use of `lapply`, I'm trying to get it. See [DT divide list of columns by a second list of columns](https://stackoverflow.com/questions/37802687/r-data-table-divide-list-of-columns-by-a-second-list-of-columns), which is harder. – smci Oct 21 '16 at 12:14
  • You can use `mget` to grab a list of columns by name. – Frank Oct 21 '16 at 12:24
  • 2
    ...and parameterized all the var names we want divided... I looked at [Dynamically build call for lookup multiple columns](https://stackoverflow.com/questions/30468455/dynamically-build-call-for-lookup-multiple-columns/) and use `mget` on the RHS. It is quite ugly though. Phew! – smci Oct 21 '16 at 12:41
0

I am not very good with data.table , however, I think this would work.

cbind(State = df1$State, 
      df1[, -1, with = FALSE]/df2$Total[match(df1$State, df2$State)]* 100)

#   State  Bounced   Opened  Delivered
#1:    FL  66.667    33.333         0
#2:    NY  0.000     50.000        50

Considering df1 and df2 are data.table.

This would also answer your second question as it matches the State in df1 with that of df2.

Ronak Shah
  • 377,200
  • 20
  • 156
  • 213
  • thanks, maybe my question wasn't much clear about that, but I want to retain the State column in the final result – user299791 Oct 21 '16 at 10:18
  • Really you should use `merge()`, this is a join. `merge` is more performant and explicitly handles the cases where some columns are missing/ don't match. – smci Oct 21 '16 at 11:15
  • @smci you mean to say `merge(df1, df2, by = "State")` and then take the `Total` value and divide it throughout ?. Is there a way it can be done in the merge operation itself ? – Ronak Shah Oct 21 '16 at 11:20
  • 1
    Since they're actually data.tables, use [DT's join syntax](https://rawgit.com/wiki/Rdatatable/data.table/vignettes/datatable-intro.html). An inner join is simply `dt1[dt2]`. Read about it [here](http://stackoverflow.com/a/9652931/202229) – smci Oct 21 '16 at 11:33
  • ok I have converted all to data.table and set the key, so the join is easy now, but how to change the code in the answer to obtain the same result? – user299791 Oct 21 '16 at 12:04
  • Ohh..I see. I didn't know this. – Ronak Shah Oct 21 '16 at 12:45