0

I have two data frames. The first one has the client's id, name, and address. The second one has all of your transactions (values, date of purchase, cash or credit card ...).

str(data.frame_1)

Classes ‘data.table’ and 'data.frame':  201917 obs. of  5 variables:
 $ clie_id           : chr  "C_ID_97" "C_ID_3f" "C_ID_dd" "C_ID_11" ...
 $ address_1         : int  5 4 2 4 1 4 3 3 2 2 ...
 $ salary            : int  2 1 2 3 3 2 2 2 1 2 ...
 $ gender            : int  1 0 0 0 0 0 1 1 0 0 ...
 $ have_kids         : num  -0.82 0.393 0.688 0.142 -0.16 ...


str(data.frame_2)

 $ clie_id             : chr  "C_ID_00007093c1" "C_ID_00007093c1" "C_ID_00007093c1" "C_ID_00007093c1" ...
 $ city                : int  -1 -1 -1 -1 76 76 76 76 76 244 ...
 $ purchase_date       : Date, format: "2012-06-14" "2013-08-01" "2013-09-08" "2013-10-28" ...
 $ state               : int  -1 -1 -1 -1 2 2 2 2 2 2 ...
 $ sector              : int  8 8 8 8 33 33 33 33 1 34 ...
 $ category            : chr  "Y" "Y" "Y" "Y" ...
 $ purchase_amount     : num  -0.729 -0.709 -0.721 -0.672 -0.672 ...

Variables that I need to add in the date frame 1: oldest date, lower purchase value, higher purchase value, average value of purchases, quantity of purchases (in this case would be the number of lines of each id in the second data frame).

I tried to create a third date frame to then merge the columns of the first date frame with that of the third date frame using clie_id as reference. So I did this:

total_data_summarise_by_id <- data.frame_2 %>% 
                                  group_by(clie_id) %>%
                                  summarise(first_date = min(purchase_date),
                                            min_purchase_amount = min(purchase_amount),
                                            max_purchase_amount = max(purchase_amount),
                                            mean_purchase_amount = mean(purchase_amount))

However, the R returned only one answer line. He did not summarize for each id.

How can I do this join?

Ângelo
  • 149
  • 1
  • 13
  • In your sample data, none of the `clie_id` coincide, so this join would be empty. Is that your point? If not, could you please provide sample data that actually matches? Also, it would help immensely if your data was in a consumable format, such as the output from `dput(x)` where `x` is a representative sampling such that (in this case) the two frames have some ids in common. Please not the whole frame. – r2evans Feb 15 '19 at 21:28
  • The first data frame has the cli_id column as the primary key. The second data frame has this same column, in which the records repeat several, hence the need to summarize the data. I have edited the question with what I have been able to produce so far. – Ângelo Feb 15 '19 at 21:33
  • Perhaps there's a misunderstanding. Is `"C_ID_00007093c1"` the same person and id as `"C_ID_97"`? If so, then how is that to be inferred? If not, then how do you intend to match records in frame 2 with records in frame 1? It might help both of us if you include your expected output for the first few rows. (BTW: I still cannot easily copy your data as-is. `str` is useful for showing the basic structure of a frame, but I cannot copy it into my console and *use* it. See: https://stackoverflow.com/questions/5963269, https://stackoverflow.com/help/mcve, and https://stackoverflow.com/tags/r/info.) – r2evans Feb 15 '19 at 21:39
  • Assuming you know sql sintaxis, you can join tables using one of dplyr `*_join()` functions. But your example is not clear for me, because your summarise code only use data from the second data frame. For instance, you can read [Relational Data chapter on R for Data Science](https://r4ds.had.co.nz/relational-data.html) book. – gavg712 Feb 15 '19 at 22:16

1 Answers1

0

Success

total_data_summarise_by_id <- data.frame_2 %>% 
                                  group_by(clie_id) %>%
                                  summarise(first_date = min(purchase_date),
                                            min_purchase_amount = min(purchase_amount),
                                            max_purchase_amount = max(purchase_amount),
                                            mean_purchase_amount = mean(purchase_amount)),
                                            total = n())

Thanks a lot for the help

Ângelo
  • 149
  • 1
  • 13