1

I have a fairly large adjacency matrix and only want to keep the relationships that have had at least 5 transactions with each other. How would you do this? Would it make sense to assign 0 to all values less than 5 or is there a more sensible approach?

Should I then receive the new adjacency matrix, how can I then have the relationships output to me as a list in which each ID is output with the associated "partners".

Many thanks for your help :)!

Here is my code for the adjacency matrix so far:

dd <- head(newdata, 50000)
colnames(dd) <- c("MEMBER_ID","AUTHOR_ID")
x <- xtabs(~MEMBER_ID+AUTHOR_ID, dd)
mm <- crossprod(x,x)
mm[lower.tri(mm, TRUE)] <- NA

Here is a View() of the result in RStudio. enter image description here

Thats what I would like to have for each ID pair of my dataset. enter image description here

For completion, here is a reproducible sample of my original data SubsMain:

# > dput(head(SubsMAIN, 100))
structure(list(MEMBER_ID = c(199781, 199781, 199781, 199781, 
199781, 199781, 199781, 199781, 199781, 199781, 199781, 199781, 
199781, 199781, 199781, 199781, 199781, 199781, 199781, 199781, 
199781, 199781, 199781, 199781, 199781, 199781, 199781, 199781, 
199781, 199781, 199781, 199781, 199781, 199781, 199781, 199781, 
199781, 199781, 199781, 199781, 199781, 199781, 199781, 199781, 
199781, 199781, 199781, 199781, 199781, 199781, 199781, 199781, 
199781, 199781, 199781, 199781, 199781, 199781, 199781, 199781, 
199781, 199781, 199781, 199781, 199781, 199781, 199781, 199781, 
199781, 199781, 199781, 199781, 199781, 199781, 199781, 199781, 
199781, 199781, 199781, 199781, 199781, 199781, 199781, 199781, 
199781, 199781, 199781, 199781, 199781, 199781, 199781, 199781, 
199781, 199781, 199781, 199781, 199781, 199781, 199781, 199781
), RATING = c(5, 5, 5, 3, 5, 5, 4, 5, 3, 4, 5, 5, 5, 3, 4, 4, 
2, 5, 5, 5, 4, 5, 5, 5, 5, 4, 5, 3, 5, 4, 5, 4, 4, 3, 3, 2, 5, 
3, 5, 4, 5, 5, 5, 5, 5, 4, 5, 5, 5, 4, 5, 5, 4, 4, 5, 5, 5, 3, 
4, 4, 5, 5, 5, 5, 4, 5, 5, 5, 4, 5, 5, 5, 5, 5, 5, 5, 4, 4, 5, 
5, 4, 4, 5, 5, 4, 5, 3, 5, 3, 5, 5, 5, 2, 3, 5, 5, 3, 5, 4, 3
), AUTHOR_ID = c(258195, 201494, 409591, 1964674948, 284187, 
641414, 686042, 531975, 1892323204, 362579, 301950, 2988937092, 
205270, 353623, 657993, 2418118532, 590804, 222936, 216022, 2320404356, 
199862, 538993, 290046, 234885, 417532, 1705021316, 216430, 1320783748, 
301950, 2012450692, 3267006340, 321415, 213839, 1967230852, 519301, 
1880919940, 409850, 617204, 262004, 200165, 3267006340, 345500, 
1711443844, 290046, 238184, 241451, 452301, 301950, 205491, 212098, 
241578, 2367524740, 2366410628, 225252, 2988937092, 1789300612, 
1965068164, 432146, 2151190404, 1772130180, 290046, 203622, 210929, 
243427, 205705, 301950, 2551549828, 2250674052, 1378848644, 298157, 
1873186692, 526355, 231243, 2988937092, 241578, 547653, 1301319556, 
1956417412, 292382, 2571341700, 421709, 2309066628, 256232, 214201, 
447962, 278848, 2533396356, 328874, 1955106692, 262822, 1568706436, 
458913, 217003, 583640, 307259, 199780, 1836027780, 235786, 2366279556, 
358714), STATUS = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), CREATION = c("2001/01/10", 
"2001/01/10", "2001/01/10", "2001/01/10", "2001/01/10", "2001/01/10", 
"2001/01/10", "2001/01/10", "2001/01/10", "2001/01/10", "2001/01/10", 
"2001/01/10", "2001/01/10", "2001/01/10", "2001/01/10", "2001/01/10", 
"2001/01/10", "2001/01/10", "2001/01/10", "2001/01/10", "2001/01/10", 
"2001/01/10", "2001/01/10", "2001/01/10", "2001/01/10", "2001/01/10", 
"2001/01/10", "2001/01/10", "2001/01/10", "2001/01/10", "2001/01/10", 
"2001/01/10", "2001/01/10", "2001/01/10", "2001/01/10", "2001/01/10", 
"2001/01/10", "2001/01/10", "2001/01/10", "2001/01/10", "2001/01/10", 
"2001/01/10", "2001/01/10", "2001/01/10", "2001/01/10", "2001/01/10", 
"2001/01/10", "2001/01/10", "2001/01/10", "2001/01/10", "2001/01/10", 
"2001/01/10", "2001/01/10", "2001/01/10", "2001/01/10", "2001/01/10", 
"2001/01/10", "2001/01/10", "2001/01/10", "2001/01/10", "2001/01/10", 
"2001/01/10", "2001/01/10", "2001/01/10", "2001/01/10", "2001/01/10", 
"2001/01/10", "2001/01/10", "2001/01/10", "2001/01/10", "2001/01/10", 
"2001/01/10", "2001/01/10", "2001/01/10", "2001/01/10", "2001/01/10", 
"2001/01/10", "2001/01/10", "2001/01/10", "2001/01/10", "2001/01/10", 
"2001/01/10", "2001/01/10", "2001/01/10", "2001/01/10", "2001/01/10", 
"2001/01/10", "2001/01/10", "2001/01/10", "2001/01/10", "2001/01/10", 
"2001/01/10", "2001/01/10", "2001/01/10", "2001/01/10", "2001/01/10", 
"2001/01/10", "2001/01/10", "2001/01/10", "2001/01/10"), LAST_MODIFIED = c("2001/03/24", 
"2001/08/25", "2002/12/02", "2001/03/29", "2002/03/22", "2002/04/22", 
"2001/01/22", "2001/11/15", "2001/04/10", "2001/03/24", "2001/04/03", 
"2001/10/11", "2001/05/08", "2001/03/07", "2002/01/26", "2002/03/10", 
"2001/03/24", "2001/03/25", "2001/01/28", "2001/09/06", "2001/05/22", 
"2001/05/03", "2001/01/18", "2001/10/26", "2002/01/09", "2001/08/21", 
"2001/02/09", "2001/03/14", "2002/03/22", "2001/03/19", "2001/02/10", 
"2001/01/19", "2001/02/09", "2001/09/28", "2001/01/19", "2001/01/31", 
"2001/03/19", "2001/01/31", "2001/02/09", "2001/03/07", "2001/08/10", 
"2001/09/29", "2001/07/31", "2001/06/20", "2001/07/03", "2001/09/12", 
"2001/03/30", "2002/05/07", "2002/08/10", "2002/02/23", "2001/09/06", 
"2001/03/19", "2001/10/30", "2001/01/29", "2001/04/28", "2001/11/17", 
"2002/02/23", "2001/03/15", "2001/10/28", "2001/01/31", "2001/06/12", 
"2003/08/06", "2002/01/09", "2001/08/30", "2001/12/22", "2001/08/21", 
"2001/04/16", "2001/11/15", "2002/05/03", "2001/03/15", "2001/08/29", 
"2001/09/12", "2001/11/17", "2001/10/04", "2001/08/20", "2001/08/21", 
"2001/11/17", "2003/08/06", "2001/04/03", "2001/07/22", "2001/02/11", 
"2001/09/12", "2001/07/03", "2001/05/11", "2002/01/09", "2001/03/05", 
"2001/07/10", "2003/06/25", "2001/02/18", "2001/03/27", "2001/06/06", 
"2002/08/11", "2001/04/27", "2001/02/18", "2001/08/22", "2002/02/23", 
"2001/10/30", "2001/07/03", "2001/06/04", "2003/04/28")), row.names = c(NA, 
100L), class = "data.frame")
Greg
  • 3,054
  • 6
  • 27
Sebastian
  • 15
  • 5
  • So each row in `dd` represents a transaction? – Greg Jul 28 '21 at 14:59
  • Yes. From Member ID to Author ID and Vice Versa – Sebastian Jul 28 '21 at 15:00
  • Is every ID represented in each column? Or are (say) some authors never members? – Greg Jul 28 '21 at 15:01
  • There are some members who are never authors and vice versa – Sebastian Jul 28 '21 at 15:04
  • So would you want your output list to have an element for each member, where that element is itself a list of all authors associated with that member? Or should that list also have an element for every ID of any sort, where a member element is a list of authors, and an author element is a list of members? And would you want an element for even those members without authors (and vice versa), where the element is an empty list? – Greg Jul 28 '21 at 15:07
  • In my original dataset, all transactions between member and author are timestamped. I now want to compile all interactions in a list for each author member pair, but only if the two had more than 5 or 10 interactions with each other. At the interaction the user can be member and/or author. – Sebastian Jul 28 '21 at 15:13
  • So you want `x <- 5` to be your threshold, which you can change at your leisure. Hmmmm... Try using `library(data.table)` and then `as.data.table(dd)[, .N, by = .(MEMBER_ID, AUTHOR_ID)][N > x]`, then tell me what you want from there. – Greg Jul 28 '21 at 15:17
  • The result should be a `data.table` with a row for every pair of `MEMBER_ID` and `AUTHOR_ID`, alongside the tally of transactions for that pair; excluding those with tallies less than `5`, the current value for `x`. Let me know where you'd want to go from there. – Greg Jul 28 '21 at 15:27
  • Woohooo Greg - you rock man!!! I got the data table. I want to go from there to make a subset of my SubMAIN containing just those interactions pairs from my new data.table. to make my main dataset much smaller. Thank you soo much !!! – Sebastian Jul 28 '21 at 16:43
  • Sorry, what is "SubMAIN" again? – Greg Jul 28 '21 at 17:17
  • Sorry man. I added SubMAIN as image in the top of my post. – Sebastian Jul 28 '21 at 17:24
  • So what you want to do is take `SubMain`, which is a dataset of individual transactions, and filter out all transactions belonging to pairings (of `MEMBER_ID` and `AUTHOR_ID`) that have fewer than `x` transactions? – Greg Jul 28 '21 at 17:27
  • Yeah thats right. – Sebastian Jul 28 '21 at 17:35
  • Hey Greg, I now want to evaluate whether there is a structural break in the rating behavior. For example, if user A always gives user B 5 out of 5 points and user B then gives user A one out of 5 points, would user A then adjust his rating behavior and rate user B worse in the future or not. I have now often read that the Chow test requires a known time to be validly executed. Unfortunately, I cannot determine this for each interaction pair. Do you have any ideas for this purpose? – Sebastian Nov 01 '21 at 13:01
  • Hi @Sebastian! Your original question deals with *wrangling*: reshaping the data and calculating new attributes within it. Your new idea about (Chow test, etc.) deals with _**modeling**_, so it might be better posted as a *new question*. **At any rate, if neither the `CREATION` nor the `LAST_MODIFIED` date will suffice as your "known date", then the problem is with the _original data source_, which might lack the granularity you need.** However, given the `SubsMAIN` dataset in your question, I think I've successfully addressed your wrangling question as thoroughly as possible. – Greg Nov 01 '21 at 14:09
  • I see you've already [posted a question](https://stackoverflow.com/q/69536867) about the Chow test, back on October 12th. Unless your data source is deficient, my answer should give you the `results` dataset you need for your analysis. So try posting a new question, and include a reproducible version (via `dput()`) of `results`. *P.S. I suspect the `LAST_MODIFIED` column is your best bet for a "known date".* – Greg Nov 01 '21 at 14:24
  • Hey @Greg, thank you so much for your quick answer. I generated an new question, regarding the modeling problem, to identify structural breaks in rating behavior. As you know I was able to generate tons of Interactions between two Interactions pairs. I now want to check each interaction part, if there is a structural break in rating behavior. As an example, I want to determine when a user A gives always 5/5 stars rating to user B and for example user B gives 1/5 stars rating back to user A, if user A will give in future transactions further 5/5 points to user B or if the rating behavior reject – Sebastian Nov 01 '21 at 15:17
  • Gotcha, that's a pretty interesting idea! My experience is better suited to data wrangling than to (say) the Chow test, so you might have to rely on others' expertise for the new answer. In the meantime, would you mind going to [my answer](https://stackoverflow.com/a/68565242) and [marking it as accepted](https://meta.stackexchange.com/a/5235)? – Greg Nov 01 '21 at 15:30
  • Hey Greg, if would have one further question for you. Is it possible that we split up the dating by Last_Modified Date? This means that I would like to have the Last_modified Date on the left side, and two columns beside which represent the rating according to the date. If a user did not rate on a Date I would like to fill the column by an NA Value. How is this possible ? (According two the second picture above). Thank you so much for your help :) – Sebastian Nov 07 '21 at 16:50
  • I already tried, dcast but it just takes the unique dates and the according number of ratings, but not the real rating applied?! – Sebastian Nov 07 '21 at 17:23

1 Answers1

0

Using the superior performance of data.table, we can avoid altogether the conversions to and from an adjacency matrix.

Given a SubsMAIN dataset like the one reproduced here

structure(list(MEMBER_ID = c(199781, 199781, 199781, 199781, 199781, 199781,
                             199781, 199781, 199781, 199781, 301950, 199781,
                             199781, 199781, 199781, 199781, 199781, 199781,
                             199781, 199781, 199781, 199781, 199781, 199781,
                             199781, 199781, 199781, 199781, 199781, 199781,
                             199781, 199781, 199781, 199781, 199781, 199781,
                             199781, 199781, 199781, 199781, 199781, 199781,
                             199781, 199781, 199781, 199781, 199781, 199781,
                             199781, 199781, 199781, 199781, 199781, 199781,
                             199781, 199781, 199781, 199781, 199781, 199781,
                             199781, 199781, 199781, 199781, 199781, 199781,
                             199781, 199781, 199781, 199781, 199781, 199781,
                             199781, 199781, 199781, 199781, 199781, 199781,
                             199781, 199781, 199781, 199781, 199781, 199781,
                             199781, 199781, 199781, 199781, 199781, 199781,
                             199781, 199781, 199781, 199781, 199781, 199781,
                             199781, 199781, 199781, 199781),
               RATING = c(5, 5, 5, 3, 5, 5, 4, 5, 3, 4, 5, 5, 5, 3, 4, 4, 2, 5,
                          5, 5, 4, 5, 5, 5, 5, 4, 5, 3, 5, 4, 5, 4, 4, 3, 3, 2,
                          5, 3, 5, 4, 5, 5, 5, 5, 5, 4, 5, 5, 5, 4, 5, 5, 4, 4,
                          5, 5, 5, 3, 4, 4, 5, 5, 5, 5, 4, 5, 5, 5, 4, 5, 5, 5,
                          5, 5, 5, 5, 4, 4, 5, 5, 4, 4, 5, 5, 4, 5, 3, 5, 3, 5,
                          5, 5, 2, 3, 5, 5, 3, 5, 4, 3),
               AUTHOR_ID = c(258195, 201494, 409591, 1964674948, 284187, 641414,
                             686042, 531975, 1892323204, 362579, 199781,
                             2988937092, 205270, 353623, 657993, 2418118532,
                             590804, 222936, 216022, 2320404356, 199862, 538993,
                             290046, 234885, 417532, 1705021316, 216430,
                             1320783748, 301950, 2012450692, 3267006340, 321415,
                             213839, 1967230852, 519301, 1880919940, 409850,
                             617204, 262004, 200165, 3267006340, 345500,
                             1711443844, 290046, 238184, 241451, 452301, 301950,
                             205491, 212098, 241578, 2367524740, 2366410628,
                             225252, 2988937092, 1789300612, 1965068164, 432146,
                             2151190404, 1772130180, 290046, 203622, 210929,
                             243427, 205705, 301950, 2551549828, 2250674052,
                             1378848644, 298157, 1873186692, 526355, 231243,
                             2988937092, 241578, 547653, 1301319556, 1956417412,
                             292382, 2571341700, 421709, 2309066628, 256232,
                             214201, 447962, 278848, 2533396356, 328874,
                             1955106692, 262822, 1568706436, 458913, 217003,
                             583640, 307259, 199780, 1836027780, 235786,
                             2366279556, 358714),
               STATUS = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L,
                          0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
                          0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
                          0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L,
                          0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
                          0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
                          0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
                          0L, 0L),
               CREATION = c("2001/01/10", "2001/01/10", "2001/01/10",
                            "2001/01/10", "2001/01/10", "2001/01/10",
                            "2001/01/10", "2001/01/10", "2001/01/10",
                            "2001/01/10", "2001/01/10", "2001/01/10",
                            "2001/01/10", "2001/01/10", "2001/01/10",
                            "2001/01/10", "2001/01/10", "2001/01/10",
                            "2001/01/10", "2001/01/10", "2001/01/10",
                            "2001/01/10", "2001/01/10", "2001/01/10",
                            "2001/01/10", "2001/01/10", "2001/01/10",
                            "2001/01/10", "2001/01/10", "2001/01/10",
                            "2001/01/10", "2001/01/10", "2001/01/10",
                            "2001/01/10", "2001/01/10", "2001/01/10",
                            "2001/01/10", "2001/01/10", "2001/01/10",
                            "2001/01/10", "2001/01/10", "2001/01/10",
                            "2001/01/10", "2001/01/10", "2001/01/10",
                            "2001/01/10", "2001/01/10", "2001/01/10",
                            "2001/01/10", "2001/01/10", "2001/01/10",
                            "2001/01/10", "2001/01/10", "2001/01/10",
                            "2001/01/10", "2001/01/10", "2001/01/10",
                            "2001/01/10", "2001/01/10", "2001/01/10",
                            "2001/01/10", "2001/01/10", "2001/01/10",
                            "2001/01/10", "2001/01/10", "2001/01/10",
                            "2001/01/10", "2001/01/10", "2001/01/10",
                            "2001/01/10", "2001/01/10", "2001/01/10",
                            "2001/01/10", "2001/01/10", "2001/01/10",
                            "2001/01/10", "2001/01/10", "2001/01/10",
                            "2001/01/10", "2001/01/10", "2001/01/10",
                            "2001/01/10", "2001/01/10", "2001/01/10",
                            "2001/01/10", "2001/01/10", "2001/01/10",
                            "2001/01/10", "2001/01/10", "2001/01/10",
                            "2001/01/10", "2001/01/10", "2001/01/10",
                            "2001/01/10", "2001/01/10", "2001/01/10",
                            "2001/01/10", "2001/01/10", "2001/01/10",
                            "2001/01/10"),
               LAST_MODIFIED = c("2001/03/24", "2001/08/25", "2002/12/02",
                                 "2001/03/29", "2002/03/22", "2002/04/22",
                                 "2001/01/22", "2001/11/15", "2001/04/10",
                                 "2001/03/24", "2001/04/03", "2001/10/11",
                                 "2001/05/08", "2001/03/07", "2002/01/26",
                                 "2002/03/10", "2001/03/24", "2001/03/25",
                                 "2001/01/28", "2001/09/06", "2001/05/22",
                                 "2001/05/03", "2001/01/18", "2001/10/26",
                                 "2002/01/09", "2001/08/21", "2001/02/09",
                                 "2001/03/14", "2002/03/22", "2001/03/19",
                                 "2001/02/10", "2001/01/19", "2001/02/09",
                                 "2001/09/28", "2001/01/19", "2001/01/31",
                                 "2001/03/19", "2001/01/31", "2001/02/09",
                                 "2001/03/07", "2001/08/10", "2001/09/29",
                                 "2001/07/31", "2001/06/20", "2001/07/03",
                                 "2001/09/12", "2001/03/30", "2002/05/07",
                                 "2002/08/10", "2002/02/23", "2001/09/06",
                                 "2001/03/19", "2001/10/30", "2001/01/29",
                                 "2001/04/28", "2001/11/17", "2002/02/23",
                                 "2001/03/15", "2001/10/28", "2001/01/31",
                                 "2001/06/12", "2003/08/06", "2002/01/09",
                                 "2001/08/30", "2001/12/22", "2001/08/21",
                                 "2001/04/16", "2001/11/15", "2002/05/03",
                                 "2001/03/15", "2001/08/29", "2001/09/12",
                                 "2001/11/17", "2001/10/04", "2001/08/20",
                                 "2001/08/21", "2001/11/17", "2003/08/06",
                                 "2001/04/03", "2001/07/22", "2001/02/11",
                                 "2001/09/12", "2001/07/03", "2001/05/11",
                                 "2002/01/09", "2001/03/05", "2001/07/10",
                                 "2003/06/25", "2001/02/18", "2001/03/27",
                                 "2001/06/06", "2002/08/11", "2001/04/27",
                                 "2001/02/18", "2001/08/22", "2002/02/23",
                                 "2001/10/30", "2001/07/03", "2001/06/04",
                                 "2003/04/28")),
          row.names = c(NA, 100L),
          class = "data.frame")

the following data.table solution

library(data.table)


# ...
# Code to generate your dataset 'SubsMAIN'.
# ...


# Set your cutoff for the minimum number of transactions.
x <- 3

# Filter 'SubsMAIN' to only those transactions for pairings that meet the cutoff.
results <- as.data.table(SubsMAIN)[
  # Mark each transaction with a new ID for its pairing of 'MEMBER_ID' with
  # 'AUTHOR_ID'.
  , Pair_ID := .GRP,
    # To make the relationship symmetric, pair by the MAX and MIN of the two
    # original IDs, rather than by their column order.
    by = .(pmax(MEMBER_ID, AUTHOR_ID), pmin(MEMBER_ID, AUTHOR_ID))][
  # Mark each transaction with the tally of all transactions for its pair.
  , Tally := .N, by = Pair_ID][
    # Include only those transactions whose tallies meet the cutoff.
    Tally >= x,
    # Exclude the 'Tally' column, so the header is exactly like 'SubsMAIN'.
    -c("Tally")]


# View results.
results

should yield results like this

    MEMBER_ID RATING  AUTHOR_ID STATUS   CREATION LAST_MODIFIED Pair_ID
 1:    301950      5     199781      0 2001/01/10    2001/04/03      11
 2:    199781      5 2988937092      1 2001/01/10    2001/10/11      12
 3:    199781      5     290046      0 2001/01/10    2001/01/18      23
 4:    199781      5     301950      0 2001/01/10    2002/03/22      11
 5:    199781      5     290046      0 2001/01/10    2001/06/20      23
 6:    199781      5     301950      0 2001/01/10    2002/05/07      11
 7:    199781      5 2988937092      1 2001/01/10    2001/04/28      12
 8:    199781      5     290046      0 2001/01/10    2001/06/12      23
 9:    199781      5     301950      0 2001/01/10    2001/08/21      11
10:    199781      5 2988937092      0 2001/01/10    2001/10/04      12

where every transaction from SubsMAIN is preserved, so long as it belongs to a pairing (Pair_ID) of MEMBER_ID and AUTHOR_ID having at least x transactions.

Note

For reference, here are the tallies that would be in the Tally column:

    MEMBER_ID RATING  AUTHOR_ID STATUS   CREATION LAST_MODIFIED Pair_ID   # Tally
 1:    301950      5     199781      0 2001/01/10    2001/04/03      11   #     4
 2:    199781      5 2988937092      1 2001/01/10    2001/10/11      12   #     3
 3:    199781      5     290046      0 2001/01/10    2001/01/18      23   #     3
 4:    199781      5     301950      0 2001/01/10    2002/03/22      11   #     4
 5:    199781      5     290046      0 2001/01/10    2001/06/20      23   #     3
 6:    199781      5     301950      0 2001/01/10    2002/05/07      11   #     4
 7:    199781      5 2988937092      1 2001/01/10    2001/04/28      12   #     3
 8:    199781      5     290046      0 2001/01/10    2001/06/12      23   #     3
 9:    199781      5     301950      0 2001/01/10    2001/08/21      11   #     4
10:    199781      5 2988937092      0 2001/01/10    2001/10/04      12   #     3

See how that 1st row, with a MEMBER_ID of 301950 and an AUTHOR_ID of 199781, is tallied together with the 4th, 6th, and 9th rows; which each have the reverse: a MEMBER_ID of 199781 and an AUTHOR_ID of 301950. That is, our Pair_ID (here 11) has preserved the symmetry requested here.

Now because the sample SubsMAIN had no pairings that tallied 5 (or more) transactions, I lowered the cutoff to x <- 3. That way, at least some transactions could make the cut, and there would be some output to display.

For your full dataset, free to change the cutoff to x <- 5, or whatever value you want.

Greg
  • 3,054
  • 6
  • 27