Reshape 3 Column Data with ID

Question

I'm trying to create a directed network graph in R. To do this I need to build a matrix showing which nodes are connected.

SOURCE_SUBREDDIT    TARGET_SUBREDDIT  LINK_SENTIMENT
rddtgaming          rddtrust           1
xboxone             battlefield_4      1
ps4                 battlefield_4      1
fitnesscirclejerk   leangains          1
fitnesscirclejerk   lifeprotips        1
cancer              fuckcancer         1
jleague             soccer             1
bestoftldr          tifu               1
quityourbullshit    pics               1
bestof              confession         1
anarchychess        funny              1
internet_box        ama                1
fitnesscirclejerk   nofap              1
ffxiv               ffxivapp           1
switcharoo          funny              1
bitcoinmining       bitcoin            1
subredditdrama      nfl               -1
rddtgaming          rddtrust          -1

As you can see above, the first and last pairs have the same subreddits. The data shows the directional relationships between subreddits, which is why some pairs appear more than once.

Please see the photo for what I want the output to look like:

My code so far:

#reading in csv file
mydata <- read.csv(file="C:/Users/bmpmap/Documents/School/Netowrk Analysis/Connections List.csv", header=TRUE, sep=",")

colnames(mydata)
#SOURCE_SUBREDDIT TARGET_SUBREDDIT LINK_SENTIMENT


#install.packages("splitstackshape")
library(splitstackshape)
mydata_id = getanID(mydata , c("SOURCE_SUBREDDIT", "TARGET_SUBREDDIT", "LINK_SENTIMENT"))

colnames(mydata_id)

#reshaping data

I create an ID variable in the code above. I think I should be using it to uniquely identify the pairs.

Possible duplicate of [How to reshape data from long to wide format](https://stackoverflow.com/questions/5890584/how-to-reshape-data-from-long-to-wide-format) — divibisan, Apr 22 '19 at 17:43
I added a comment to clarify. This post is not a duplicate because I'm trying to deal with pairs that are not unique — Miranda, Apr 22 '19 at 17:58
It's hard to help since you show us so little about what you have and what you want to do. For example, you mention an ID variable, but don't show it. Could you add a [mcve] of your data and desired output that fully represents what you're trying to do? — divibisan, Apr 22 '19 at 18:08
I create the unique ID variable in the following line of code: mydata_id = getanID(mydata , c("SOURCE_SUBREDDIT", "TARGET_SUBREDDIT", "LINK_SENTIMENT")) — Miranda, Apr 22 '19 at 18:10
That doesn't help us understand because even if we installed `splitstackshape` and ran that line, we don't have your data so we can't see what it does to your data. Can you show an example of `mydata_id` and what your desired output would be? — divibisan, Apr 22 '19 at 18:12
Again, your edits _still_ don't show the ID variable which is key to your problem. It still looks like a simple case of reshaping from Long-to-Wide which you could do with: `tidyr::spread(mydata, key = TARGET_SUBREDDIT, value = LINK_SENTIMENT, fill = 0)` — divibisan, Apr 22 '19 at 18:43
ID is not an input. I stated that in my post. It is a created variable to help uniquely identify pairs. There are multiple pairs with the same two subreddits, which is why I need an ID. When I use reshape without an ID I get an error that there are multiple pairs with the same names, so the first value is taken. — Miranda, Apr 22 '19 at 18:49
I'm not sure why you're being so resistant to providing a [mcve] of your data. Based on everything you've said, I can't see any reason why `tidyr::spread` wouldn't do exactly what you want to do. Maybe if I really dug into your problem I could figure out the wrinkle here, if there is any, but when you're asking someone to help you out for free, it's generally polite to try to make it easier for them. — divibisan, Apr 22 '19 at 19:03
I'm not sure I understand the issue... I linked the full data set below and included a sample of the data in text post above. I don't understand how that doesn't fulfill the requirements. Please explain to me how I can fix the issue. — Miranda, Apr 22 '19 at 19:08
Can you just include a sample of `mydata_id`, since that's the data frame we're trying to reshape? — divibisan, Apr 22 '19 at 19:15
Is the issue that you want the `fitnesscirclejerk/leangains` and `fitnesscirclejerk/lifeprotips` pairs to be on separate rows? So there's only ever 1 non-zero value per row? Is there also always only 1 non-zero value per column? — divibisan, Apr 22 '19 at 19:17

score 1 · Accepted Answer · answered Apr 22 '19 at 21:21

I'll take a stab at this. To start, we'll make a reproducible dataset from the example data you posted:

df <- structure(list(SOURCE_SUBREDDIT = c("rddtgaming", "xboxone", 
"ps4", "fitnesscirclejerk", "fitnesscirclejerk", "fitnesscirclejerk", 
"cancer", "jleague", "bestoftldr", "quityourbullshit"), TARGET_SUBREDDIT = c("rddtrust", 
"battlefield_4", "battlefield_4", "leangains", "lifeprotips", 
"leangains", "fuckcancer", "soccer", "tifu", "pics"), LINK_SENTIMENT = c(1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L)), row.names = c(NA, 10L), class = "data.frame")

Note that fitnesscirclejerk is associated with leangains twice, which is a feature you mentioned occurs in your data:

df

    SOURCE_SUBREDDIT TARGET_SUBREDDIT LINK_SENTIMENT
1         rddtgaming         rddtrust              1
2            xboxone    battlefield_4              1
3                ps4    battlefield_4              1
4  fitnesscirclejerk        leangains              1
5  fitnesscirclejerk      lifeprotips              1
6  fitnesscirclejerk        leangains              1
7             cancer       fuckcancer              1
8            jleague           soccer              1
9         bestoftldr             tifu              1
10  quityourbullshit             pics              1

Now, the goal is to spread this from long-format to wide-format, as in the example image you posted. As you already determined, the identical rows (rows 4 and 6) pose a problem when trying to spread:

tidyr::spread(df, key = TARGET_SUBREDDIT, value = LINK_SENTIMENT, fill = 0)

Error: Each row of output must be identified by a unique combination of keys.
Keys are shared for 2 rows:
* 4, 6
Do you need to create unique ID with tibble::rowid_to_column()?
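The clash reported in that error can also be confirmed directly in base R. This is only a minimal sketch: it rebuilds the same example `df` so it runs on its own, and `keys`, `is_dup` and `dup_rows` are illustrative names, not anything from your code.

```r
# Rebuild the example data from this answer (10 rows, all sentiment 1)
df <- data.frame(
  SOURCE_SUBREDDIT = c("rddtgaming", "xboxone", "ps4", "fitnesscirclejerk",
                       "fitnesscirclejerk", "fitnesscirclejerk", "cancer",
                       "jleague", "bestoftldr", "quityourbullshit"),
  TARGET_SUBREDDIT = c("rddtrust", "battlefield_4", "battlefield_4",
                       "leangains", "lifeprotips", "leangains", "fuckcancer",
                       "soccer", "tifu", "pics"),
  LINK_SENTIMENT = rep(1L, 10),
  stringsAsFactors = FALSE
)

# The key columns that spread() would use to identify each output row
keys <- df[, c("SOURCE_SUBREDDIT", "TARGET_SUBREDDIT")]

# Flag every row whose source/target pair occurs more than once
# (checking from both directions marks the first occurrence as well)
is_dup <- duplicated(keys) | duplicated(keys, fromLast = TRUE)
dup_rows <- which(is_dup)

print(df[is_dup, ])
```

Here `dup_rows` should point at the same two rows that the error message lists.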

Since you want to keep the same number of rows when spreading, we can get around this by adding a unique ID to each row, so each row is unique. You do that with splitstackshape::getanID, but we can also do that with tidyverse packages:

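# Option 1: dplyr (appends rowid as the last column)
df2 <- dplyr::mutate(df, rowid = dplyr::row_number())
# Option 2: tibble (puts rowid first, as printed below)
df2 <- tibble::rowid_to_column(df)

Either of these adds a unique rowid column (they differ only in where the column is placed), giving a data.frame that I am assuming is similar to your mydata_id: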

df2

   rowid  SOURCE_SUBREDDIT TARGET_SUBREDDIT LINK_SENTIMENT
1      1        rddtgaming         rddtrust              1
2      2           xboxone    battlefield_4              1
3      3               ps4    battlefield_4              1
4      4 fitnesscirclejerk        leangains              1
5      5 fitnesscirclejerk      lifeprotips              1
6      6 fitnesscirclejerk        leangains              1
7      7            cancer       fuckcancer              1
8      8           jleague           soccer              1
9      9        bestoftldr             tifu              1
10    10  quityourbullshit             pics              1

Now, when we spread, the existence of the unique ID column keeps R from combining (or trying to combine) the rows with identical subreddit pairs:

df3 <- tidyr::spread(df2, key = TARGET_SUBREDDIT, value = LINK_SENTIMENT, fill = 0)
df3

   rowid  SOURCE_SUBREDDIT battlefield_4 fuckcancer leangains lifeprotips pics rddtrust soccer tifu
1      1        rddtgaming             0          0         0           0    0        1      0    0
2      2           xboxone             1          0         0           0    0        0      0    0
3      3               ps4             1          0         0           0    0        0      0    0
4      4 fitnesscirclejerk             0          0         1           0    0        0      0    0
5      5 fitnesscirclejerk             0          0         0           1    0        0      0    0
6      6 fitnesscirclejerk             0          0         1           0    0        0      0    0
7      7            cancer             0          1         0           0    0        0      0    0
8      8           jleague             0          0         0           0    0        0      1    0
9      9        bestoftldr             0          0         0           0    0        0      0    1
10    10  quityourbullshit             0          0         0           0    1        0      0    0

As you can see, the output of this mirrors the format of your desired output image and preserves both the order of the relationship and duplicate rows.
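In newer versions of tidyr, `spread()` is superseded by `pivot_wider()`, so the same reshape can be sketched as follows. This assumes tidyr 1.1.0 or later (for the scalar `values_fill`) and rebuilds the example data so it is self-contained; `wide` is just an illustrative name. I believe `pivot_wider()` orders the new columns by first appearance rather than alphabetically, so the column order may differ from the `spread()` output above.

```r
library(tidyr)

# Same example data as above
df <- data.frame(
  SOURCE_SUBREDDIT = c("rddtgaming", "xboxone", "ps4", "fitnesscirclejerk",
                       "fitnesscirclejerk", "fitnesscirclejerk", "cancer",
                       "jleague", "bestoftldr", "quityourbullshit"),
  TARGET_SUBREDDIT = c("rddtrust", "battlefield_4", "battlefield_4",
                       "leangains", "lifeprotips", "leangains", "fuckcancer",
                       "soccer", "tifu", "pics"),
  LINK_SENTIMENT = rep(1L, 10),
  stringsAsFactors = FALSE
)

# Base R way of adding a unique row ID as the first column
df2 <- cbind(rowid = seq_len(nrow(df)), df)

# One output column per target subreddit; missing combinations become 0
wide <- pivot_wider(
  df2,
  names_from  = TARGET_SUBREDDIT,
  values_from = LINK_SENTIMENT,
  values_fill = 0L
)

print(wide)
```

Because `rowid` stays in the data as an identifying column, the two fitnesscirclejerk/leangains rows remain separate, just as with `spread()`.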

This is EXACTLY what I need! THANK YOU! I couldn't figure this problem out all weekend. You've saved my project! — Miranda, Apr 22 '19 at 23:32

Rushabh Patel · Answer 2 · 2019-04-22T19:11:47.563

You could do something like this, where dt is the data frame from your question:

table(dt$SOURCE_SUBREDDIT, dt$TARGET_SUBREDDIT)

Output:

                  ama battlefield_4 bitcoin confession ffxivapp fuckcancer funny leangains lifeprotips nfl nofap pics rddtrust soccer tifu
  anarchychess        0             0       0          0        0          0     1         0           0   0     0    0        0      0    0
  bestof              0             0       0          1        0          0     0         0           0   0     0    0        0      0    0
  bestoftldr          0             0       0          0        0          0     0         0           0   0     0    0        0      0    1
  bitcoinmining       0             0       1          0        0          0     0         0           0   0     0    0        0      0    0
  cancer              0             0       0          0        0          1     0         0           0   0     0    0        0      0    0
  ffxiv               0             0       0          0        1          0     0         0           0   0     0    0        0      0    0
  fitnesscirclejerk   0             0       0          0        0          0     0         1           1   0     1    0        0      0    0
  internet_box        1             0       0          0        0          0     0         0           0   0     0    0        0      0    0
  jleague             0             0       0          0        0          0     0         0           0   0     0    0        0      1    0
  ps4                 0             1       0          0        0          0     0         0           0   0     0    0        0      0    0
  quityourbullshit    0             0       0          0        0          0     0         0           0   0     0    1        0      0    0
  rddtgaming          0             0       0          0        0          0     0         0           0   0     0    0        2      0    0
  subredditdrama      0             0       0          0        0          0     0         0           0   1     0    0        0      0    0
  switcharoo          0             0       0          0        0          0     1         0           0   0     0    0        0      0    0
  xboxone             0             1       0          0        0          0     0         0           0   0     0    0        0      0    0

Note: your expected output doesn't show an id column.
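One thing to be aware of: `table()` counts how often each pair occurs and ignores LINK_SENTIMENT, which is why rddtgaming/rddtrust shows 2 above. If you would rather have the sentiment values summed per pair, base R's `xtabs()` is one option. The sketch below uses only four rows from the question's sample, and `dt`, `counts` and `sums` are illustrative names.

```r
# Four rows from the question's sample, including both
# rddtgaming -> rddtrust links (sentiment 1 and -1)
dt <- data.frame(
  SOURCE_SUBREDDIT = c("rddtgaming", "xboxone", "subredditdrama", "rddtgaming"),
  TARGET_SUBREDDIT = c("rddtrust", "battlefield_4", "nfl", "rddtrust"),
  LINK_SENTIMENT   = c(1L, 1L, -1L, -1L),
  stringsAsFactors = FALSE
)

# Number of links per source/target pair (sentiment is ignored)
counts <- table(dt$SOURCE_SUBREDDIT, dt$TARGET_SUBREDDIT)

# Sum of LINK_SENTIMENT per source/target pair
sums <- xtabs(LINK_SENTIMENT ~ SOURCE_SUBREDDIT + TARGET_SUBREDDIT, data = dt)

print(counts)
print(sums)
```

Both results collapse repeated pairs into a single cell, so neither keeps one row per link the way the accepted answer does.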

I don't see any `id` variable in your input data. Please provide correct input data for solution. — Rushabh Patel, Apr 22 '19 at 18:09
The full data set can be downloaded here, but it is a very large file: http://snap.stanford.edu/data/soc-RedditHyperlinks.html — Miranda, Apr 22 '19 at 18:18
I added more input data into the question in case the dataset is too large for you to download — Miranda, Apr 22 '19 at 18:35
