
I have a large CSV file with three columns of Reddit data: a subreddit name, a second subreddit name, and the number of unique commenters who have posted to both subreddits within the past month.

The CSV file contains each subreddit relationship going both ways; for instance, the following two lines both exist in the CSV:

Roadcam,Nootropics,39
Nootropics,Roadcam,39

In total there are 35778434 lines in the CSV file.

I'm looking to import the CSV file into R and store it as a sparse matrix for analysis. This is how I'm attempting to do that:

library(Matrix)

subreddit.overlaps <- read.csv("subreddit_overlaps_2017_01.csv")

subreddit.overlaps.matrix <- sparseMatrix(i = as.numeric(subreddit.overlaps[, 1]),
                                          j = as.numeric(subreddit.overlaps[, 2]),
                                          x = subreddit.overlaps[, 3])

However, the dimensions of the resulting sparse matrix are not what I would expect: it appears to have only 4561 rows and 68825 columns. I would have expected the matrix to be square, but that doesn't appear to be the case. Why would the created sparse matrix not be square?
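
For reference, here is a minimal, self-contained version of the same construction on made-up toy data (the names and numbers below are just placeholders, not the real file). When `dims` is not supplied, `sparseMatrix()` sizes the result from the largest index seen in `i` and `j`:

library(Matrix)

# Made-up toy data with the same three-column layout as the real CSV
toy <- data.frame(sub1    = c("Roadcam", "Nootropics"),
                  sub2    = c("Nootropics", "Roadcam"),
                  overlap = c(39, 39),
                  stringsAsFactors = TRUE)

# as.numeric() on a factor column returns the factor level index (1, 2, ...),
# so the default dimensions end up being nlevels(sub1) x nlevels(sub2)
toy.matrix <- sparseMatrix(i = as.numeric(toy$sub1),
                           j = as.numeric(toy$sub2),
                           x = toy$overlap)
dim(toy.matrix)  # 2 x 2 here, because both columns contain the same two names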

J0HN_TIT0R
  • The simplest explanation is that your assumptions about your data are not correct. `sparseMatrix` is telling you there are 68825 distinct values in the second column (that you map to `j`), and only 4561 distinct values in the first column (that you map to `i`). What do `nlevels(subreddit.overlaps[, 1])` and `nlevels(subreddit.overlaps[, 2])` give? – Gregor Thomas Dec 06 '17 at 22:06
  • Would there be a quick and easy way to check if this is the case using the data directly from the CSV file? – J0HN_TIT0R Dec 06 '17 at 22:12
  • Use `sed` or `awk` to pull the first and second columns into separate files [like in this question](https://stackoverflow.com/a/19602188/903061), then [do a sort-unique as here](https://unix.stackexchange.com/q/189684/219475) into two more files, then diff the output. – Gregor Thomas Dec 06 '17 at 22:20
  • A little bit of math helps check, too: the 68825 number is too big. If you had that many subreddits, pairwise you would expect roughly 68825^2 ≈ 4.7 billion rows of data. However, 4561 seems low: 4561^2 ≈ 20.8 million rows. You claim 35778434 rows, which seems *close* to plausible for 5982 subreddits. The exact number of rows for `n` subreddits should be `n^2 - n`, which is 35778342 for 5982 subreddits, off by 92. – Gregor Thomas Dec 06 '17 at 22:27
  • @Gregor, you were right, the data I was working with didn't contain all of the data I thought it did. I was able to see this using the AWK/Sort/Diff method you described. – J0HN_TIT0R Dec 07 '17 at 05:23
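
Following up on the suggestions in the comments (the `nlevels()` check and the sed/awk/diff comparison), here is a rough R-only sketch of the same diagnosis, plus one way to force a square matrix by indexing both columns against a single shared set of names. Column positions are assumed to be as in the question; this is a sketch rather than a definitive recipe:

library(Matrix)

# Treat the columns as plain character vectors
# (works whether or not read.csv() turned them into factors)
col1 <- as.character(subreddit.overlaps[, 1])
col2 <- as.character(subreddit.overlaps[, 2])

# Diagnosis: names that appear in one column but not the other
length(setdiff(col1, col2))   # in column 1 only
length(setdiff(col2, col1))   # in column 2 only

# Row-count sanity check: a full pairwise table over n subreddits
# (excluding self-pairs) should have n^2 - n rows
all.subs <- sort(union(col1, col2))
n <- length(all.subs)
n^2 - n                       # compare against nrow(subreddit.overlaps)

# Force a square matrix by using the same name set for rows and columns
i <- match(col1, all.subs)
j <- match(col2, all.subs)
subreddit.overlaps.matrix <- sparseMatrix(i = i, j = j,
                                          x = subreddit.overlaps[, 3],
                                          dims = c(n, n),
                                          dimnames = list(all.subs, all.subs))

From there, `isSymmetric()` on the result is a quick way to confirm that the relationships really are listed in both directions; for a file this size, `data.table::fread()` may also be a faster way to do the initial read than `read.csv()`.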

0 Answers