
My question is about text mining and text processing. I would like to build a co-occurrence matrix from my data, which looks like this:

dat <- read.table(text="id_reférence id_paper
        621107   621100
        621100   621101
        621107   621102
        621109   621103
        621105   621104
        621103   621105
        621109   621106
        621106   621107
        621107   621108
        621106   621109", header=TRUE)

expected <- matrix(0,10,10)
### Article 1 has been cited by article 2
expected[2, 1] <- 1

Thanks in advance :)

cincinnatus

2 Answers

# loop through the observations of dat
for(i in seq_len(nrow(dat))) {
  # convert reference ids to integer and store in a vector
  # example data requires this step, you may already have integers in your actual data
  ref <- as.integer(strsplit(as.character(dat$id_reférence[i]), ",")[[1]])
  # loop through the list of references
  for(j in ref) {
    # mark the citations using (row, column) ~ (i, j) pairs
    expected[dat$id_paper[i], j] <- 1
  }
}

expected
#      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
# [1,]    0    1    0    0    0    0    0    0    0     0
# [2,]    0    0    0    1    0    0    0    1    0     0
# [3,]    1    0    0    0    1    0    0    0    0     0
# [4,]    0    0    0    0    0    0    0    1    0     0
# [5,]    0    0    0    1    1    0    0    0    1     0
# [6,]    0    0    1    0    0    0    0    1    0     0
# [7,]    0    1    0    1    0    0    0    0    0     0
# [8,]    0    0    0    0    0    1    0    0    1     0
# [9,]    0    0    0    0    0    0    0    0    0     1
# [10,]   1    0    0    1    0    0    0    0    1     0
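
If the ids are large numbers such as 621100 rather than 1 to 10, as in the data shown in the question, they cannot be used directly as matrix indices. A minimal sketch of one way around this, assuming the two-column data from the question and using `match()` to map each id to its position in a sorted id vector. The ASCII column name `id_reference` and the names `ids` and `adj` are illustrative choices, not from the original post:

```r
# Data from the question, with an ASCII column name (assumption for portability)
dat <- data.frame(
  id_reference = c(621107, 621100, 621107, 621109, 621105,
                   621103, 621109, 621106, 621107, 621106),
  id_paper     = c(621100, 621101, 621102, 621103, 621104,
                   621105, 621106, 621107, 621108, 621109)
)

# All distinct ids, sorted; position in this vector becomes the matrix index
ids <- sort(unique(c(dat$id_reference, dat$id_paper)))

# Preallocate a square matrix labelled by id
adj <- matrix(0, nrow = length(ids), ncol = length(ids),
              dimnames = list(as.character(ids), as.character(ids)))

# A two-column index matrix assigns all (row, column) pairs at once:
# row = id_paper, column = id_reference, same convention as the loop above
adj[cbind(match(dat$id_paper, ids), match(dat$id_reference, ids))] <- 1
```

Rows and columns can then be addressed by id, for example `adj["621100", "621107"]`.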
OzanStats
  • Sorry, I can't help you without a [reproducible](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) question. You updated the problem and none of these answers work anymore. The new question doesn't seem to be general enough either. You will keep running into errors and wasting time if your example dataset is not representative of the original one. – OzanStats Nov 24 '18 at 21:40

Here is another approach using data.table. One limitation is that the approach below does not produce a sparseMatrix. Depending on the size of your data set, it might be worth looking for an approach that yields a sparse data object.

library(data.table)
setDT(dat)
# split id_reférence column into multiple rows by comma
# code for this step taken from: https://stackoverflow.com/questions/13773770/split-comma-separated-strings-in-a-column-into-separate-rows
dat = dat[, strsplit(as.character(id_reférence), ",", fixed=TRUE),
   by = .(id_paper, id_reférence)][, id_reférence := NULL][
    , setnames(.SD, "V1", "id_reférence")]
# add value column for casting
dat[, cite := 1]
# cast your data into wide format (one column per reference id)
dat = dcast(dat, id_paper ~ id_reférence, value.var = "cite", fill = 0)[, id_paper := NULL]
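
Regarding the sparse alternative mentioned above: a minimal sketch, assuming the two-column data from the question and the `Matrix` package (which ships with standard R installations). The ASCII column name `id_reference` and the names `ids` and `sp` are illustrative, not from the original post:

```r
library(Matrix)

# Data from the question, with an ASCII column name (assumption for portability)
dat <- data.frame(
  id_reference = c(621107, 621100, 621107, 621109, 621105,
                   621103, 621109, 621106, 621107, 621106),
  id_paper     = c(621100, 621101, 621102, 621103, 621104,
                   621105, 621106, 621107, 621108, 621109)
)

# Map each id to a 1-based position so it can serve as a matrix index
ids <- sort(unique(c(dat$id_reference, dat$id_paper)))

# Only the non-zero cells are stored: row = id_paper, column = id_reference
sp <- sparseMatrix(
  i = match(dat$id_paper, ids),
  j = match(dat$id_reference, ids),
  x = 1,
  dims = c(length(ids), length(ids)),
  dimnames = list(as.character(ids), as.character(ids))
)
```

Note that `sparseMatrix()` sums the values of duplicated (i, j) pairs, so repeated rows in the input would yield counts above 1 rather than a 0/1 indicator.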
Manuel Bickel
  • The data you posted does not produce proper tabular data. Could you check this again, please? Side note: you might also use `dput` for posting data examples. – Manuel Bickel Nov 24 '18 at 21:37
  • Loops are not necessarily slower than the alternatives, and code that avoids loops is not necessarily faster. Nice `data.table` approach though! – OzanStats Nov 24 '18 at 21:50
  • You are right: if you initialize your data object first, loops are usually fine. I will delete the respective part of my answer. Thanks for still acknowledging my approach. – Manuel Bickel Nov 24 '18 at 22:05
  • For your updated data, you can use the code of my answer: just skip the step of splitting into multiple rows by comma and you are there. Please consider accepting the answer of @Ozan147 (which will also lead you to what you want with some slight adaptations) or mine if they provide a working solution for you. – Manuel Bickel Nov 25 '18 at 11:37
  • Thanks a lot for your answers. I've been on the project for several nights, and I have no choice but to ask questions to move forward. – cincinnatus Nov 25 '18 at 13:14
  • Asking questions is fine. Sometimes the difficulty is that one simply lacks the right word to search for. In your case, reshaping / casting data is a standard task in data science; you simply did not know how to name your problem, otherwise you would probably have found an answer quickly. So please understand that for such basic tasks, the people who answer expect some effort from the people who ask, along the lines of "I have tried this and that but it does not work". Therefore, always include what you have tried so far. Good luck with your project and forthcoming questions. – Manuel Bickel Nov 25 '18 at 16:10
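
The suggestion in the comments above, to skip the comma-splitting step for the updated two-column data, might look roughly like the following. This is a sketch under assumptions: it uses the data as shown in the question, an ASCII column name `id_reference`, and an illustrative result name `wide`; it requires the `data.table` package.

```r
library(data.table)

# Data from the question, with an ASCII column name (assumption for portability)
dat <- data.table(
  id_reference = c(621107, 621100, 621107, 621109, 621105,
                   621103, 621109, 621106, 621107, 621106),
  id_paper     = c(621100, 621101, 621102, 621103, 621104,
                   621105, 621106, 621107, 621108, 621109)
)

# Indicator column to spread across the reference ids
dat[, cite := 1]

# One row per paper, one column per distinct reference id, 0 where no citation
wide <- dcast(dat, id_paper ~ id_reference, value.var = "cite", fill = 0)
```

Unlike a square id-by-id matrix, this result only has columns for ids that actually occur as references, so it is rectangular for this data set.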