
I am working on a project that requires me to iterate over a Document Term Matrix, converting all non-zero values to 1 and keeping zero values at zero. The function I'm using now takes forever to run, and I would like help optimizing the code.

My current code is:

convert_counts <- function(x) {
  x <- ifelse(x > 0, 1, 0)
  x <- factor(x, levels = c(0, 1), labels = c("No", "Yes"))
}

data_exp <- apply(data_dtm, 2, convert_counts)

Where data_dtm is a large Document Term Matrix.

  • I hope this helps you : https://stackoverflow.com/questions/12835942/fast-replacing-values-in-dataframe-in-r – Carles Nov 14 '18 at 16:28

1 Answer


The function you have transforms a sparse matrix into a full (dense) character matrix. With a large document term matrix this leads to long running times and a good chance of a memory error. Replacing values in a sparse matrix can be done quickly if you make use of how the matrix is built: its non-zero values are stored in the v (values) component of the matrix. See ?slam::simple_triplet_matrix.
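To make the storage layout concrete, here is a minimal sketch that builds a tiny triplet matrix by hand. It assumes the slam package is installed; the object name `stm` and the counts are made up for illustration and are not taken from the question's data.

```r
library(slam)

# Toy 3-document x 4-term matrix with only three non-zero counts
# (made-up data standing in for a real document term matrix).
stm <- simple_triplet_matrix(
  i = c(1, 2, 3),   # row (document) index of each stored value
  j = c(1, 3, 2),   # column (term) index of each stored value
  v = c(2, 3, 5),   # the stored non-zero counts
  nrow = 3, ncol = 4
)

# Only the non-zero cells are kept: 3 stored values for 12 cells.
length(stm$v)
prod(dim(stm))

# Converting to a dense matrix materialises every cell, zeros included.
as.matrix(stm)
```

Because only `v` holds the counts, an operation that touches `v` alone scales with the number of non-zero entries rather than with the full size of the matrix.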

Using any of the apply family on a sparse matrix, without functions that are designed to work with sparse matrices, turns it into a normal (dense) matrix, with correspondingly long run times and memory issues.

To change all values different from 0 in your case, just use the following:

data_dtm$v[data_dtm$v > 0] <- 1
inspect(data_dtm) # show first 10 columns and rows

This sets all non-zero values to 1 and keeps the data as a document term matrix (aka nice and sparse).
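As a small worked example of the same replacement, the sketch below applies it to a hand-built slam triplet matrix instead of a real document term matrix. It assumes slam is installed; `counts` and `binary` are hypothetical names and the data is invented.

```r
library(slam)

# Toy count matrix: 3 documents x 4 terms, four non-zero cells (made-up data).
counts <- simple_triplet_matrix(
  i = c(1, 1, 2, 3),
  j = c(1, 4, 3, 2),
  v = c(2, 1, 3, 5),
  nrow = 3, ncol = 4
)

# Work on a copy so the original counts stay available.
binary <- counts

# Same idea as above: overwrite the stored values, leave i and j untouched.
binary$v[binary$v > 0] <- 1

# Still a sparse triplet matrix, now holding only 1s in the same positions.
class(binary)
as.matrix(binary)
```

Calling `as.matrix()` here is only for display on a toy example; on a large matrix that step is exactly the dense conversion to avoid.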

Depending on your follow-up analysis, you really should make use of sparse matrix functions. If you transform a large document term matrix into a data.frame or data.table, you have a good chance of running out of memory.
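One possible illustration of a sparse-aware function is `col_sums()` from slam, which aggregates without building a dense matrix. The sketch assumes slam is installed; the matrix `binary` and the name `doc_freq` are made up, and whether this fits depends on the analysis you have in mind.

```r
library(slam)

# Toy binary matrix: 3 documents x 4 terms (made-up data),
# standing in for a document term matrix after the replacement above.
binary <- simple_triplet_matrix(
  i = c(1, 1, 2, 3, 3),
  j = c(1, 4, 1, 2, 4),
  v = rep(1, 5),
  nrow = 3, ncol = 4
)

# Number of documents containing each term, computed on the sparse object.
doc_freq <- col_sums(binary)
doc_freq

# Number of distinct terms per document.
row_sums(binary)
```

With a 0/1 matrix the column sums are document frequencies, which is often enough for filtering rare terms before any conversion to a data.frame.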

For any follow-up questions, please include a reproducible example and the expected output.

phiver