R text mining documents from CSV file

Question

First of all, my apologies for repeating a question that was asked on Aug 1 '13. I cannot comment on the original question because commenting requires 50 reputation, which I don't have. The original question is "R text mining documents from CSV file (one row per doc)".

I am trying to work with the tm package in R, and have a CSV file of article abstracts with each line being a different abstract. I want each line to be a different document within the corpus. There are 2,000 rows in my data set.

I ran the following code, as previously suggested by Ben:

# change this file location to suit your machine
file_loc <- "C:/Users/.../docs.csv"
# change TRUE to FALSE if you have no column headings in the CSV
x <- read.csv(file_loc, header = TRUE)
require(tm)
corp <- Corpus(DataframeSource(x))
docs <- DocumentTermMatrix(corp)

When I check the class of the result:

# checking class
class(docs)
[1] "DocumentTermMatrix"    "simple_triplet_matrix"

The problem is that tm transformations do not work on an object of this class:

# Preparing the Corpus
# Simple Transforms
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
docs <- tm_map(docs, toSpace, "/")

I get this error:

Error in UseMethod("tm_map", x) : 
no applicable method for 'tm_map' applied to an object of class "c('DocumentTermMatrix', 'simple_triplet_matrix')"

Or, with a different pattern:

docs <- tm_map(docs, toSpace, "/|@|nn|")

I get the same error:

Error in UseMethod("tm_map", x) : 
no applicable method for 'tm_map' applied to an object of class "c('DocumentTermMatrix', 'simple_triplet_matrix')"

Your help would be greatly appreciated.

You have to apply your function to the `Corpus` object and not to the `DocumentTermMatrix`. After `corp <- Corpus(DataframeSource(x))`, try `corp <- tm_map(corp, toSpace, "/")` and only then create your `DocumentTermMatrix`. — nicola, Mar 28 '16 at 06:52
@nicola Thank you very much. You were absolutely right, and I got it to run. However, it only seemed to work until I created my dtm. The last lines of code were `docs <- tm_map(docs, stemDocument)` and `inspect(docs[16])`. The result is `Content: chars: 1190`, which seems fine to me. But when I created the dtm, the result of `dim(dtm)` is `[1] 2004 0`. Yes, I have 2004 documents, but 0 terms?! Nothing in my matrix?! Please advise. — Sahara, Mar 28 '16 at 08:35
It really depends on your data. I can't tell anything without seeing it. Inspect your corpus step by step to see what's going on. — nicola, Mar 28 '16 at 09:11
Don't forget to add `stringsAsFactors = FALSE` in your `read.csv()` call. — Ken Benoit, Mar 30 '16 at 12:38
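
The ordering nicola describes, together with Ken Benoit's `stringsAsFactors` advice, can be sketched as follows. This is a minimal illustration rather than the asker's actual script: the inline CSV, the `abstract` column name, and the use of `VCorpus(VectorSource(...))` in place of `DataframeSource` are assumptions made so the snippet is self-contained.

```r
library(tm)

# Hypothetical two-row stand-in for docs.csv (one abstract per row).
csv_text <- "abstract\ntext mining/analysis of article abstracts\ncontact author@example.org about corpus|matrix results"
x <- read.csv(text = csv_text, header = TRUE, stringsAsFactors = FALSE)

# One document per row. VectorSource is an assumption made for illustration.
corp <- VCorpus(VectorSource(x$abstract))

# Apply transformations to the corpus, not to the document-term matrix.
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
corp <- tm_map(corp, toSpace, "/|@|\\|")

# Build the document-term matrix only after the corpus has been cleaned.
dtm <- DocumentTermMatrix(corp)
dim(dtm)
```

If the second value of `dim(dtm)` is 0, inspecting a single document after each `tm_map()` step (for example with `content(corp[[1]])`) should show which transformation emptied the text.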

score 0 · Accepted Answer · answered Apr 01 '16 at 07:02

The code

docs <- tm_map(docs, toSpace, "/|@|nn|")

must be replaced with

docs <- tm_map(docs, toSpace, "/|@|\\|")

Then it will work fine.
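
A likely reason the first pattern produced an empty matrix, offered as an interpretation rather than something stated in the thread: the trailing `|` in `"/|@|nn|"` leaves an empty alternative, which can match the empty string at every position. In that case `gsub()` may insert spaces between characters, leaving no token long enough for `DocumentTermMatrix()`, which by default keeps only terms of three or more characters. The base-R sketch below uses a made-up sample string to compare the two patterns. In the corrected one, `\\|` escapes the pipe so that it is matched literally.

```r
# Hypothetical sample string containing each separator of interest.
sample_text <- "text/mining@corpus|matrix"

# Pattern from the question: the trailing "|" adds an empty alternative.
broken <- gsub("/|@|nn|", " ", sample_text)

# Corrected pattern: "\\|" matches a literal pipe character.
fixed <- gsub("/|@|\\|", " ", sample_text)

print(broken)
print(fixed)  # "text mining corpus matrix"
```

Comparing the two printed strings makes it easy to see whether the first pattern is altering more of the text than intended.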

Sahara
