I am reading a CSV into R and want to build a corpus from it with the tm package, but I am not getting the results I expect. When I read in a CSV of text and then inspect the corpus, the data is all numeric. (I only show the first three columns to protect privacy; there are nine in total, as the inspect output shows.)

library(tm)

data <- read.csv("filename.csv")
head(data)    
  Directory.Code First.Name Last.Name
1        SCA0025     Nbcde    Cdbaace
2        SCA0025   AJCocei    aiceice
3        SCA0025      aceca   Ac;eice
4        SCA0025      Acoicm  aie;cee 
5        SCA0025     acei     aciomac
6        SCA0025       caeij   CIMCEv

data.corp <- Corpus(DataframeSource(data))
inspect(data.corp[1])
A corpus with 1 text document

The metadata consists of 2 tag-value pairs and a data frame
Available tags are:
  create_date creator 
Available variables in the data frame are:
  MetaID 

$`1`
16
2195
6655
6613
1
5
9757
1
1

If it helps to know the purpose: I am trying to read in a CSV of names and un-normalized job titles/descriptions, then compare them to a corpus of known titles/descriptions that serve as categories. Now that I type this in, I realize that this CSV will be my test/prediction data, but I still want to build a corpus from a CSV with colnames = KnownJobTitle,Description.

The goal of this question is to read a CSV into a corpus successfully, but I would also like to know whether the tm package is advisable for classifying into more than two categories, and whether other packages are better suited to this task.

user1174265

1 Answer

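I ran into a similar problem. It happens because the text fields read from the CSV are stored as factors (categorical) rather than character vectors, so the corpus ends up holding the factors' integer codes instead of the text. Convert the columns to character first, using something like: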

data <- data.frame(lapply(data, as.character), stringsAsFactors=FALSE)
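
To see the difference without the original file, here is a minimal sketch in base R. The sample rows and the names csv_text, as_factors, as_text and converted are made up for illustration; they are not from the question's data. It reads the same text once with factors and once without, then applies the conversion above.

```r
# Hypothetical sample standing in for filename.csv
csv_text <- "Directory.Code,First.Name,Last.Name
SCA0025,Nbcde,Cdbaace
SCA0025,aceca,aiceice"

# stringsAsFactors = TRUE reproduces the factor columns from the question
as_factors <- read.csv(text = csv_text, stringsAsFactors = TRUE)

# stringsAsFactors = FALSE keeps the text columns as character
as_text <- read.csv(text = csv_text, stringsAsFactors = FALSE)

class(as_factors$First.Name)  # "factor"
class(as_text$First.Name)     # "character"

# A factor is stored as integer codes; these are the numbers that
# show up when the corpus is inspected
as.integer(as_factors$First.Name)

# The conversion from the answer turns every column back into character
converted <- data.frame(lapply(as_factors, as.character),
                        stringsAsFactors = FALSE)
sapply(converted, class)
```

Passing stringsAsFactors = FALSE to read.csv avoids the conversion step entirely. From R 4.0.0 onward that is already the default, so the problem mainly appears on older R versions.
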
chappjc