I am reading a CSV into R and want to build a corpus from it with the tm package, but I am not getting the desired results. Currently, when I read in a CSV of text and then inspect the corpus, the data is all numeric. (I have included only the first three columns of data to protect privacy; there are nine, as shown in the inspect results.)
library(tm)
data <- read.csv("filename.csv")
head(data)
Directory.Code First.Name Last.Name
1 SCA0025 Nbcde Cdbaace
2 SCA0025 AJCocei aiceice
3 SCA0025 aceca Ac;eice
4 SCA0025 Acoicm aie;cee
5 SCA0025 acei aciomac
6 SCA0025 caeij CIMCEv
data.corp <- Corpus(DataframeSource(data))
inspect(data.corp[1])
A corpus with 1 text document
The metadata consists of 2 tag-value pairs and a data frame
Available tags are:
create_date creator
Available variables in the data frame are:
MetaID
$`1`
16
2195
6655
6613
1
5
9757
1
1
If it helps to know the purpose: I am trying to read in a CSV of names and un-normalized job titles/descriptions, then compare them against a corpus of known titles/descriptions that serve as categories. Now that I type this out, I realize that this CSV will be my test/prediction data, but I still want to build a corpus from a CSV with colnames = KnownJobTitle,Description.
The goal of this question is to successfully read a CSV into a corpus, but I would also like to know whether it is advisable to use the tm package for classification into more than two categories, and/or whether other packages are better suited to this task.
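To make the target concrete, here is a minimal sketch of the kind of corpus described above, built from a CSV with the columns KnownJobTitle and Description. It is an illustration under assumptions, not a confirmed fix: the two rows are invented stand-ins for the real file, it assumes a reasonably recent tm where `VCorpus`, `VectorSource`, and `meta<-` are available, and it builds documents from the Description column only. The `stringsAsFactors = FALSE` argument is included because factor columns are stored as integer codes, which may be why the inspect output above is numeric.

```r
library(tm)

# Hypothetical stand-in for the real CSV; these rows are invented for illustration.
csv_text <- "KnownJobTitle,Description
Engineer,Designs and builds software systems
Accountant,Prepares and audits financial records"

# Keep text columns as character vectors rather than factors
# (factors are stored internally as integer codes).
jobs <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# One document per row, built from the Description column only.
jobs.corp <- VCorpus(VectorSource(jobs$Description))

# Attach the known title to each document as indexed metadata.
meta(jobs.corp, "KnownJobTitle", type = "indexed") <- jobs$KnownJobTitle

inspect(jobs.corp[1])
```

With a real file, `read.csv(text = csv_text, ...)` would become `read.csv("filename.csv", stringsAsFactors = FALSE)`.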