How do I retain my unique identifiers when cleaning and stemming with the tm package in R?

Asked Sep 22 '20 at 14:03

library(tm)
library(dplyr)

# To prepare for DataframeSource, the columns must be renamed to doc_id and text.
textdataframe <- textdataframe %>% rename(doc_id = orig_id, text = orig.narr)

corpus = Corpus(DataframeSource(textdataframe))

corpus = tm_map(corpus, PlainTextDocument)
corpus = tm_map(corpus, tolower)
corpus[[1]][1]  

#remove punctuation
corpus = tm_map(corpus, removePunctuation)
corpus[[1]][1]

#remove stopwords
corpus = tm_map(corpus, removeWords, c("cloth", stopwords("english")))
corpus[[1]][1]  

#stemming
corpus = tm_map(corpus, stemDocument)
corpus[[1]][1]

What ends up happening is that I lose the unique IDs I assigned when setting up the DataframeSource. I would like the IDs to be kept on each document as I go through the cleaning and stemming steps.

user35131

Run `meta(corpus, type = "indexed")`. What output do you get? – Ankit Sep 22 '20 at 15:02
data frame with 0 columns and 3702 rows – user35131 Sep 22 '20 at 15:05
When I don't add the DataframeSource part, it works the way I want it to. – user35131 Sep 22 '20 at 15:15
Even if you add `DataframeSource`, you should still be able to fetch the identifiers. Run `inspect(corpus[1:2])`; it should return `id` and `text` for the first 2 rows. – Ankit Sep 22 '20 at 15:28
Yes, I was able to fetch two rows, but I have no way of telling the identity of each row. – user35131 Sep 22 '20 at 15:30
Do you have duplicate values in the `doc_id` column? – Ankit Sep 22 '20 at 15:53
Surprisingly, no duplicates. – user35131 Sep 22 '20 at 15:56
Maybe you need to keep the ID information manually and join it again; see this [solution](https://stackoverflow.com/a/19851799). – Ankit Sep 22 '20 at 16:27
I didn't want to do a join because the point of this is to use it as a check against a join. – user35131 Sep 22 '20 at 16:33
Run `meta(corpus, "id")` after the clean and stem steps; maybe the IDs are being stored. [Source](https://stackoverflow.com/a/43516782) – Ankit Sep 22 '20 at 17:07
It gives me the row numbers, not the IDs that I assigned. It's like it washes them away once I start the cleaning process. – user35131 Sep 22 '20 at 17:08
It's due to `tm_map`. A user asked a similar question; see if this [solution](https://stackoverflow.com/a/25639656) works. – Ankit Sep 22 '20 at 17:17
That worked. Initially I said it gave me different words, but I recognize they're the same words, just in a different order, which is fine. Thank you so much. – user35131 Sep 22 '20 at 20:37
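
For later readers, here is a minimal sketch of the kind of change that keeps the IDs through the pipeline. It is not a copy of the linked answer. It assumes the IDs are lost because `tm_map(corpus, PlainTextDocument)` and a bare `tm_map(corpus, tolower)` replace each document and discard its metadata. The sample data frame is made up for illustration, `VCorpus` is used instead of `Corpus` so that every document carries its own metadata, and `stemDocument` needs the SnowballC package to be installed.

```r
library(tm)

# Hypothetical stand-in for the asker's data; the columns are already
# named doc_id and text, as DataframeSource requires.
textdataframe <- data.frame(
  doc_id = c("A-101", "B-202", "C-303"),
  text = c("The Cloth was torn!",
           "Running shoes, worn out.",
           "Several items were folded neatly."),
  stringsAsFactors = FALSE
)

corpus <- VCorpus(DataframeSource(textdataframe))

# No tm_map(corpus, PlainTextDocument) step: it would rebuild every
# document without its original id.
# Wrap base functions such as tolower in content_transformer() so that only
# the text is modified and the document metadata is left alone.
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, c("cloth", stopwords("english")))
corpus <- tm_map(corpus, stemDocument)

# Read the id back from each document after cleaning and stemming.
ids <- vapply(seq_along(corpus),
              function(i) as.character(meta(corpus[[i]], "id")),
              character(1))
print(ids)
```

If this assumption holds for your data, `ids` should contain the original `doc_id` values in their original order, which can then be compared against the join without relying on row position.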
