
I have recently learned how to pick a single CSV file and find the most commonly used words in it using text mining in R. What I would now like to do is have R search through multiple CSV files (in my example I have 5) and pick out the words that appear in every file. FYI, in each of my 5 files I've artificially inserted the word "hieroglyphics", and I would like my code to pull this out as a matching word across all 5 files, along with any other words that match across all 5 documents.

I've set up the code as follows below, but am really struggling with how to proceed. Can anyone help?

Many thanks in advance,

Paul

P.S. As an extension (if the above is too easy for some of you!): is there a way to pull out the number of the 5 CSV files that contain a given word? Continuing the above example, if the word "Egypt" were contained in only 4 of the 5 CSV files, could R be programmed to report "hieroglyphics - 5", "Egypt - 4", etc. for every word in all 5 documents?

install.packages('tm')
library(tm)
setwd('C:\\Users\\900369\\Documents\\R\\Text Mining\\')
reviews1 <- read.csv("Evo-USA-Oct-Nov-141-160.csv",stringsAsFactors=FALSE)
reviews2 <- read.csv("Evo-USA-Oct-Nov-141-160 - Copy (2).csv",stringsAsFactors=FALSE)
reviews3 <- read.csv("Evo-USA-Oct-Nov-141-160 - Copy (3).csv",stringsAsFactors=FALSE)
reviews4 <- read.csv("Evo-USA-Oct-Nov-141-160 - Copy (4).csv",stringsAsFactors=FALSE)
reviews5 <- read.csv("Evo-USA-Oct-Nov-141-160 - Copy (5).csv",stringsAsFactors=FALSE)
filenames <- list.files('C:\\Users\\900369\\Documents\\R\\Text Mining\\', pattern = "\\.csv$")
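
Here is a minimal sketch of one way to proceed from there, assuming the review text sits in the first column of each CSV (adjust the column index if yours differs): build one set of distinct words per file with lapply and tm, then intersect the sets.

# Assumption: the review text is in the first column of each CSV.
words_per_file <- lapply(filenames, function(f) {
  reviews <- read.csv(f, stringsAsFactors = FALSE)
  corpus <- Corpus(VectorSource(reviews[[1]]))   # text assumed to be in column 1
  corpus <- tm_map(corpus, content_transformer(tolower))
  corpus <- tm_map(corpus, removePunctuation)
  dtm <- DocumentTermMatrix(corpus)
  colnames(dtm)                                  # the distinct words in this file
})

# Words that appear in all 5 files -- "hieroglyphics" should be among them
common_words <- Reduce(intersect, words_per_file)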
    Welcome to StackOverflow! Please read the info about [how to ask a good question](http://stackoverflow.com/help/how-to-ask) and how to give a [reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example/5963610). This will make it much easier for others to help you. – Jaap Jan 05 '16 at 10:35
  • if you have the solution for one file, you should look into the apply family of function, especially `lapply`, to generalize on all items of a list. See here : http://stackoverflow.com/questions/3505701/r-grouping-functions-sapply-vs-lapply-vs-apply-vs-tapply-vs-by-vs-aggrega/7141669#7141669 and here for importing multiple csv : http://stackoverflow.com/questions/11433432/importing-multiple-csv-files-into-r – scoa Jan 05 '16 at 11:08
  • Depending on what the return value is for your function (likely a data frame of unique words and counts of them for each file) you could look into merging and aggregating them, which could give you exactly what you want (i.e. occurrences of words across all files, as well as the number of files the word occurs in). This link might be useful - https://www.miskatonic.org/2012/09/24/counting-and-aggregating-r/. If the files you have are large, too, I find the dplyr package works very well - http://www.onthelambda.com/2014/02/10/how-dplyr-replaced-my-most-common-r-idioms/ – Pash101 Jan 05 '16 at 12:22
  • Thanks for both your responses! Ideally I'd like the final output to simply give me a list of words that appeared in all individual files, which means that merging them into one document would not be suitable (I could see how many times a word appeared in the merged document, but not whether it came up in each unique CSV file). An even more ideal outcome would tell me 'the word "pharmaceutical" appeared in 4 out of the 5 CSV files', and so on for every word in all 5 documents (a sketch of this appears below). Hope this makes sense. scoa, apologies, I'm fairly new to R so have struggled with your suggestion! – paulmurph272003 Jan 05 '16 at 14:52
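
Building on the words_per_file list sketched above, the per-file tally asked about in the P.S. can fall out of a single table() call: each list element contains a word at most once, so counting occurrences across the flattened list counts files rather than total uses.

# How many of the 5 files does each word occur in?
file_counts <- sort(table(unlist(words_per_file)), decreasing = TRUE)

# Words present in every file score 5 ("hieroglyphics"); a word found in
# only four of them (like "Egypt" in the example) would score 4.
file_counts[common_words]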

0 Answers