How can I count the number of specific words in a corpus of PDFs?
I tried using text_count but I honestly didn't understand what it returned.
First, OCR the PDFs if necessary, then convert them to raw text. The pdftools package
can help with both steps, but I am not sure that it can handle multiple columns.
https://cran.r-project.org/web/packages/pdftools/pdftools.pdf
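A minimal sketch of the conversion step with pdftools, in case it helps. The folder name `pdfs` and the helper `pdf_to_string` are made up for illustration; `pdf_text()` is the pdftools function that returns one string per page. For scanned PDFs without a text layer, I believe `pdf_ocr_text()` (which relies on the tesseract package) is the equivalent, but I have not tested it here.

```r
library(pdftools)

# Hypothetical folder holding the PDFs; change this to your own path.
pdf_dir <- "pdfs"
pdf_files <- list.files(pdf_dir, pattern = "\\.pdf$", full.names = TRUE)

# pdf_text() returns a character vector with one element per page,
# so collapse the pages to get a single string per document.
pdf_to_string <- function(path) {
  pages <- pdf_text(path)
  paste(pages, collapse = "\n")
}

# One string per PDF, named after the file it came from.
texts <- vapply(pdf_files, pdf_to_string, character(1))
names(texts) <- basename(pdf_files)
```

After this, `texts` is a named character vector that can be passed to whatever text-mining package you use next.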
Here is another post:
Use R to convert PDF files to text files for text mining
As that post describes, you could also use xpdf (installed via Homebrew) to convert the PDFs, as I believe it has more options for handling multiple columns and text alignment.
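If you go the xpdf route, here is a rough sketch of calling its `pdftotext` command-line tool from R. It assumes `pdftotext` is installed and on your PATH; the helper names `pdftotext_args` and `convert_with_pdftotext` are my own. The `-layout` flag asks pdftotext to keep the physical page layout, so try it both ways on a multi-column file and see which output is more usable.

```r
# Build the argument vector for pdftotext; paths are quoted so that
# file names containing spaces survive the shell.
pdftotext_args <- function(pdf_path, txt_path, keep_layout = TRUE) {
  c(if (keep_layout) "-layout", shQuote(pdf_path), shQuote(txt_path))
}

# Convert one PDF to a text file and return the path of the text file.
convert_with_pdftotext <- function(pdf_path,
                                   txt_path = sub("\\.pdf$", ".txt", pdf_path),
                                   keep_layout = TRUE) {
  status <- system2("pdftotext", pdftotext_args(pdf_path, txt_path, keep_layout))
  if (status != 0) stop("pdftotext failed for ", pdf_path)
  txt_path
}
```

You would then read each resulting file back in with `readLines()` before counting words.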
Once you have raw text, you can use a package like tm to obtain word counts in a corpus. Let me know if this works or if you have further questions.
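For the counting step, something along these lines should work with tm. The two short strings in `texts` and the words in `target_words` are placeholders I made up; substitute the text you extracted from your PDFs and the words you care about. The `dictionary` control option restricts the document-term matrix to the listed words.

```r
library(tm)

# Placeholder documents standing in for the text extracted from the PDFs.
texts <- c(doc1 = "The cat sat on the mat. The cat slept.",
           doc2 = "A dog chased the cat.")

# Build a corpus, then lower-case it and strip punctuation so that
# "Cat", "cat." and "cat" are all counted as the same word.
corpus <- VCorpus(VectorSource(texts))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)

# Only count the words of interest (placeholders here).
target_words <- c("cat", "dog")
dtm <- DocumentTermMatrix(corpus, control = list(dictionary = target_words))

# Rows are documents, columns are the target words.
counts <- as.matrix(dtm)
rownames(counts) <- names(texts)
print(counts)

# Totals across the whole corpus.
print(colSums(counts))
```

One thing to watch: by default tm drops words shorter than three characters, so if you need to count very short words you would have to set the `wordLengths` control option as well.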