Is there a way to count specific words from a corpus of PDFs in R?

Question

How can I count the number of specific words in a corpus of PDFs?

I tried using text_count, but I honestly didn't understand what it returned.

score 0 · Answer 1 · answered Jul 18 '22 at 17:18

First, you would want to OCR the PDFs if necessary, then convert them to raw text. pdftools can help with both the OCR and the text conversion, but I am not sure that it can handle multiple columns.

https://cran.r-project.org/web/packages/pdftools/pdftools.pdf
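For the conversion step, something like the following might work as a starting point. It is only a sketch: read_pdf_texts is a name I made up, the folder path in the usage comment is a placeholder, and it assumes the PDFs already have a text layer. For scanned files I believe pdftools::pdf_ocr_text (which relies on the tesseract package) would be the call to try instead.

```r
# Sketch: read every PDF in a folder into a named character vector,
# one element per file. Requires the pdftools package to be installed.
read_pdf_texts <- function(pdf_dir) {
  pdf_files <- list.files(pdf_dir, pattern = "\\.pdf$",
                          full.names = TRUE, ignore.case = TRUE)

  texts <- vapply(pdf_files, function(path) {
    # pdf_text() returns one string per page; join the pages into one string
    pages <- pdftools::pdf_text(path)
    paste(pages, collapse = "\n")
  }, character(1))

  names(texts) <- basename(pdf_files)
  texts
}

# Usage (placeholder path):
# texts <- read_pdf_texts("path/to/pdfs")
```

The result is plain text per document, which is the form most text-mining packages expect as input.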

Here is another post:

Use R to convert PDF files to text files for text mining

As in the post above, you could use xpdf (installed via Homebrew) to convert the PDFs, as I believe it has more functionality for multiple columns and text alignment.
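If you go the xpdf route, calling its pdftotext tool from R could look roughly like this. This is a sketch under assumptions: the helper names are hypothetical, pdftotext must be on your PATH, and I am assuming the -layout flag (which tries to preserve the physical page layout) is what you want for your files. Passing "-" as the output file sends the text to stdout.

```r
# Hypothetical helper: build the argument vector for pdftotext.
pdftotext_args <- function(pdf_path) {
  c("-layout", shQuote(pdf_path), "-")
}

# Hypothetical helper: run pdftotext and return the text as one string.
pdf_to_text_xpdf <- function(pdf_path) {
  out <- system2("pdftotext", pdftotext_args(pdf_path), stdout = TRUE)
  paste(out, collapse = "\n")
}

# Usage (placeholder file name):
# txt <- pdf_to_text_xpdf("paper.pdf")
```

Whether -layout handles multi-column pages better than the default reading-order mode depends on the document, so it is worth comparing both on a sample page.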

After you have raw text, you can use a package like tm to obtain word counts in a corpus. Let me know if this works or if you have further questions.
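Here is a minimal sketch of the counting step with tm. The two example texts and the target words are made up for illustration; in practice you would pass in the text extracted from your PDFs. The dictionary option restricts the document-term matrix to just the words you care about.

```r
library(tm)

# Made-up stand-ins for text extracted from two PDFs
texts <- c("The model fits the data. The data are noisy.",
           "No model is perfect, but some data help.")

corpus <- VCorpus(VectorSource(texts))

# The specific words to count (illustrative)
target_words <- c("data", "model")

# Lowercase, strip punctuation, and keep only the target words.
# Note: by default tm drops words shorter than 3 characters;
# change wordLengths in the control list if you need those.
dtm <- DocumentTermMatrix(
  corpus,
  control = list(tolower = TRUE,
                 removePunctuation = TRUE,
                 dictionary = target_words)
)

counts <- as.matrix(dtm)        # rows = documents, columns = target words
total_counts <- colSums(counts) # totals across the whole corpus

print(counts)
print(total_counts)
```

The per-document matrix tells you where each word occurs, while the column sums give the corpus-wide totals.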
