
How can I count the number of specific words in a corpus of PDFs?

I tried using text_count, but I honestly didn't understand what it returned.

Hamed

1 Answer


First, OCR the PDFs if necessary, then convert them to raw text. The pdftools package can help with both the OCR and the conversion to text, but I am not sure that it can handle multiple columns.

https://cran.r-project.org/web/packages/pdftools/pdftools.pdf
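
As a rough sketch of the conversion step, assuming pdftools is installed and the PDFs already have a text layer: `pdf_text()` returns one string per page, so the helper below collapses the pages of each file into a single string. The helper name `read_pdf_corpus` and the folder path are made up for illustration.

```r
# Hypothetical helper: read every PDF in a folder into one string per file.
# Assumes the pdftools package is installed and the PDFs contain a text layer.
read_pdf_corpus <- function(dir) {
  files <- list.files(dir, pattern = "\\.pdf$", full.names = TRUE,
                      ignore.case = TRUE)
  texts <- vapply(
    files,
    # pdf_text() gives one element per page; join them into one string.
    function(f) paste(pdftools::pdf_text(f), collapse = "\n"),
    character(1)
  )
  names(texts) <- basename(files)
  texts
}

# Example call (the folder name is a placeholder):
# texts <- read_pdf_corpus("path/to/pdfs")
```

For scanned files without a text layer, pdftools also provides `pdf_ocr_text()`, which depends on the tesseract package being available.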

Here is another post:

Use R to convert PDF files to text files for text mining

As described in that post, you could use xpdf (installed via Homebrew) to convert the PDFs; I believe it offers somewhat more functionality for multiple columns and text alignment.

After you have raw text, you can use a package like tm to obtain word counts in a corpus. Let me know if this works or if you have further questions.
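
To show what counting specific words amounts to once the text is extracted, here is a minimal sketch in base R rather than tm, so it runs without extra packages. It assumes `texts` is a character vector with one element per document; the function name `count_words` and the sample strings are invented for illustration.

```r
# Count how often each target word occurs in each document.
# `texts`: character vector, one element per document (optionally named).
# `words`: the terms to look for, matched case-insensitively as whole tokens.
count_words <- function(texts, words) {
  words <- tolower(words)
  counts <- vapply(
    texts,
    function(txt) {
      # Split on anything that is not a letter, digit, or apostrophe.
      tokens <- strsplit(tolower(txt), "[^[:alnum:]']+")[[1]]
      vapply(words, function(w) sum(tokens == w), integer(1))
    },
    integer(length(words))
  )
  # Always return a documents-by-words matrix, even for one word or one document.
  matrix(counts, nrow = length(texts), ncol = length(words), byrow = TRUE,
         dimnames = list(names(texts), words))
}

# Example with made-up text:
texts <- c(a = "The cat sat. The CAT ran!", b = "No dogs here, only a cat.")
count_words(texts, c("cat", "the"))
```

Each row of the result is a document and each column is one of the requested words. With tm, the comparable route would be to build a corpus with `VCorpus(VectorSource(texts))` and pass `control = list(dictionary = c("cat", "the"))` to `DocumentTermMatrix()`, which restricts the counts to those terms.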

dcsuka