map reduce - extract text from PDF

Question

I have a large number of PDF files sitting in an S3 directory. How do I process them in parallel (map-reduce style) using PySpark? All I want to do is extract the text from them and store it in an RDD; since the number of files is large, I would like to do it in parallel.

PySpark has a method called wholeTextFiles which can read a directory of text files. However, my files are in PDF format, so I need to pre-process each PDF to extract its text before I can work with it.

Any help would be appreciated.

score 0 · Answer 1 · answered Nov 09 '17 at 03:26


As far as I know, PDF is not one of the formats that Spark can read directly. You can check spark-packages.org and see that there are no PDF libraries listed.

However, there are many libraries that can extract text from PDFs, for example Tika or Tesseract (the latter is an OCR engine, so it applies to scanned documents). So all you need to do is extract the text from each file. Luckily, you can do this from Python using any of the libraries mentioned in this related post: Python module for converting PDF to text

Additionally, this Cloudera blog post shows how to extract the text and process it with a few lines of Spark code and one library:

How-to: Index Scanned PDFs at Scale Using Fewer Than 50 Lines of Code
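The approach described above (read each PDF as raw bytes, then map a text-extraction function over the files) could be sketched roughly as follows. This is only a minimal sketch under a few assumptions: the third-party `pypdf` package is installed on every executor, the PDFs contain a text layer (scanned PDFs would need OCR instead), and the cluster is configured to read from S3. The names `extract_pdf_text`, `to_record` and `build_text_rdd` are hypothetical helpers, not part of Spark.

```python
import io


def extract_pdf_text(pdf_bytes):
    """Return the text of one PDF given its raw bytes.

    Assumption: the third-party `pypdf` package is available on every
    executor. Any other PDF-to-text library could be swapped in here.
    """
    from pypdf import PdfReader  # imported lazily so it resolves on the worker

    reader = PdfReader(io.BytesIO(pdf_bytes))
    # extract_text() may return an empty string for pages without a text layer
    return "\n".join(page.extract_text() or "" for page in reader.pages)


def to_record(path_and_bytes, extractor=extract_pdf_text):
    """Map one (path, bytes) pair to (path, text); a failure yields (path, None)."""
    path, data = path_and_bytes
    try:
        return path, extractor(bytes(data))
    except Exception:
        # One corrupt PDF should not fail the whole job
        return path, None


def build_text_rdd(sc, pdf_glob):
    """Build an RDD of (path, text) pairs from every file matching pdf_glob."""
    # binaryFiles yields one (path, file content as bytes) pair per file,
    # so each PDF is processed whole, in parallel across the executors
    return sc.binaryFiles(pdf_glob).map(to_record)
```

With an existing SparkContext `sc`, this might be called as `build_text_rdd(sc, "s3a://my-bucket/pdfs/*.pdf")`, where the bucket and prefix are placeholders. Files that could not be parsed come back with `None` as their text, so they can be filtered out or inspected afterwards.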


xmorera


Thanks for the answer. I understand that I cannot use PDFs directly, but I want to process the PDFs in parallel to extract their text, and the PDFs are sitting in an S3 bucket. Is there a way I can do it using PySpark? Something like applying a processing function to all PDFs (for example, with map). – Achuthan Sekar Nov 09 '17 at 23:35

When you use PySpark you have access to all of Python's functionality, so you can call a function that processes the PDF, something like map(lambda x: extractPDF(x)), which will return the text. You only need to write the function. Two things: first, take performance into consideration, since you are making a call to a UDF; second, read the Cloudera blog post I included in my answer, as it explains a very similar scenario. – xmorera Nov 10 '17 at 14:19
Achuthan: did my answer help point you in the right direction? – xmorera Nov 14 '17 at 14:08
