I have a large number of PDF files sitting in an S3 directory. How do I process them in parallel (map-reduce style) using PySpark? All I want to do is extract the text from each file and store the results in an RDD; since the number of files is large, I would like to do this in parallel.
PySpark has a method called wholeTextFiles which can read a directory of text files. But my files are in PDF format, so I need to pre-process each PDF to extract its text before I can work with it.
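For concreteness, this is roughly the shape I am after. It is only a sketch: I am assuming PyPDF2 is installed on every executor, `s3a://my-bucket/pdfs/` is a placeholder path, and I am not sure binaryFiles is even the right entry point here:

```python
import io

from PyPDF2 import PdfReader
from pyspark import SparkContext

sc = SparkContext(appName="pdf-text-extraction")

def pdf_to_text(record):
    # record is a (path, file contents as bytes) pair produced by binaryFiles
    path, raw = record
    reader = PdfReader(io.BytesIO(raw))
    # extract_text() can return None for pages with no recoverable text
    text = "\n".join(page.extract_text() or "" for page in reader.pages)
    return path, text

# binaryFiles reads each file whole, one (path, bytes) record per PDF,
# and distributes those records across the executors
pdf_rdd = sc.binaryFiles("s3a://my-bucket/pdfs/")  # placeholder bucket/prefix
text_rdd = pdf_rdd.map(pdf_to_text)  # RDD of (path, extracted text)
```

Is this the right way to parallelize the extraction, or is there a better-suited API for binary files like these?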
Any help would be appreciated.