Extracting text from PDF with Python in repl

Question

I am trying to read data from a PDF in Python, and I am using a repl.it file because it makes it easier to test out different libraries. I have tried PyPDF2 and PyPDF4, which work but do not give any whitespace. tika gives me a server starting error, pdfminer does not work, and pdfminer3 works but without whitespace. pdftotext does not download properly. I was wondering if there is clearer documentation on how to make pdfminer3 give whitespace, or if there are more libraries to try.

score 0 · Answer 1 · answered Oct 12 '19 at 03:53

Give tika another try? From other posts I gather it is a pretty good solution.

I was able to install tika from the instructions on here:

https://github.com/chrismattmann/tika-python

and successfully parse a test pdf file.

STEPS I FOLLOWED TO USE TIKA WITH PYTHON:

1) Installation (with pip):

pip install tika

2) Create and run a test python script: (of course replace myfile.pdf with the path to your own pdf file)

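#!/usr/bin/env python
import tika
tika.initVM()
from tika import parser

# Parse the PDF; the result is a dict with "metadata" and "content" keys
parsedPDF = parser.from_file('myfile.pdf')
# Document metadata reported by Tika
print(parsedPDF["metadata"])
# Extracted text of the document
print(parsedPDF["content"])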

Regarding your error with the tika server not starting, you may want to check this post as well:

Use tika with python, runtimeerror: unable to start tika server

The currently most upvoted answer on that post basically says to make sure that you have Java installed, and that your installation is at Java 8, as all new versions of the tika-server.jar will require Java 8.
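To check the Java prerequisite from Python before trying tika again, something like the following might help. It is only a minimal sketch: the helper name java_version_output is made up for illustration, and it assumes that a java executable, if installed, is reachable on PATH.

```python
import shutil
import subprocess


def java_version_output():
    """Return the text printed by `java -version`, or None if java is not on PATH.

    Hypothetical helper for illustration; not part of tika.
    """
    java = shutil.which("java")
    if java is None:
        return None
    # `java -version` writes its report to stderr, so merge stderr into stdout.
    result = subprocess.run(
        [java, "-version"],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
    )
    return result.stdout


out = java_version_output()
print(out if out is not None else "java not found on PATH")
```

If this prints "java not found on PATH", the tika server cannot be started from that environment until Java is installed or added to PATH.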

Hope this helps, and good luck!

I installed tika and Java, and they both installed properly; however, when I run the code it returns Traceback (most recent call last): File "C:\Users\ttaw2\OneDrive\Desktop\HIstory\2018-2019 JV Regionals\readhistoryproblem.py", line 6, in <module> print(parsedPDF["content"]) File "C:\Users\ttaw2\AppData\Local\Programs\Python\Python36-32\lib\encodings\cp1252.py", line 19, in encode return codecs.charmap_encode(input,self.errors,encoding_table)[0] UnicodeEncodeError: 'charmap' codec can't encode character '\u0327' in position 4132: character maps to <undefined> — FightingJ, Oct 14 '19 at 00:59
@FightingJellybean Did you use the same basic python script I referenced above? I am not sure what the problem might be. Maybe something to do with the character encoding of the PDF file itself. Would need the actual file to replicate and debug I suppose. You might also wish to check this link for troubleshooting problems with PDF text: https://cwiki.apache.org/confluence/display/tika/Troubleshooting%20Tika#PDF_Text_Problems — Wattholm, Oct 14 '19 at 08:30
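For what it's worth, the traceback in the comment above fails inside cp1252.py during print(), which suggests the text was extracted fine and the Windows console encoding simply cannot represent U+0327 (a combining cedilla). A minimal sketch of one possible workaround, assuming that reading is correct: replace the unencodable characters before printing. The helper name safe_text and the sample string are made up for illustration.

```python
def safe_text(text, encoding="cp1252"):
    """Return text with characters the target encoding cannot represent replaced by '?'.

    Hypothetical helper for illustration; cp1252 is assumed from the traceback.
    """
    return text.encode(encoding, errors="replace").decode(encoding)


# 'c' followed by U+0327 (combining cedilla), the character named in the error
sample = "c\u0327a"
print(safe_text(sample))
```

With tika this would be print(safe_text(parsedPDF["content"])). Another option is to skip the console and write the text to a file opened with open(path, "w", encoding="utf-8"), which keeps every character intact.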

score 0 · Answer 2 · answered Oct 12 '19 at 03:55

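# Import the PyPDF2 library
# (these are the PyPDF2 1.x/2.x names; PyPDF2 3.x renamed them to
# PdfReader, len(reader.pages), reader.pages[0] and extract_text())
import PyPDF2

# Open the PDF file in binary read mode; the with block closes it afterwards
with open('example.pdf', 'rb') as pdf_file:
    # Create a PDF reader object
    pdf_reader = PyPDF2.PdfFileReader(pdf_file)
    # Print the number of pages in the PDF
    print(pdf_reader.numPages)
    # Get the page object for the first page (pages are zero-indexed)
    page_obj = pdf_reader.getPage(0)
    # Extract and print the text from that page
    print(page_obj.extractText())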
