
I want to use PDFBox in Python. I installed it from https://pypi.org/project/python-pdfbox/, but when I run p = pdfbox.PDFBox() I get the following error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/suyog/anaconda3/lib/python3.6/site-packages/pdfbox/__init__.py", line 81, in __init__
    self.pdfbox_path = self._get_pdfbox_path()
  File "/home/suyog/anaconda3/lib/python3.6/site-packages/pdfbox/__init__.py", line 57, in _get_pdfbox_path
    r = urllib.request.urlopen(pdfbox_url)
  File "/home/suyog/anaconda3/lib/python3.6/urllib/request.py", line 223, in urlopen
    return opener.open(url, data, timeout)
  File "/home/suyog/anaconda3/lib/python3.6/urllib/request.py", line 532, in open
    response = meth(req, response)
  File "/home/suyog/anaconda3/lib/python3.6/urllib/request.py", line 642, in http_response
    'http', request, response, code, msg, hdrs)
  File "/home/suyog/anaconda3/lib/python3.6/urllib/request.py", line 570, in error
    return self._call_chain(*args)
  File "/home/suyog/anaconda3/lib/python3.6/urllib/request.py", line 504, in _call_chain
    result = func(*args)
  File "/home/suyog/anaconda3/lib/python3.6/urllib/request.py", line 650, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 404: Not Found

Any idea how to use PDFBox on Ubuntu?

cs95

2 Answers


It seems the existing distribution is outdated:

  1. The latest version is 2.0.9, and the download link for 2.0.8 is defunct.
  2. The code tries to verify the package's integrity by downloading an md5 file, which no longer exists for the current version.

I've forked the existing repo and applied a patch. The working version of this wrapper can be found here.

To install from my repository with pip, follow the directions posted here. Alternatively, download the source and run python setup.py install in the directory.

Running the code works for me:

In [8]: import pdfbox
   ...: p = pdfbox.PDFBox()
   ...: 

In [9]: p
Out[9]: <pdfbox.PDFBox at 0x1046254e0>
cs95
  • Thanks, downloading and setting it up worked for me. – Suyog Chadawar May 29 '18 at 05:37
  • How can I extract font size, color, and font style with pdfbox in Python? I can only see functionality for extracting the text. – Suyog Chadawar May 29 '18 at 08:39
  • @SuyogChadawar Hey I'd recommend asking a new question. I'm not familiar with pdfbox, my knowledge ends with knowing how to install it. – cs95 May 29 '18 at 08:39
  • @SuyogChadawar There are very many questions (and answers) here on Stack Overflow dealing with that topic. – mkl May 29 '18 at 08:52
  • @mkl I searched for this on stackoverflow but could only find solutions for java and not in python, could you point me to the questions which will help me use pdfbox in python for extracting font size and others, thanks – Suyog Chadawar May 29 '18 at 09:41
  • As PDFBox is first and foremost a Java library, it is not surprising that most examples are in Java. It shouldn't be too difficult to port them, though. – mkl May 29 '18 at 16:35

Adding on to this answer, since it feels incomplete to a person installing this for the first time.

Running pip install python-pdfbox installs the project at https://pypi.org/project/python-pdfbox/; that is the expected behavior.

The usage instructions say to instantiate the PDFBox object like so: p = pdfbox.PDFBox().

At this point, some of us seeking answers may encounter the HTTP error described in this question.

Looking into the repository, notice that the version of PDFBox to download is hardcoded. This implies that anyone who pip installs this package needs to be "lucky" enough that the hardcoded version of Apache PDFBox (which is a Java library) matches the one currently offered for download.

Solution:

Disclaimer: I sought to make this work for Windows 10.

The package's __init__ looks for the pdfbox-app jar through the PDFBOX environment variable. If it does not find one, it tries to download it, hence the error.

  1. Download the latest pdfbox-app-{version}.jar from the Apache PDFBox site.
  2. Set the PDFBOX environment variable, e.g. set PDFBOX=C:\Dev\pdfbox-app-2.0.11.jar
  3. Start a new command line and try:
    • import pdfbox
    • p = pdfbox.PDFBox()
    • p.extract_text("some_filename")
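If you would rather not touch the system settings, the same steps can probably be done from inside Python. This is only a minimal sketch: configure_pdfbox_jar is a hypothetical helper of mine, not part of python-pdfbox, and it assumes the wrapper reads the PDFBOX environment variable at the moment PDFBox() is instantiated (as described above), so the variable has to be set before that call.

```python
import os
from pathlib import Path


def configure_pdfbox_jar(jar_path, env=os.environ):
    """Point the PDFBOX environment variable at a local pdfbox-app jar.

    Hypothetical helper, not part of python-pdfbox. It fails early with a
    clear message if the jar is missing, instead of letting the wrapper
    fall back to its download attempt.
    """
    jar = Path(jar_path).expanduser().resolve()
    if not jar.is_file():
        raise FileNotFoundError(f"pdfbox-app jar not found: {jar}")
    env["PDFBOX"] = str(jar)
    return str(jar)


# Intended usage (requires python-pdfbox and a downloaded jar):
#
#   configure_pdfbox_jar(r"C:\Dev\pdfbox-app-2.0.11.jar")
#   import pdfbox
#   p = pdfbox.PDFBox()
#   p.extract_text("some_filename")
```

Setting the variable this way only affects the current Python process, so nothing needs to be cleaned up afterwards.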

Caveat: extract_text() does not handle file names that contain spaces, for some reason.
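I have not tracked down the cause, but one possible workaround is to copy the PDF to a name without spaces before passing it in. A rough sketch, where copy_without_spaces is a hypothetical helper of mine (not part of python-pdfbox) and the only thing it does is replace spaces in the file name:

```python
import shutil
import tempfile
from pathlib import Path


def copy_without_spaces(pdf_path, work_dir=None):
    """Copy pdf_path to a file whose name contains no spaces.

    Hypothetical helper, not part of python-pdfbox. Only the file name is
    sanitized; if work_dir (or the system temp directory) itself contains
    spaces, the returned path will still contain them.
    """
    src = Path(pdf_path)
    if work_dir is not None:
        target_dir = Path(work_dir)
    else:
        target_dir = Path(tempfile.mkdtemp(prefix="pdfbox_"))
    target = target_dir / src.name.replace(" ", "_")
    shutil.copy2(str(src), str(target))
    return target


# Intended usage (requires python-pdfbox):
#
#   safe_pdf = copy_without_spaces("my report.pdf")
#   p.extract_text(str(safe_pdf))
```

The original file is left untouched; the copy can be deleted once the text has been extracted.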

Sean Ang