Problem:

import tabula as tb
import pandas as pd

other = "https://github.com/chezou/tabula-py/raw/master/tests/resources/data.pdf"
dfs = tb.read_pdf(other, stream=True) #this works

file="D:\Favorites\1. Programming\Projects\cell penetrating peptide supplemental.pdf"
tables = tb.read_pdf(file, pages = "all", multiple_tables = True)
tables

output:

---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
<ipython-input-29-c598474e8fa3> in <module>
      6 
      7 file="D:\Favorites\1. Programming\Projects\cell penetrating peptide supplemental.pdf"
----> 8 tables = tb.read_pdf(file, pages = "all", multiple_tables = True)
      9 tables

~\anaconda3\lib\site-packages\tabula\io.py in read_pdf(input_path, output_format, encoding, java_options, pandas_options, multiple_tables, user_agent, **kwargs)
    312 
    313     if not os.path.exists(path):
--> 314         raise FileNotFoundError(errno.ENOENT, os.strerror(errno.ENOENT), path)
    315 
    316     if os.path.getsize(path) == 0:

FileNotFoundError: [Errno 2] No such file or directory: 'D:\\Favorites\x01. Programming\\Projects\\cell penetrating peptide supplemental.pdf'

It seems like everyone else who had this issue didn't get it resolved.

The first advice I followed was to check that the file actually exists.

import os

file = r"D:\Favorites\1. Programming\Projects\cell penetrating peptide supplemental.pdf"

print(os.path.isfile(file))
print(os.path.exists(file))
print(os.path.getsize(file) == 0)

output:

True
True
False

Why is read_pdf raising an error that it should only raise if os.path.exists(file) is False?

I tried a file from the internet and it worked perfectly. The file I'm trying to read doesn't have a URL: I can't view it in my browser, I can only download it. Otherwise I'd just try feeding its URL into the function.

UPDATE: I tried the suggested solution

import tabula as tb
import pandas as pd


tables = tb.read_pdf(r"D:\Favorites\1. Programming\Projects\cell penetrating peptide supplemental.pdf", pages = "all", multiple_tables = True)
tables

and got this:

Got stderr: Jun 28, 2020 11:17:13 AM org.apache.pdfbox.pdmodel.font.PDFont <init>
WARNING: Invalid ToUnicode CMap in font PKLNYU+CambriaMath
Jun 28, 2020 11:17:13 AM org.apache.pdfbox.pdmodel.font.PDSimpleFont toUnicode
WARNING: No Unicode mapping for 4 (33) in font PKLNYU+CambriaMath
Jun 28, 2020 11:17:13 AM org.apache.pdfbox.pdmodel.font.PDSimpleFont toUnicode
WARNING: No Unicode mapping for 3 (34) in font PKLNYU+CambriaMath
Jun 28, 2020 11:17:13 AM org.apache.pdfbox.pdmodel.font.PDSimpleFont toUnicode
WARNING: No Unicode mapping for 1 (35) in font PKLNYU+CambriaMath
Jun 28, 2020 11:17:13 AM org.apache.pdfbox.pdmodel.font.PDSimpleFont toUnicode
WARNING: No Unicode mapping for 2 (36) in font PKLNYU+CambriaMath
Jun 28, 2020 11:17:13 AM org.apache.pdfbox.pdmodel.font.PDFont <init>
WARNING: Invalid ToUnicode CMap in font FLAXFE+CambriaMath
Jun 28, 2020 11:17:13 AM org.apache.pdfbox.pdmodel.font.PDSimpleFont toUnicode
WARNING: No Unicode mapping for 2 (33) in font FLAXFE+CambriaMath
Jun 28, 2020 11:17:13 AM org.apache.pdfbox.pdmodel.font.PDSimpleFont toUnicode
WARNING: No Unicode mapping for 1 (34) in font FLAXFE+CambriaMath
Jun 28, 2020 11:17:13 AM org.apache.pdfbox.pdmodel.font.PDFont <init>
WARNING: Invalid ToUnicode CMap in font BPOUDD+CambriaMath
Jun 28, 2020 11:17:13 AM org.apache.pdfbox.pdmodel.font.PDSimpleFont toUnicode
WARNING: No Unicode mapping for 4 (33) in font BPOUDD+CambriaMath
Jun 28, 2020 11:17:13 AM org.apache.pdfbox.pdmodel.font.PDSimpleFont toUnicode
WARNING: No Unicode mapping for 3 (34) in font BPOUDD+CambriaMath
Jun 28, 2020 11:17:13 AM org.apache.pdfbox.pdmodel.font.PDSimpleFont toUnicode
WARNING: No Unicode mapping for 1 (35) in font BPOUDD+CambriaMath
Jun 28, 2020 11:17:13 AM org.apache.pdfbox.pdmodel.font.PDSimpleFont toUnicode
WARNING: No Unicode mapping for 2 (36) in font BPOUDD+CambriaMath
Jun 28, 2020 11:17:13 AM org.apache.pdfbox.pdmodel.font.PDFont <init>
WARNING: Invalid ToUnicode CMap in font DCUQIG+CambriaMath
Jun 28, 2020 11:17:13 AM org.apache.pdfbox.pdmodel.font.PDSimpleFont toUnicode
WARNING: No Unicode mapping for 1 (33) in font DCUQIG+CambriaMath
Jun 28, 2020 11:17:13 AM org.apache.pdfbox.pdmodel.font.PDFont <init>
WARNING: Invalid ToUnicode CMap in font DREOWG+CambriaMath
Jun 28, 2020 11:17:13 AM org.apache.pdfbox.pdmodel.font.PDSimpleFont toUnicode
WARNING: No Unicode mapping for 1 (33) in font DREOWG+CambriaMath
Jun 28, 2020 11:17:13 AM org.apache.pdfbox.pdmodel.font.PDFont <init>
WARNING: Invalid ToUnicode CMap in font EWGNLJ+CambriaMath
Jun 28, 2020 11:17:13 AM org.apache.pdfbox.pdmodel.font.PDSimpleFont toUnicode
WARNING: No Unicode mapping for 2 (33) in font EWGNLJ+CambriaMath
Jun 28, 2020 11:17:13 AM org.apache.pdfbox.pdmodel.font.PDSimpleFont toUnicode
WARNING: No Unicode mapping for 1 (34) in font EWGNLJ+CambriaMath
Jun 28, 2020 11:17:13 AM org.apache.pdfbox.pdmodel.font.PDFont <init>
WARNING: Invalid ToUnicode CMap in font PUHGFM+CambriaMath
Jun 28, 2020 11:17:13 AM org.apache.pdfbox.pdmodel.font.PDSimpleFont toUnicode
WARNING: No Unicode mapping for 2 (33) in font PUHGFM+CambriaMath
Jun 28, 2020 11:17:13 AM org.apache.pdfbox.pdmodel.font.PDSimpleFont toUnicode
WARNING: No Unicode mapping for 1 (34) in font PUHGFM+CambriaMath
Jun 28, 2020 11:17:13 AM org.apache.pdfbox.pdmodel.font.PDFont <init>
WARNING: Invalid ToUnicode CMap in font UHIZXI+CambriaMath
Jun 28, 2020 11:17:13 AM org.apache.pdfbox.pdmodel.font.PDSimpleFont toUnicode
WARNING: No Unicode mapping for 4 (33) in font UHIZXI+CambriaMath
Jun 28, 2020 11:17:13 AM org.apache.pdfbox.pdmodel.font.PDSimpleFont toUnicode
WARNING: No Unicode mapping for 3 (34) in font UHIZXI+CambriaMath
Jun 28, 2020 11:17:13 AM org.apache.pdfbox.pdmodel.font.PDSimpleFont toUnicode
WARNING: No Unicode mapping for 1 (35) in font UHIZXI+CambriaMath
Jun 28, 2020 11:17:13 AM org.apache.pdfbox.pdmodel.font.PDSimpleFont toUnicode
WARNING: No Unicode mapping for 2 (36) in font UHIZXI+CambriaMath
Jun 28, 2020 11:17:13 AM org.apache.pdfbox.pdmodel.font.PDFont <init>
WARNING: Invalid ToUnicode CMap in font UCENHU+CambriaMath
Jun 28, 2020 11:17:13 AM org.apache.pdfbox.pdmodel.font.PDSimpleFont toUnicode
WARNING: No Unicode mapping for 1 (33) in font UCENHU+CambriaMath
ellie-lumen

1 Answer

The problem is not caused by tabula-py, although the traceback ends there. read_pdf calls localize_file, which calls os.path.expanduser to expand the path. On Unix-like systems, "~" is an alias for the user's home directory, so on macOS os.path.expanduser does the following expansion:

>>> os.path.expanduser("~/Documents")
'/Users/username/Documents'

Neither os.path.expanduser nor os.fspath (which it calls internally) changes backslashes, though. The \ is interpreted earlier, when Python parses the string literal: in a normal (non-raw) literal, a backslash followed by one to three octal digits is an escape sequence for the character with that code. The functions simply receive the already-decoded string, so if you run

>>> os.path.expanduser("\125")
'U'
>>> os.fspath("\125")
'U'
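
As a quick check (a sketch, not a trace of tabula-py's internals), the snippet below suggests that the decoding happens in the string literal itself and that os.fspath and os.path.expanduser hand back whatever string they receive. The variable names are only illustrative.

```python
import os

literal = "\125"   # octal escape: the parser turns this into one character
raw = r"\125"      # raw literal: backslash, '1', '2', '5' stay as four characters

print(literal, len(literal))   # U 1
print(len(raw))                # 4

# Neither function rewrites backslashes; each returns the string it was given.
print(os.fspath(raw) == raw)                    # True
print(os.path.expanduser(raw) == raw)           # True
print(os.path.expanduser(literal) == literal)   # True
```

The damage is already done by the time read_pdf is called, which is why the fix has to be applied where the path is written.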

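In your case, the \1 in the non-raw literal was parsed as the escape for the character \x01, so the string no longer matches the real directory name and Windows cannot find it. That is also why your os.path.exists check returned True: there you wrote the path with an r prefix, but in the read_pdf call you did not. To keep your path unchanged, pass it as a raw string, i.e. put an r before it like this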

>>> os.path.expanduser(r"\125")
'\\125'
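
The raw-string prefix is one of several ways to write a Windows path safely. The sketch below uses a shortened version of the path from the question and compares three spellings with pathlib.PureWindowsPath, which works on any operating system because it never touches the file system. The variable names are only illustrative.

```python
from pathlib import PureWindowsPath

# "\1" in a normal literal is an octal escape and becomes the character U+0001.
broken = "\1. Programming"
# With the r prefix the backslash and the digit stay as two separate characters.
raw = r"\1. Programming"

print(repr(broken))             # '\x01. Programming'
print(repr(raw))                # '\\1. Programming'
print(len(raw) - len(broken))   # 1

# Three spellings of the same Windows path that survive parsing intact.
as_raw = r"D:\Favorites\1. Programming\Projects"
as_doubled = "D:\\Favorites\\1. Programming\\Projects"
as_forward = "D:/Favorites/1. Programming/Projects"

same = PureWindowsPath(as_raw) == PureWindowsPath(as_doubled) == PureWindowsPath(as_forward)
print(same)                     # True
```

Forward slashes are accepted by Windows file APIs, so the third spelling is often the least error-prone when the path is typed by hand.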

references:

tabula's read_pdf line 311 localize_file is invoked

tabula's localize_file line 72 os.path.expanduser is invoked

Python's expanduser line 293 fspath is invoked

a reference to ANSI escape sequences

whilrun
  • Wow, that's really cool. I really appreciate how much your answer demystified this escape character problem for me. Thank you for breaking it down like that. Do you know why I get this new error? It's updated in the post. @whilrun – ellie-lumen Jun 28 '20 at 16:21
  • @ellie-lumen You really don't need to worry about these, as they are common warnings when extracting text from a PDF document. A ToUnicode CMap is provided by the PDF to map a font's character codes to Unicode code points, so that the text can be extracted as Unicode. However, a malformed PDF may contain a badly formed CMap, so some characters cannot be converted, or the document may contain invalid characters. If you want to fix it yourself, [this post](https://stackoverflow.com/questions/39485920/how-to-add-unicode-in-truetype0font-on-pdfbox-2-0-0) will help, but it's generally not worth the time. – whilrun Jun 28 '20 at 17:35