Tabula-py not extracting tables correctly

Question

I am building an API that uses tabula-py to extract tables from a PDF.

I built the API on a Windows machine and deployed it on Ubuntu 20.

On the Windows machine the extraction was flawless, and I was able to perform all the necessary steps. However, after deploying the FastAPI app on the Ubuntu server, the extraction is incorrect.

I tried providing different parameters, but none of them work. The PDF contains tables with no horizontal or vertical lines.

The extracted table on my Windows machine looks something like this:

The extracted table on Ubuntu looks like this:

My code looks like this:

import tabula

# filepath and pages_count are defined elsewhere in the API
# area is [top, left, bottom, right]; columns are x-coordinates of column boundaries
area1 = [210, 10, 750, 570]  # page 1
area2 = [130, 10, 750, 570]  # pages 2 onward
columns = [75, 250, 300, 370, 440, 530]

tables1 = tabula.read_pdf(filepath, guess=False, lattice=False,
                          stream=True, multiple_tables=True,
                          area=area1, pages=1, columns=columns)
tables2 = tabula.read_pdf(filepath, guess=False, lattice=False,
                          stream=True, multiple_tables=True,
                          pages=list(range(2, pages_count + 1)),
                          area=area2, columns=columns)
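
For reference, tabula-py reads area as [top, left, bottom, right] and columns as x-coordinates of the column boundaries, both in PDF points by default. The sketch below is a rough sanity check of the values used above. The helper check_layout is hypothetical and not part of tabula-py, and the rules it applies (ascending columns, columns inside the area) are my own assumptions about a sensible layout, not checks that tabula itself performs.

```python
def check_layout(area, columns):
    """Return a list of problems found in a tabula-style area/columns pair.

    Hypothetical helper: the rules below are assumed sanity checks,
    not validation performed by tabula-py.
    """
    top, left, bottom, right = area
    problems = []
    if top >= bottom:
        problems.append("top must be smaller than bottom")
    if left >= right:
        problems.append("left must be smaller than right")
    if list(columns) != sorted(columns):
        problems.append("columns must be in ascending order")
    # Column boundaries are x-coordinates, so compare them with left/right only
    outside = [x for x in columns if not left < x < right]
    if outside:
        problems.append(f"columns outside the area: {outside}")
    return problems


area1 = [210, 10, 750, 570]
area2 = [130, 10, 750, 570]
columns = [75, 250, 300, 370, 440, 530]

print(check_layout(area1, columns))
print(check_layout(area2, columns))
```

Both calls report no problems for the values in my code, so the coordinates themselves look consistent and the same on both machines.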

I don't know what's causing this issue, especially for this particular PDF. Even after trying multiple combinations of parameters and searching online, I could not reproduce the result I get on my local Windows machine.
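
Since the same code behaves differently on the two machines, one thing that may help narrow this down is comparing the environments. The following is a minimal sketch, assuming the package is installed under the distribution name tabula-py and that Java is invoked as java from PATH. The function names are made up for illustration. Running it on both Windows and Ubuntu would show whether the Java or library versions differ.

```python
import platform
import shutil
import subprocess
import sys
from importlib import metadata


def java_version():
    """Return the first line of `java -version`, or None if Java is unavailable."""
    java = shutil.which("java")
    if java is None:
        return None
    try:
        proc = subprocess.run([java, "-version"], capture_output=True,
                              text=True, timeout=30)
    except (OSError, subprocess.SubprocessError):
        return None
    # JVMs usually print the version banner to stderr rather than stdout
    output = (proc.stderr or proc.stdout).strip()
    return output.splitlines()[0] if output else None


def package_version(name):
    """Return the installed version of a distribution, or None if missing."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def environment_report():
    """Collect the details most likely to differ between the two machines."""
    return {
        "platform": platform.platform(),
        "python": sys.version.split()[0],
        "java": java_version(),
        "tabula-py": package_version("tabula-py"),
        "pandas": package_version("pandas"),
    }


if __name__ == "__main__":
    for key, value in environment_report().items():
        print(f"{key}: {value}")
```

If the reports differ, aligning the Java and tabula-py versions on the server with the Windows machine would be a reasonable first thing to try, though I have not confirmed that this is the cause.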
