I am trying to translate PDF files using a translation API and output the result as a PDF while keeping the formatting the same. My approach is to convert the PDF to a Word document, translate that file, and then convert it back to PDF. The problem is that there is no efficient way to convert the PDF to Word. I am trying to write my own program, but PDFs come in many formats, so I guess it will take some effort to handle all of them. So my question is: is there any efficient way to translate these PDFs without losing the formatting, or is there any efficient way to convert them to docx? I am using Python as the programming language.
-
Try referring to this answer: https://stackoverflow.com/questions/26358281/convert-pdf-to-doc-python-bash – Daniel Isaac Jul 12 '18 at 11:03
-
@DanielIsaac thanks for the reply, but I tried this solution and the current LibreOffice doesn't support this feature. – Jul 12 '18 at 11:10
3 Answers
Probably not.
PDFs aren't meant to be machine readable or editable, really; they describe formatted, laid-out, printable pages.

You can use pdfminer to extract the text instead of relying on a conversion API. Here is an example:
from io import StringIO

from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage


def convert_pdf_to_txt(path):
    """Return the text of the PDF at `path` as a single string."""
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()  # in-memory buffer that collects the extracted text
    codec = 'utf-8'
    laparams = LAParams()  # default layout analysis parameters
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    fp = open(path, 'rb')
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    maxpages = 0  # 0 means no page limit
    caching = True
    pagenos = set()  # empty set means all pages
    for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages,
                                  password=password, caching=caching,
                                  check_extractable=True):
        interpreter.process_page(page)
    text = retstr.getvalue()
    fp.close()
    device.close()
    retstr.close()
    return text

PDF is (typically) not a structured data format. What I mean by that is, a PDF document (typically) contains no notion of "these words form a sentence" or "these sentences make up a paragraph" or "this content is the first row of the second column of this table".
Simplified, a PDF contains something like this:
- Go to position 40, 120
- Set the stroke color to black
- Set the active font to Helvetica, in size 12
- Draw the glyph for the character "H"
- Go to position 47, 120
- Draw the glyph for the character "e"
In short, the viewing software (and libraries reading the PDF) typically only know "an H was drawn at ..." and "an e was drawn at ...".
It takes some lucky guesswork to be able to determine whether those two instructions belong together. You can do things like "what would be the width of the space character in the font that was used to draw H? Is the e closer to the H than that width?"
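To make that guesswork concrete, here is a rough sketch of the gap heuristic. Everything in it is an assumption for illustration: the `Glyph` class, the `group_into_words` function, and the positions and widths are made up, and a real PDF library would have to supply the glyph data.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    char: str
    x: float      # left edge of the glyph on the page
    y: float      # baseline position
    width: float  # advance width at the current font size (illustrative)


def group_into_words(glyphs, space_width, line_tolerance=0.5):
    """Group glyphs into words by comparing gaps against the space width."""
    words = []
    current = ""
    prev = None
    # PDF y coordinates grow upward, so sort by descending y, then by x.
    for g in sorted(glyphs, key=lambda g: (-g.y, g.x)):
        if prev is not None:
            same_line = abs(g.y - prev.y) <= line_tolerance
            gap = g.x - (prev.x + prev.width)
            # A new line or a gap at least as wide as a space ends the word.
            if not same_line or gap >= space_width:
                words.append(current)
                current = ""
        current += g.char
        prev = g
    if current:
        words.append(current)
    return words


# Hypothetical glyphs: "H" and "e" touch each other, "i" starts after a wide gap.
sample = [
    Glyph("H", 40.0, 120.0, 7.0),
    Glyph("e", 47.0, 120.0, 5.5),
    Glyph("i", 58.0, 120.0, 2.7),
    Glyph("s", 61.0, 120.0, 5.0),
]
words = group_into_words(sample, space_width=3.3)
```

With these made-up numbers the "e" starts exactly where the "H" ends, so the two are merged, while the gap before the "i" is wider than the assumed space width and starts a new word.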
But even that breaks down rather easily. PDF has the concept of subset fonts, which you can think of as "the PDF contains a Frankenstein font that only knows about the characters it needs".
And because you can simply not render the "space" character (and instead just move the drawing cursor), there's no need to give this subset font any information about the "space" character (or the width thereof).
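As a sketch of what that means for the heuristic: if the font's width table has no entry for the space character, you have to fall back to a guess. The `space_threshold` function and the `fallback_ratio` of a quarter of the font size are my own assumptions, not something the PDF format prescribes. The widths are in 1/1000 of the font size, which is how simple PDF fonts express glyph widths.

```python
def space_threshold(font_widths, font_size, fallback_ratio=0.25):
    """Return the gap (in page units) that should count as a space.

    font_widths maps characters to widths in 1/1000 of the font size.
    fallback_ratio is a guess used when the font has no space glyph.
    """
    space = font_widths.get(" ")
    if space is not None:
        return space / 1000 * font_size
    # Subset font without a space glyph: fall back to a rough estimate.
    return fallback_ratio * font_size


# Full font: the space width is known (values as in standard Helvetica metrics).
full_font = {" ": 278, "H": 722, "e": 556}
# Subset font: only the glyphs that were actually drawn are present.
subset_font = {"H": 722, "e": 556}

known = space_threshold(full_font, font_size=12)
guessed = space_threshold(subset_font, font_size=12)  # 0.25 * 12 = 3.0
```

The guessed value may be too large or too small for a given font, which is exactly why text extraction from PDFs with subset fonts tends to glue words together or split them apart.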
