Extract bold and underlined texts from .pdf

Question

I need to extract text from a PDF, but the PDF contains some bold and underlined text. I tried MyPDF2, but I am getting an error when reading PDFs that contain formatted text.

    import PyPDF2
    pdf_file = open('Downloads/th.pdf','rb')
    read_pdf = PyPDF2.PdfFileReader(pdf_file)
    number_of_pages = read_pdf.getNumPages()
    page = read_pdf.getPage(0)
    page_content = page.extractText()
    print (page_content)

Output

    ˘ˇˆˆ˝˛˚˜ ˜˚!˘˘ˇˆ˙˛˝˚˜˚ !ˆ"#$ˆ%&'˛"˝#$%˝˚'(˚˛)˛˝*+!-.$ˆ˚˛˚˛˘/˛˛0˛122/ 
    ˘˛˘˚˘˚2ˆ$".#$ˆ%˘˛˛$ˆ$%#$ˆ%˛˛˛˛˝˝(0/ 0$%˙˚˙3#"$˘--4˛0˚! 
    ˆ"#$ˆ%56272˛ˇ5'˛6222˛'4˘8(9˛(˜˚˛&˙˙˙˙˙

what did you try? can you post code? what was the error? did you really mean MyPDF2? I've used **P**yPDF2 with Python before… — Sam Mason, Jan 17 '19 at 12:26
I don't know about underlined text, but for **bold** fonts, try this updated answer https://stackoverflow.com/questions/53398611/how-to-extract-bold-text-from-a-pdf-using-r/67963468#67963468 which uses the free software R. — venrey, Jun 14 '21 at 00:09

Sergey Ovsyannikov · Answer 1 · 2019-01-17T12:41:36.480

2

I was using Python 3.6 and the PyPDF2 module:

Get and install Python 3.
Install the PyPDF2 module using pip. Run in a terminal (or CMD/PowerShell on Windows): `pip install PyPDF2`

Run this code in the Python console, as in the tutorial, to read the PDF file and extract the text:

    import PyPDF2
    pdfFileObj = open('meetingminutes.pdf', 'rb')
    pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
    pageObj = pdfReader.getPage(0)
    pageObj.extractText()
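
The names used above (`PdfFileReader`, `getPage`, `extractText`) were deprecated and then removed in PyPDF2 3.0.0. A minimal sketch of the same steps with the newer names follows; it assumes a recent PyPDF2 release is installed, and the helper name `extract_first_page_text` is hypothetical. It only updates the API calls and would not by itself be expected to fix garbled output caused by a font's encoding.

```python
def extract_first_page_text(path):
    """Return the extracted text of the first page of the PDF at `path`."""
    # Imported lazily so the helper can be defined even if PyPDF2 is missing;
    # PdfReader replaces the older PdfFileReader name.
    from PyPDF2 import PdfReader

    # The context manager closes the file even if extraction raises.
    with open(path, "rb") as pdf_file:
        reader = PdfReader(pdf_file)
        first_page = reader.pages[0]      # replaces getPage(0)
        return first_page.extract_text()  # replaces extractText()
```

For example, `print(extract_first_page_text('meetingminutes.pdf'))` would print the first page's text, given that file exists.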

edited Jan 17 '19 at 12:41

answered Jan 17 '19 at 12:36

Sergey Ovsyannikov


Thanks, but I used the same code. The text is bold and underlined, and I am getting output like this: ˘ˇˆˆ˝˛˚˜ ˜˚!˘˘ˇˆ˙˛˝˚˜˚ !ˆ"#$ˆ%&'˛"˝#$%˝˚'(˚˛)˛˝*+!-.$ˆ˚˛˚˛˘/˛˛0˛122/˘˛˘˚˘˚2ˆ$".#$ˆ%˘˛˛$ˆ$%#$ˆ%˛˛˛˛˝˝(0/ 0$%˙˚˙3#"$˘--4˛0˚!ˆ"#$ˆ%56272˛ˇ5'˛6222˛'4˘8(9˛(˜˚˛&˙˙˙˙˙ – Nandhakumar Rajendran Jan 17 '19 at 12:39
`import PyPDF2 pdf_file = open('Downloads/th.pdf','rb') read_pdf = PyPDF2.PdfFileReader(pdf_file) number_of_pages = read_pdf.getNumPages() page = read_pdf.getPage(0) page_content = page.extractText() print (page_content) ` – Nandhakumar Rajendran Jan 17 '19 at 12:41
Strange, I tested it on a PDF of mine that also includes bold and underlined text. Maybe your issue is due to a subsetted (not embedded) font in your PDF file. Try using PdfToolBox https://www.callassoftware.com/en/products/pdftoolbox – Sergey Ovsyannikov Jan 17 '19 at 12:51
[Try this pdf](http://164.100.79.153/judis/chennai/index.php/casestatus/viewpdf/280647) – Nandhakumar Rajendran Jan 17 '19 at 13:09
I need to extract the text without using any external tools. Is there any other way to read the PDFs? – Nandhakumar Rajendran Jan 20 '19 at 02:10
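
On the bold text and subsetted fonts discussed in the comments: a PDF has no "bold" flag on text. The weight is usually only visible in the font name, for example the `/BaseFont` value of a font dictionary, and a subsetted font's name starts with a six-letter uppercase tag and a plus sign. Underlines are typically drawn as separate line graphics, so a font-name check cannot find them. The sketch below is a heuristic that uses only the standard library; the helper names are hypothetical, and the font names it receives are assumed to have been read from the PDF by some other means.

```python
import re

# Subset fonts are named like "ABCDEF+Arial-BoldMT":
# six uppercase letters, a plus sign, then the real font name.
_SUBSET_TAG = re.compile(r"^[A-Z]{6}\+")


def is_subset_font(font_name):
    """Return True if the font name carries a subset tag."""
    return bool(_SUBSET_TAG.match(font_name.lstrip("/")))


def strip_subset_tag(font_name):
    """Drop a leading "/" and any subset tag from a font name."""
    return _SUBSET_TAG.sub("", font_name.lstrip("/"))


def looks_bold(font_name):
    """Heuristic: treat a font as bold if its name contains "bold"."""
    return "bold" in strip_subset_tag(font_name).lower()
```

For example, `looks_bold("ABCDEF+Arial-BoldMT")` is true and `looks_bold("/Times-Roman")` is false. A font whose name does not mention its weight would be missed by this check.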
