I need to extract text from a PDF that contains some bold and underlined text. I tried PyPDF2, but it fails on PDFs that contain formatted text: the extracted text comes out garbled.

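    import PyPDF2

    # Open in binary mode; the context manager closes the file afterwards.
    with open('Downloads/th.pdf', 'rb') as pdf_file:
        read_pdf = PyPDF2.PdfFileReader(pdf_file)
        number_of_pages = read_pdf.getNumPages()
        page = read_pdf.getPage(0)  # first page (zero-based index)
        page_content = page.extractText()

    print(page_content)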

Output

    ˘ˇˆˆ˝˛˚˜ ˜˚!˘˘ˇˆ˙˛˝˚˜˚ !ˆ"#$ˆ%&'˛"˝#$%˝˚'(˚˛)˛˝*+!-.$ˆ˚˛˚˛˘/˛˛0˛122/ 
    ˘˛˘˚˘˚2ˆ$".#$ˆ%˘˛˛$ˆ$%#$ˆ%˛˛˛˛˝˝(0/ 0$%˙˚˙3#"$˘--4˛0˚! 
    ˆ"#$ˆ%56272˛ˇ5'˛6222˛'4˘8(9˛(˜˚˛&˙˙˙˙˙
  • What did you try? Can you post code? What was the error? Did you really mean MyPDF2? I've used **P**yPDF2 with Python before… – Sam Mason Jan 17 '19 at 12:26
  • I don't know about underlined text, but for **bold** fonts, try this updated answer, which uses the free software R: https://stackoverflow.com/questions/53398611/how-to-extract-bold-text-from-a-pdf-using-r/67963468#67963468 – venrey Jun 14 '21 at 00:09

1 Answer

I was using Python 3.6 and the PyPDF2 module:

  1. Get and install Python 3
  2. Install the PyPDF2 module using pip. Run in a terminal (or CMD/PowerShell on Windows): pip install PyPDF2
  3. Run this code in the Python console, as in the tutorial, to read the PDF file and extract the text:

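    import PyPDF2

    # Open the PDF in binary mode and read the first page (index 0).
    pdfFileObj = open('meetingminutes.pdf', 'rb')
    pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
    pageObj = pdfReader.getPage(0)
    pageObj.extractText()  # in the console, this displays the extracted text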
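
Since the same calls can return readable text for one PDF and a run of symbols for another, it may help to check the extracted string before relying on it. The sketch below is only a rough heuristic and is not part of PyPDF2: `looks_garbled` and its `max_modifier_ratio` threshold are hypothetical names, and it assumes the bad output is dominated by characters from the Unicode Spacing Modifier Letters block (U+02B0 to U+02FF), as in the output quoted in the question.

```python
def looks_garbled(text, max_modifier_ratio=0.2):
    """Heuristic check for unreadable text extracted from a PDF page.

    Assumption: garbled output consists mostly of characters from the
    Spacing Modifier Letters block (U+02B0..U+02FF). The 0.2 threshold
    is an arbitrary starting point, not a tuned value.
    """
    # Ignore whitespace so that line breaks do not dilute the ratio.
    visible = [ch for ch in text if not ch.isspace()]
    if not visible:
        # An empty extraction is treated as a failure as well.
        return True
    modifiers = sum(1 for ch in visible if 0x02B0 <= ord(ch) <= 0x02FF)
    return modifiers / len(visible) > max_modifier_ratio
```

Calling `looks_garbled(pageObj.extractText())` for each page would flag pages like the one in the question, so they can be handled separately instead of being passed on as if they were real text.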
    
  • Thanks, but I used the same code. The text is bold and underlined, and I get output like this: ˘ˇˆˆ˝˛˚˜ ˜˚!˘˘ˇˆ˙˛˝˚˜˚ !ˆ"#$ˆ%&'˛"˝#$%˝˚'(˚˛)˛˝*+!-.$ˆ˚˛˚˛˘/˛˛0˛122/˘˛˘˚˘˚2ˆ$".#$ˆ%˘˛˛$ˆ$%#$ˆ%˛˛˛˛˝˝(0/ 0$%˙˚˙3#"$˘--4˛0˚!ˆ"#$ˆ%56272˛ˇ5'˛6222˛'4˘8(9˛(˜˚˛&˙˙˙˙˙ – Nandhakumar Rajendran Jan 17 '19 at 12:39
  • `import PyPDF2 pdf_file = open('Downloads/th.pdf','rb') read_pdf = PyPDF2.PdfFileReader(pdf_file) number_of_pages = read_pdf.getNumPages() page = read_pdf.getPage(0) page_content = page.extractText() print (page_content) ` – Nandhakumar Rajendran Jan 17 '19 at 12:41
  • Strange. I tested it on a PDF of mine that also includes bold and underlined text. Maybe your issue is due to a subsetted (not embedded) font in your PDF file. Try using PdfToolBox https://www.callassoftware.com/en/products/pdftoolbox – Sergey Ovsyannikov Jan 17 '19 at 12:51
  • [Try this pdf](http://164.100.79.153/judis/chennai/index.php/casestatus/viewpdf/280647) – Nandhakumar Rajendran Jan 17 '19 at 13:09
  • I need to extract the text without using any external tools. Is there any other way to read the PDFs? – Nandhakumar Rajendran Jan 20 '19 at 02:10