Python: text extracted from a PDF with pdfplumber or pdfminer duplicates bolded characters

Question

Goal: extract the text of Chinese financial reports.

Implementation: use the Python packages pdfplumber and pdfminer to extract the PDF text to a txt file.

Problem: wherever the PDF text is bold, the corresponding extracted text is duplicated in the txt file.

For example, a bold heading that appears once in the PDF is extracted to the txt file as the same text repeated.

I don't need the repeated text, just the normal text once.

How should I handle this? Should I switch to another package or add a new function?

Please see the code and the original PDF below.

pdfplumber code:

import pdfplumber


def pdf2txt(filename, delLinebreaker=True):
    """Return the text of all pages of a PDF as one string."""
    pageContent = ''
    try:
        with pdfplumber.open(filename) as pdf:
            for page in pdf.pages:
                # extract_text() may return None for a page without text
                # in some pdfplumber versions, so fall back to ''
                text = page.extract_text() or ''
                if delLinebreaker:
                    text = text.replace('\n', '')
                pageContent += text
    except Exception as e:
        print("file: ", filename, ', reason: ', repr(e))
    return pageContent


pdf2txt(r"report.pdf", delLinebreaker=False)

pdfminer code:

from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.pdfpage import PDFPage

rsrcmgr = PDFResourceManager()
# context managers close both files even if a page fails to parse
with open(r"report.txt", 'w', encoding='utf-8') as outfp, open(r"Report.pdf", 'rb') as fp:
    device = TextConverter(rsrcmgr, outfp)
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    for page in PDFPage.get_pages(fp):
        interpreter.process_page(page)
    device.close()

Result of pdfminer is:

The PDF file can be downloaded from the official Shenzhen Stock Exchange website: http://www.szse.cn/disclosure/listed/bulletinDetail/index.html?9324ce3c-6072-499d-8798-b25d641b52ec

The text is **_not bold_** but simulates bold by repeatedly writing same text in *almost* the same positions. So you must find a way to detect this and react accordingly. On first page you have 4 times `浙江精功科技股份有限公司` with 4 boundary boxes *almost* equal to `(189.36, 186.98, 406.49, 252.30)`. — Jorj McKie, Apr 10 '23 at 11:02
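The detection this comment describes (same text written again at almost the same position) could be sketched at the character level as below. This is only an illustration under stated assumptions: `drop_overprinted_chars` is a hypothetical helper, the dictionaries use the `text`, `x0` and `top` keys found in pdfplumber's `page.chars`, and the 1-point tolerance is a guess that may need tuning for this file.

```python
def drop_overprinted_chars(chars, tolerance=1.0):
    """Keep only the first copy of characters drawn repeatedly at nearly the same spot.

    `chars` is a list of dicts with "text", "x0" and "top" keys
    (assumed to follow the layout of pdfplumber's page.chars).
    """
    kept = []
    for ch in chars:
        # A character is a pseudo-bold repeat if an already kept character
        # has the same text and lies within `tolerance` points of it.
        is_repeat = any(
            ch["text"] == k["text"]
            and abs(ch["x0"] - k["x0"]) <= tolerance
            and abs(ch["top"] - k["top"]) <= tolerance
            for k in kept
        )
        if not is_repeat:
            kept.append(ch)
    return kept


# Synthetic example: one glyph written four times with tiny offsets,
# followed by a different glyph further to the right.
sample = [
    {"text": "浙", "x0": 189.36, "top": 186.98},
    {"text": "浙", "x0": 189.60, "top": 186.98},
    {"text": "浙", "x0": 189.36, "top": 187.22},
    {"text": "浙", "x0": 189.60, "top": 187.22},
    {"text": "江", "x0": 207.40, "top": 186.98},
]
print("".join(c["text"] for c in drop_overprinted_chars(sample)))
```

pdfplumber also documents a `Page.dedupe_chars(tolerance=1)` method that filters characters in a similar way, so `page.dedupe_chars().extract_text()` may be worth trying first; whether its default tolerance matches the offsets used in this report has not been checked here.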

Jorj McKie · Answer 1 · 2023-04-10T13:33:27.413

Using PyMuPDF, you can suppress pseudo-bold text, for example like this:

import fitz  # import PyMuPDF

doc = fitz.open("input.pdf")
page = doc[0]  # example first page

# extract text including its coordinates
blocks = page.get_text("dict", sort=True, flags=fitz.TEXTFLAGS_TEXT)["blocks"]
old_bbox = fitz.EMPTY_RECT()  # store previous bbox here
old_text = ""  # store previous text here
for b in blocks:  # loop over text blocks
    for l in b["lines"]:  # lines in current block
        bbox = fitz.Rect(l["bbox"])  # line boundary box
        # text in line - remove leading trailing spaces where possible
        text = " ".join([s["text"].strip() for s in l["spans"]]).strip()
        # check if new bbox overlaps old bbox
        isect = abs(bbox & old_bbox) / abs(bbox)  # overlap ratio
        if text != old_text or isect < 0.5:  # text unequal or no overlap
            print(text)  # print text
        old_text = text  # store for next
        old_bbox = +bbox  # store for next

The code above delivers this:

浙江精功科技股份有限公司 2017 年年度报告全文

浙江精功科技股份有限公司
2017 年年度报告
年年度报告
2018 年
年
年 04 月
月
1

instead of this:

浙江精功科技股份有限公司 2017 年年度报告全文

浙江精功科技股份有限公司
浙江精功科技股份有限公司
浙江精功科技股份有限公司
浙江精功科技股份有限公司
2017 年年度报告
年年度报告
年年度报告
年年度报告
2018 年
年
年
年 04 月
月
月
月
1

As you can see, some duplication remains even with the corrective logic: 2017 年年度报告 is followed by 年年度报告, which probably duplicates the Chinese part of the previous line. To catch these cases too, your logic needs to be smarter still: also check for partial bbox overlaps and for trailing-text equality, for example with a test like old_text.endswith(text). Doing this delivers a better result:

浙江精功科技股份有限公司 2017 年年度报告全文

浙江精功科技股份有限公司
2017 年年度报告
2018 年
年 04 月
1

Even so, the character 年 is still duplicated between "2018" and "04". I think you get the point.
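The smarter check described above (bbox overlap combined with trailing-text equality) might look roughly like the following. This is a library-free sketch, not the code that produced the output shown: `overlap_ratio` and `drop_pseudo_bold_lines` are hypothetical names, bboxes are plain `(x0, y0, x1, y1)` tuples such as the `l["bbox"]` values in the PyMuPDF loop, and the 0.5 threshold is simply carried over from the first snippet.

```python
def overlap_ratio(bbox, other):
    """Intersection area of two (x0, y0, x1, y1) boxes, relative to the area of `bbox`."""
    width = min(bbox[2], other[2]) - max(bbox[0], other[0])
    height = min(bbox[3], other[3]) - max(bbox[1], other[1])
    if width <= 0 or height <= 0:
        return 0.0  # the boxes do not intersect
    area = (bbox[2] - bbox[0]) * (bbox[3] - bbox[1])
    return (width * height) / area if area > 0 else 0.0


def drop_pseudo_bold_lines(lines, min_overlap=0.5):
    """Return the texts of `lines`, skipping lines that repeat the tail of the previous one.

    `lines` is a sequence of (bbox, text) pairs in reading order.
    """
    kept = []
    old_bbox, old_text = None, ""
    for bbox, text in lines:
        is_repeat = (
            old_bbox is not None
            and text != ""
            and old_text.endswith(text)  # also covers text == old_text
            and overlap_ratio(bbox, old_bbox) >= min_overlap
        )
        if not is_repeat:
            kept.append(text)
        # always remember the latest line, as the first snippet does
        old_bbox, old_text = bbox, text
    return kept


# Synthetic boxes, chosen only to illustrate the behavior.
lines = [
    ((100.0, 100.0, 300.0, 120.0), "2017 年年度报告"),
    ((160.2, 100.1, 300.2, 120.1), "年年度报告"),
    ((160.4, 100.2, 300.4, 120.2), "年年度报告"),
]
print(drop_pseudo_bold_lines(lines))
```

Because each line is compared only with the line directly before it, a repeat that shows up at the start of the next line (the 年 in 年 04 月) still passes through, which matches the remaining duplication noted above.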
