Goal: extract the text of Chinese financial reports.
Implementation: use the Python pdfplumber/pdfminer packages to extract PDF text to a txt file.
Problem: wherever the PDF text is bold, the corresponding extracted text in the txt file is duplicated.
An example follows.
Given the following PDF text:
Python extracts it to txt as:
I don't want the repeated text, just the normal text once.
How should I do this: should I switch to another package, or add a new function?
Please see the code and the original PDF text below.
Additional information. pdfplumber code:
import pdfplumber

def pdf2txt(filename, delLinebreaker=True):
    """Return the text of every page in the PDF, concatenated into one string."""
    pageContent = ''
    try:
        with pdfplumber.open(filename) as pdf:
            for page in pdf.pages:
                # extract_text() may return None for a page without text
                # in some pdfplumber versions, so fall back to ''
                text = page.extract_text() or ''
                if delLinebreaker:
                    text = text.replace('\n', '')
                pageContent += text
    except Exception as e:
        print("file: ", filename, ', reason: ', repr(e))
    return pageContent

pdf2txt(r"report.pdf", delLinebreaker=False)
pdfminer code:
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.pdfpage import PDFPage

rsrcmgr = PDFResourceManager()
outfp = open(r"report.txt", 'w', encoding='utf-8')
device = TextConverter(rsrcmgr, outfp)
with open(r"Report.pdf", 'rb') as fp:
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    for page in PDFPage.get_pages(fp):
        interpreter.process_page(page)
device.close()
outfp.close()
The PDF file can be downloaded from the Shenzhen Stock Exchange official website: http://www.szse.cn/disclosure/listed/bulletinDetail/index.html?9324ce3c-6072-499d-8798-b25d641b52ec
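
If adding a new function is the way to go, here is a minimal sketch of what I have in mind. It assumes (I have not confirmed this for the report above) that the bold text is produced by drawing each glyph twice at almost the same position, so the extractor sees every bold character twice. The function name `drop_overprinted_chars`, the `tolerance` value, and the sample coordinates are all made up for illustration. The input is a list of dicts with `text`, `x0` and `top` keys, which is the shape of the entries in pdfplumber's `page.chars`.

```python
def drop_overprinted_chars(chars, tolerance=1.0):
    """Keep only the first copy of a glyph drawn repeatedly at (almost) the same spot.

    `chars` is a list of dicts with "text", "x0" and "top" keys
    (hypothetical helper; the tolerance is a guess, in PDF points).
    """
    kept = []
    for ch in chars:
        # A char is a duplicate if an already-kept char has the same text
        # and sits within `tolerance` of it both horizontally and vertically.
        is_duplicate = any(
            ch["text"] == k["text"]
            and abs(ch["x0"] - k["x0"]) <= tolerance
            and abs(ch["top"] - k["top"]) <= tolerance
            for k in kept
        )
        if not is_duplicate:
            kept.append(ch)
    return kept


# Made-up sample: each bold glyph appears twice, shifted by 0.3 points.
sample = [
    {"text": "营", "x0": 100.0, "top": 50.0},
    {"text": "营", "x0": 100.3, "top": 50.0},
    {"text": "业", "x0": 112.0, "top": 50.0},
    {"text": "业", "x0": 112.3, "top": 50.0},
]

# A genuinely repeated character at different positions must survive.
repeated = [
    {"text": "谢", "x0": 100.0, "top": 80.0},
    {"text": "谢", "x0": 112.0, "top": 80.0},
]

print("".join(c["text"] for c in drop_overprinted_chars(sample)))    # 营业
print("".join(c["text"] for c in drop_overprinted_chars(repeated)))  # 谢谢
```

The comparison against every kept character is quadratic, which should be acceptable for a single page but is not tuned for speed. pdfplumber also documents a `page.dedupe_chars(tolerance=1)` method that looks like it applies the same kind of filtering to a page before `extract_text()` is called, so that may be worth trying first.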