I used the following code to convert multiple PDF files into text files:

    import os
    from os.path import isfile, join

    import PyPDF2

    pdf_dir = "D:/search/pdf"
    txt_dir = "D:/pdf_to_text"

    # Regular, non-hidden files in the PDF directory
    corpus = (f for f in os.listdir(pdf_dir) if not f.startswith('.') and isfile(join(pdf_dir, f)))
    pdfWriter = PyPDF2.PdfFileWriter()
    for filename in corpus:
        pdf = open(join(pdf_dir, filename), 'rb')
        pdfReader = PyPDF2.PdfFileReader(pdf)
        # Page indices start at 0, so begin there to include the first page
        for page in range(pdfReader.numPages):
            pageObj = pdfReader.getPage(page)
            pdfWriter.addPage(pageObj)
            text = pageObj.extractText()
            # One output file per page, e.g. file-page1.txt
            page_name = "{}-page{}.txt".format(filename[:4], page + 1)
            with open(join(txt_dir, page_name), mode="w", encoding='UTF-8') as o:
                o.write(text)
        pdf.close()

This code works, but each PDF has multiple pages, so running it produces one text file per page: file1-page1.txt, file1-page2.txt, file1-page3.txt. Instead, I want a single text file per PDF (for example file1.txt) that contains the text of all its pages. How can I do this?
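
To make the goal concrete, here is a minimal sketch of one possible approach: collect the text of every page first, then open the output file once per PDF instead of once per page. The helper names (`extract_page_texts`, `combined_name`, `write_combined`, `convert_all`) and the `name.pdf` to `name.txt` naming scheme are assumptions made for illustration. The sketch also assumes the same legacy PyPDF2 calls as the code above (`PdfFileReader`, `getPage`, `extractText`), which PyPDF2 3.0 and later no longer provide.

```python
import os
from os.path import isfile, join, splitext


def extract_page_texts(pdf_path):
    """Return the text of every page in pdf_path, in page order.

    Uses the legacy PyPDF2 API from the question. PyPDF2 is imported here
    so the other helpers can be tried without it installed.
    """
    import PyPDF2

    with open(pdf_path, "rb") as pdf:
        reader = PyPDF2.PdfFileReader(pdf)
        return [reader.getPage(i).extractText() for i in range(reader.numPages)]


def combined_name(pdf_filename):
    """Map 'file1.pdf' to 'file1.txt' (hypothetical naming scheme)."""
    return splitext(pdf_filename)[0] + ".txt"


def write_combined(page_texts, txt_path):
    """Write all page texts into one file, separated by newlines."""
    with open(txt_path, mode="w", encoding="UTF-8") as out:
        out.write("\n".join(page_texts))


def convert_all(pdf_dir, txt_dir, extract=extract_page_texts):
    """Create one text file per regular, non-hidden file in pdf_dir.

    Returns the list of paths written. The extract argument exists only
    so the page-text step can be swapped out.
    """
    written = []
    for filename in sorted(os.listdir(pdf_dir)):
        path = join(pdf_dir, filename)
        if filename.startswith(".") or not isfile(path):
            continue
        txt_path = join(txt_dir, combined_name(filename))
        write_combined(extract(path), txt_path)
        written.append(txt_path)
    return written
```

With the directories from the question, this would be called as `convert_all("D:/search/pdf", "D:/pdf_to_text")`. The key difference from the per-page version is that the `open(..., mode="w")` call sits outside the page loop, so each PDF produces exactly one output file.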