
I am using Python 3. My code uses pdfminer to convert PDF files to text, and I want the resulting .txt files written to a new folder. Currently they end up in the same folder as the source PDFs. How do I redirect the output to a different folder? I want the output in a folder called "D:\extracted_text". Code so far:

from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from io import StringIO
import glob
import os

def convert(fname, pages=None):
   if not pages:
       pagenums = set()
   else:
       pagenums = set(pages)

   output = StringIO()
   manager = PDFResourceManager()
   converter = TextConverter(manager, output, laparams=LAParams())
   interpreter = PDFPageInterpreter(manager, converter)

   infile = open(fname, 'rb')
   for page in PDFPage.get_pages(infile, pagenums):
       interpreter.process_page(page)
   infile.close()
   converter.close()
   text = output.getvalue()
   output.close()

   savepath = 'D:/extracted_text/'
   outfile = os.path.splitext(fname)[0] + '.txt'
   comp_name = os.path.join(savepath,outfile)
   print(outfile)
   with open(comp_name, 'w', encoding = 'utf-8') as pdf_file:
       pdf_file.write(text)

   return text    



directory = glob.glob(r'D:\files\*.pdf')  

for myfiles in directory:  
     convert(myfiles)
ajai biltu
  • Possible duplicate of [Telling Python to save a .txt file to a certain directory on Windows and Mac](https://stackoverflow.com/questions/8024248/telling-python-to-save-a-txt-file-to-a-certain-directory-on-windows-and-mac) – Zaraki Kenpachi Jun 06 '19 at 17:33

2 Answers


You can use os.path.join. Give it your directory path and the file name with its extension, and it builds the full path at which the file is created. You can use it like this:

with open(os.path.join(dir_path, fileCompleteName), "w") as file1:
    file1.write("Hello World")

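On Windows, either of the following should work:

"D:/extracted_text/"
os.path.join("D:/", "extracted_text", outfile)

Avoid os.path.join("/", "D:", "extracted_text", outfile): a bare "D:" has no root, so on Windows the result is the drive-relative path D:extracted_text\..., which resolves against the current directory on drive D.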

Make sure the directory "D:/extracted_text" exists.
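As a minimal sketch of the advice above, the directory check and the join can be combined in one small helper. The name save_text and its parameters are hypothetical, chosen for illustration only; os.makedirs with exist_ok=True creates the folder when it is missing and does nothing when it already exists.

```python
import os


def save_text(text, out_dir, filename):
    """Write text to out_dir/filename, creating out_dir if it is missing."""
    # exist_ok=True means no error is raised when the folder already exists.
    os.makedirs(out_dir, exist_ok=True)
    full_path = os.path.join(out_dir, filename)
    with open(full_path, "w", encoding="utf-8") as handle:
        handle.write(text)
    return full_path
```

A call such as save_text("Hello World", "D:/extracted_text", "hello.txt") would then write D:/extracted_text/hello.txt on Windows.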

Manoj Kumar

The problem lies in this line:

outfile = os.path.splitext(os.path.abspath(fname))[0] + '.txt'

If you print out outfile, you'll see that it contains the full path of your file. Replace it with:

outfile = os.path.splitext(fname)[0] + '.txt'

This should solve your problem! Note that this will break if 'D:/extracted_text/' does not exist, so either create that directory manually or create it programmatically with os.makedirs.

EDIT: To break down the problem into smaller pieces, open a new file and run this snippet, see if it does the trick, then make the changes in the original code:

import os

fname = "some_file.pdf"
text = "Here's the extracted text"
savepath = 'D:/extracted_text/'
outfile = os.path.splitext(fname)[0] + '.txt'
print(outfile)
comp_name = os.path.join(savepath,outfile)
print(comp_name)

with open(comp_name, 'w', encoding = 'utf-8') as pdf_file:
    pdf_file.write(text)
Sadi
  • Did the same but it's still generating the converted .txt files in this directory 'D:\files' – ajai biltu Jun 06 '19 at 18:26
  • Any idea why this is still happening? – ajai biltu Jun 06 '19 at 18:29
  • I've edited my answer which mimics the file writing process without the text extraction part. Let me know if that works. – Sadi Jun 06 '19 at 18:37
  • Yes this works. But why is the same thing inside the function not working? – ajai biltu Jun 06 '19 at 18:52
  • Hmmm, it's hard to tell. May I suggest that you refactor your routine? Let the convert() function only handle the conversion part and use another function, let's say, writeToFile() from the snippet in my answer? Bigger routines are a source of errors in general and they are harder to debug as well. – Sadi Jun 08 '19 at 05:12
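
One likely explanation for the difference, offered as an assumption rather than a confirmed diagnosis: glob.glob(r'D:\files\*.pdf') returns full paths, so inside convert() the value of outfile is still an absolute path such as D:\files\report.txt. When a later argument to os.path.join is absolute, the earlier arguments are discarded, so savepath is ignored. The standalone snippet works because "some_file.pdf" has no directory part. The sketch below uses ntpath (the Windows implementation behind os.path) so it behaves the same on any OS; target_name and write_to_file are hypothetical helpers following the refactoring suggested in the last comment.

```python
import ntpath  # Windows path rules, importable on any operating system
import os


def target_name(pdf_path, pathmod=os.path):
    """Return the bare output name, e.g. 'report.txt' for '<dir>/report.pdf'."""
    stem = pathmod.splitext(pathmod.basename(pdf_path))[0]
    return stem + ".txt"


def write_to_file(text, pdf_path, savepath):
    """Write text into savepath, named after the PDF but without its folder."""
    os.makedirs(savepath, exist_ok=True)
    comp_name = os.path.join(savepath, target_name(pdf_path))
    with open(comp_name, "w", encoding="utf-8") as out:
        out.write(text)
    return comp_name


# Demonstration with a path shaped like the ones glob returns in the question.
fname = r"D:\files\report.pdf"
outfile = ntpath.splitext(fname)[0] + ".txt"

# outfile is absolute, so the join drops the save folder.
print(ntpath.join("D:/extracted_text/", outfile))

# With only the base name, the save folder is kept.
print(ntpath.join("D:/extracted_text/", target_name(fname, ntpath)))
```

With this split, convert() would only return the extracted text, and the loop would call write_to_file(convert(myfiles), myfiles, 'D:/extracted_text/').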