I have a directory with pdf files that I want to extract text from (each file individually) and put them into individual .txt files with the same name as the original pdf file.

Example: Directory X contains 'name1.pdf', 'name2.pdf', and 'name3.pdf'

What I want to do is take the text from those files and put them into files called 'name1.txt', 'name2.txt', and 'name3.txt'

What I have so far:

import textract
import glob

for pdf in glob.glob('//home//user//Desktop//X//*.pdf'):
    text = textract.process(pdf)

txtFile = open(...,'w') # confused here
txtFile.write(text)

Thanks in advance!

wra
  • Possible duplicate of [Find all files in directory with extension .txt in Python](http://stackoverflow.com/questions/3964681/find-all-files-in-directory-with-extension-txt-in-python) – Tony Tannous Feb 15 '17 at 16:13
  • No one has said this, but I hope you know PDFs are not plain-text files? – danidee Feb 15 '17 at 16:19

2 Answers

So I hope I'm understanding you correctly, and if I am this should help.

import fnmatch
import os

def walk_directories(Dir, pattern):
    _file_path = None  # stays None if nothing matches the pattern
    for root, directories, files in os.walk(Dir):
        for basename in files:
            if fnmatch.fnmatch(basename, pattern):
                _file_path = os.path.join(root, basename)
    return _file_path  # full path of the last match found

This was written for a different purpose, but it should suit your needs as well: I use it to locate files in "unknown" sub-directories under a single root directory. All you need to know is the filename and the root directory (main folder). The pattern is matched with fnmatch, so it also works with partial filenames as long as you include a wildcard: if you've got three files named, for instance, "pdf1", "pdf2", and "pdf3", a pattern such as "pdf*" matches all of them. Note that the function returns a single path (the last match it finds), not a list.

Honestly, this is overkill if you already know the directories and files you're working with; you could do it a lot more easily. Still, it's pretty straightforward.

Essentially, you supply the folder path in the "Dir" parameter and the filename in the "pattern" parameter:

walk_directories("C:\\Example folder", "Example File.pdf") # or a wildcard pattern such as "pdf1*"
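To illustrate how the pattern argument is matched, here is a small sketch that uses fnmatch directly. The filenames in the list are made up for the example; the point is only that a bare fragment matches nothing unless it contains a wildcard.

```python
import fnmatch

# Hypothetical filenames, used only to show the matching rules.
names = ["pdf1.pdf", "pdf2.pdf", "notes.txt"]

# An exact pattern matches only the identical filename.
exact = [n for n in names if fnmatch.fnmatch(n, "pdf1.pdf")]

# A bare fragment is not a partial match; a wildcard is needed for that.
fragment = [n for n in names if fnmatch.fnmatch(n, "pdf1")]
wildcard = [n for n in names if fnmatch.fnmatch(n, "pdf*")]

# "*.pdf" selects every PDF regardless of its base name.
all_pdfs = [n for n in names if fnmatch.fnmatch(n, "*.pdf")]
```

So for the question's use case, "*.pdf" is probably the pattern you want.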

You'll note this function returns a variable which is, in this case, the full file path of what you're working with.

_path = walk_directories("C:\\example folder", "example file.pdf")

_path would then contain

C:\example folder\example file.pdf

So you could do something like

def read(path):
    temp = None  # returned as-is if path is not a regular file
    try:
        if os.path.isfile(path):
            with open(path, 'r') as inFile:
                temp = inFile.read()
    except IOError as exception:
        raise IOError('%s: %s' % (path, exception.strerror))
    return temp

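The "path" parameter would in this case be _path, and the returned variable (temp) would be the text contained in the file. (A PDF is not plain text, so for your files you would get the text from textract.process(_path), as in your question, rather than from a plain open().) From there it's as simple as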

def write(path, text):
    # Leave an existing file untouched instead of overwriting it
    if os.path.isfile(path):
        return None
    try:
        with open(path, 'w') as outFile:
            outFile.write(text)
    except IOError as exception:
        raise IOError("%s: %s" % (path, exception.strerror))
    return None

So here it's pretty straightforward as well: supply the path and the variable containing the text you want to write. Note that nothing is written if the file already exists.
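As a possible variation on the "don't overwrite" check, Python 3.3+ lets open() do it for you with mode 'x', which fails if the file already exists. The helper name write_if_absent below is hypothetical, and this is only a sketch of that idea, not a replacement for the function above.

```python
def write_if_absent(path, text):
    """Write text to path unless the file already exists.

    Returns True if the file was written, False if it was left alone.
    """
    try:
        # Mode 'x' creates the file and raises FileExistsError if it is already there.
        with open(path, "x") as out_file:
            out_file.write(text)
    except FileExistsError:
        return False
    return True
```

The boolean return value makes it easy for the caller to report which files were skipped.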

suroh

First, each iteration of your loop overwrites the text variable, so by the time you write, only the text of the last PDF is left.

You can use os.path.basename(path) in order to get the filename.

Basically, what you need is:

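import glob
import os
import textract

for pdf in glob.glob('//home//user//Desktop//X//*.pdf'):
    text = textract.process(pdf)  # extracted text, returned as bytes
    # Same base name with a .txt extension, created in the current working directory
    txt_name = os.path.splitext(os.path.basename(pdf))[0] + ".txt"
    with open(txt_name, "wb") as f:  # binary mode, because text is bytes
        f.write(text)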

You can do the writing inside the same loop: this way you loop through the PDFs and write each one to its own txt file, using the os module to get the base name.
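If you would rather have each .txt file land next to its PDF instead of in the current working directory, a pathlib-based sketch could look like the following. The names txt_path_for and convert_all are hypothetical, and the extractor is passed in as a parameter (assumed to take a path string and return bytes, which is what textract.process is expected to do) so the loop does not depend on any one library.

```python
from pathlib import Path


def txt_path_for(pdf_path):
    """Return the path of the .txt file that sits next to pdf_path."""
    return Path(pdf_path).with_suffix(".txt")


def convert_all(directory, extract):
    """Write one .txt per .pdf found directly in directory.

    extract is any callable that takes a path string and returns bytes.
    Returns the list of .txt paths that were written.
    """
    written = []
    for pdf in sorted(Path(directory).glob("*.pdf")):
        out = txt_path_for(pdf)
        out.write_bytes(extract(str(pdf)))
        written.append(out)
    return written
```

With textract installed, the call would be something like convert_all('/home/user/Desktop/X', textract.process).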

omri_saadon
  • Still very new to Python (2 months in) and I really needed this for work, and it worked like a charm! Thank you very much! I will have to read up more on the os module, as I see it's used very often – wra Feb 15 '17 at 17:01
  • Hello again Omri. I attempted to do this with PowerPoints but it gives me the following error: text = textract.process(ppt) NameError: name 'ppt' is not defined – wra Feb 22 '17 at 11:08
  • @wra, hi, I would have to see the whole code to analyze it. I think it's a different question; I would suggest opening a new question for this that includes all the information. – omri_saadon Feb 22 '17 at 11:12