
I am trying to extract the Heading 1 paragraphs from Word documents stored in a directory.

I am very new to Python, so my experience is limited.

My code below does not work; it has syntax and structural errors.

It raises an error saying that document is not defined.

import os

from docx import Document


#document = Document('C:\\Users\\Work\\Desktop\\Docs')

mydir ="C:\\Users\\Work\\Desktop\\Docs\\"
for arch in os.listdir(mydir):
archpath = os.path.join(mydir, arch)
with open(archpath) as f:

    for paragraph in document.paragraphs:

     if paragraph.style.name == 'Heading 1':

      print(paragraph.text)

    document.save = Document('headings.docx')

I have searched both Stack Overflow and the wider internet, but I have not found anything that shows how to loop through the documents in a folder.

Have I set the code up in the correct manner? How can I loop through the documents in a directory and extract each Heading 1 to a new document?

Burhan Khalid
SA90

1 Answer


To get a list of files that you can iterate over, you could use:

import os
os.chdir("path/to/files")  # placeholder: the folder that holds the documents
list_of_files = os.listdir(os.getcwd())  # names of all entries in that folder

and then

for i in list_of_files:
    # extract heading from file i
    pass  # placeholder so the loop body is valid Python
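Note that os.listdir returns bare names, not full paths, and it returns every entry in the folder, not only Word files. A minimal sketch of one way to narrow the listing down, assuming the documents are .docx files sitting directly in the folder (the helper name list_docx_files is made up for this example):

```python
import os


def list_docx_files(folder):
    """Return full paths of the .docx files directly inside folder, sorted by name."""
    paths = []
    for name in sorted(os.listdir(folder)):
        full_path = os.path.join(folder, name)
        # Skip the "~$..." lock files that Word creates while a document is open.
        if name.startswith("~$"):
            continue
        # Keep regular files with a .docx extension (case-insensitive).
        if os.path.isfile(full_path) and name.lower().endswith(".docx"):
            paths.append(full_path)
    return paths
```

Building the full path with os.path.join means the loop works without changing the current directory first.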

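To extract the headings, you can use the docx module, which the question already imports. It is provided by the third-party python-docx package rather than the standard library. The link points to a Stack Overflow answer that shows a way of getting the entire text of a .docx file; the headings could be pulled out in a similar manner. I haven't tried those methods, though.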

Community
Carlo Mazzaferro