2

I am trying to iterate through all tables in a document and extract the text from them. As an intermediate step I am just trying to print the text to the console.

I have looked at other code provided by scanny in similar posts, but for some reason it is not giving me the expected output for the document I am parsing.

The document can be found at https://www.ontario.ca/laws/regulation/140300

from docx import Document
from docx.enum.text import WD_COLOR_INDEX
import os, re, sys

document = Document("path/to/doc")

tables = document.tables

for table in tables:
    for row in table.rows:
        for cell in row.cells:
            for paragraph in cell.paragraphs:
                print(paragraph.text)

I expect this to print out all the text, but instead I get nothing. If I try print(row.cells), it just prints (), which is an empty tuple, I guess. My document definitely does have text in the cells, though. Not sure what's wrong here.

Any help is appreciated,

Joshua Vandenbor

3 Answers

1

It's possible that the cell text is "contained" in a wrapper element that python-docx doesn't yet understand. The most common example is revision marks.

The most direct way to diagnose the problem is to inspect the XML for the table in question, using opc-diag as one option. If it is revision marks, I believe accepting all revisions in the document will fix it, although I haven't actually tried that myself.

If that doesn't work and you post a sample of the table XML I can take a closer look.
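
As a rough first check before reading the raw XML by hand, a sketch like the following counts wrapper elements that occur inside tables. It uses only the standard library and reads word/document.xml straight out of the .docx archive. The function name count_table_wrappers and the particular list of wrapper tags are assumptions made for illustration, not a list of what python-docx does or does not handle.

```python
import zipfile
import xml.etree.ElementTree as ET
from collections import Counter

# WordprocessingML main namespace, in ElementTree's "{uri}" form.
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"

# Assumed list of elements that can wrap table content (revision marks,
# content controls, custom XML). Adjust it to whatever your XML shows.
WRAPPERS = {
    W + name: name
    for name in ("ins", "del", "moveFrom", "moveTo", "sdt", "customXml", "smartTag")
}


def count_table_wrappers(docx_file):
    """Count wrapper elements found anywhere inside a w:tbl element."""
    with zipfile.ZipFile(docx_file) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))

    counts = Counter()
    # Iterative walk that remembers whether we are below a table.
    stack = [(root, False)]
    while stack:
        element, in_table = stack.pop()
        in_table = in_table or element.tag == W + "tbl"
        if in_table and element.tag in WRAPPERS:
            counts[WRAPPERS[element.tag]] += 1
        stack.extend((child, in_table) for child in element)
    return dict(counts)
```

Calling count_table_wrappers("path/to/doc") and getting a non-empty result would suggest that something is wrapping the table content, which is a hint about where to look rather than a diagnosis.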

scanny
  • Thanks, I'm not sure exactly how but I will try accepting the revision marks and see if that helps – Joshua Vandenbor Jan 11 '19 at 17:22
  • Depending on your Word version, you might find it in `Review ribbon > Accept > Accept All Changes and Stop Tracking` – scanny Jan 12 '19 at 07:01
  • I believe I am having this issue with revision marks. In Word, I tried turning off Track Changes and accepting all changes in the document, but this left me with many missing cells. I have no prior XML experience, so looking at the XML has been unenlightening – Czarking Sep 08 '20 at 17:36
0

Found the error. I was using a third-party tool (multiDoc converter) to convert old .doc files into .docx format. It works for the most part, but there must be some metadata that doesn't convert properly, because it was causing the issue. Opening the file and manually saving it as .docx solved the issue. The only problem is that I want to convert 2000+ files into .docx, so I'll need to find another solution for converting the files.
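
One option for the batch conversion, sketched here under assumptions, is to drive LibreOffice in headless mode from Python. This assumes LibreOffice is installed and its soffice executable is on the PATH. The helper names build_convert_command and convert_folder are made up for this example, and I have not checked whether files converted this way avoid the problem described above, so try it on a few files first.

```python
import subprocess
from pathlib import Path


def build_convert_command(doc_path, out_dir, soffice="soffice"):
    """Build the LibreOffice command line that converts one file to .docx."""
    return [
        soffice,
        "--headless",
        "--convert-to", "docx",
        "--outdir", str(out_dir),
        str(doc_path),
    ]


def convert_folder(src_dir, out_dir):
    """Convert every *.doc file in src_dir, one LibreOffice call per file."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    for doc_path in sorted(Path(src_dir).glob("*.doc")):
        # check=True raises CalledProcessError if LibreOffice exits with an error.
        subprocess.run(build_convert_command(doc_path, out_dir), check=True)
```

Running one process per file is slow for 2000+ documents, but a failure then points at a single file instead of aborting the whole batch.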

Joshua Vandenbor
0

My document had hundreds of tables and only a few were coming out as empty (when in fact they weren't). So I tried to extract the data from the pdf version of the same document with tabula, same result: a few newly created tables were coming out empty!

After a bit of digging, I realized that my Word document was in "Track Changes" mode (to have the "change bars" indicate the differences from the previous version), and the tables that didn't get extracted were themselves changes that had not been accepted yet.

SOLUTION: In my case, I had to accept all changes to the document (in the "Review" tab of Word, in the "Accept" drop-down, click "Accept All Changes") and save the document again.
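
If accepting changes by hand is not practical, a possible workaround is to read the cell text directly from the document XML, ignoring whatever wraps it. The sketch below uses only the standard library; the function name table_cell_texts is made up for this example. It collects w:t text, so inserted runs are included, while deleted runs (stored as w:delText) are skipped, which roughly matches what the document would say after accepting all changes.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace, in ElementTree's "{uri}" form.
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def table_cell_texts(docx_file):
    """Return a list of tables, each a list of rows, each a list of cell strings.

    Cells and runs are found by searching all descendants, so wrappers such as
    w:ins or w:sdt do not hide them. Nested tables are not handled specially:
    their text also shows up in the enclosing cell.
    """
    with zipfile.ZipFile(docx_file) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))

    tables = []
    for table in root.iter(W + "tbl"):
        rows = []
        for row in table.iter(W + "tr"):
            cells = []
            for cell in row.iter(W + "tc"):
                # One string per paragraph, joined with newlines.
                paragraphs = [
                    "".join(t.text or "" for t in paragraph.iter(W + "t"))
                    for paragraph in cell.iter(W + "p")
                ]
                cells.append("\n".join(paragraphs))
            rows.append(cells)
        tables.append(rows)
    return tables
```

Printing the result of table_cell_texts("path/to/doc") next to the python-docx output should show whether the missing cells are present in the file at all.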

zukoj