21

How can I read and process contents of every cell of a table in a DOCX file?

I am using Python 3.2 on Windows 7 and PyWin32 to access the MS-Word Document.

I am a beginner, so I don't know the proper way to reach the table cells. So far I have only done this:

import win32com.client as win32
word = win32.gencache.EnsureDispatch('Word.Application')
word.Visible = False 
doc = word.Documents.Open("MyDocument")
Brian Tompsett - 汤莱恩
Aashiq Hussain

4 Answers

44

Jumping in rather late, but I thought I'd put this out anyway: now (2015) you can use the rather neat python-docx library: https://python-docx.readthedocs.org/en/latest/. Then:

from docx import Document

wordDoc = Document('<path to docx file>')

for table in wordDoc.tables:
    for row in table.rows:
        for cell in row.cells:
            print(cell.text)
peterb
  • This is a useful package for many purposes... but one big problem is that the text is given as one long list (paragraphs), and the tables as a second, and the images as a third, without any indication of how they are ordered together. There is an attempt to engineer out of this problem at https://github.com/kmrambo/Python-docx-Reading-paragraphs-tables-and-images-in-document-order- ... but it is *incredibly* slow at processing the documents. If ordering of all elements is needed, you probably have to use Mike Robins' approach. – mike rodent Apr 16 '21 at 19:16
  • This seems to only work on up to python 3.4. Not sure why so popular? – The Kraken Mar 02 '22 at 03:25
  • This library doesn't handle merged cells in a row – easythrees Aug 24 '23 at 01:09
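On the document-order point raised in the comments: a minimal sketch, using only the standard library, that walks the direct children of `w:body` so paragraphs and tables come out in the order they appear. The names `iter_body_blocks` and `make_sample` are made up for this illustration, and the sample is a bare zip holding only `word/document.xml`, not a file Word would open. Content wrapped in content controls (`w:sdt`) is not handled here.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

W = '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}'


def iter_body_blocks(docx_file):
    """Yield ('paragraph', text) or ('table', rows) in document order."""
    with zipfile.ZipFile(docx_file) as z:
        root = ET.fromstring(z.read('word/document.xml'))
    body = root.find(W + 'body')
    for child in body:  # direct children only, so the order is preserved
        if child.tag == W + 'p':
            yield 'paragraph', ''.join(t.text or '' for t in child.iter(W + 't'))
        elif child.tag == W + 'tbl':
            rows = []
            for tr in child.findall(W + 'tr'):
                rows.append([''.join(t.text or '' for t in tc.iter(W + 't'))
                             for tc in tr.findall(W + 'tc')])
            yield 'table', rows


def make_sample():
    """Hypothetical stand-in for a real .docx: paragraph, table, paragraph."""
    xml = ('<w:document xmlns:w="http://schemas.openxmlformats.org/'
           'wordprocessingml/2006/main"><w:body>'
           '<w:p><w:r><w:t>Intro</w:t></w:r></w:p>'
           '<w:tbl><w:tr>'
           '<w:tc><w:p><w:r><w:t>A1</w:t></w:r></w:p></w:tc>'
           '<w:tc><w:p><w:r><w:t>B1</w:t></w:r></w:p></w:tc>'
           '</w:tr></w:tbl>'
           '<w:p><w:r><w:t>Outro</w:t></w:r></w:p>'
           '</w:body></w:document>')
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, 'w') as z:
        z.writestr('word/document.xml', xml)
    buf.seek(0)
    return buf


blocks = list(iter_body_blocks(make_sample()))
for kind, content in blocks:
    print(kind, content)
```

For a real file, pass its path instead of `make_sample()`.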
26

Here is what works for me in Python 2.7:

import win32com.client as win32
word = win32.Dispatch("Word.Application")
word.Visible = 0
word.Documents.Open("MyDocument")
doc = word.ActiveDocument

To see how many tables your document has:

doc.Tables.Count

Then you can select the table you want by its index. Note that, unlike Python, COM indexing starts at 1:

table = doc.Tables(1)

To select a cell:

table.Cell(Row = 1, Column= 1)

To get its content:

table.Cell(Row =1, Column =1).Range.Text
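Putting those steps together, here is a rough sketch of reading every cell of one table into a list of rows. It assumes a regular grid with no merged cells (as far as I know, Word raises a COM error when asked for a cell that a merge has swallowed). `read_table`, `clean_cell_text` and `FakeTable` are names invented for this example; `FakeTable` only imitates the handful of COM members used, so the sketch can be run without Word installed.

```python
from types import SimpleNamespace


def clean_cell_text(raw):
    """Strip the end-of-cell marker (CR + BEL) that Word appends to cell text."""
    return raw.rstrip('\r\x07')


def read_table(table):
    """Return a list of rows (lists of str) from a Word COM Table object."""
    rows = []
    for r in range(1, table.Rows.Count + 1):          # COM indices start at 1
        row = []
        for c in range(1, table.Columns.Count + 1):
            row.append(clean_cell_text(table.Cell(Row=r, Column=c).Range.Text))
        rows.append(row)
    return rows


class FakeTable:
    """Hypothetical stand-in mimicking Rows.Count, Columns.Count and Cell()."""

    def __init__(self, grid):
        self._grid = grid
        self.Rows = SimpleNamespace(Count=len(grid))
        self.Columns = SimpleNamespace(Count=len(grid[0]))

    def Cell(self, Row, Column):
        text = self._grid[Row - 1][Column - 1] + '\r\x07'
        return SimpleNamespace(Range=SimpleNamespace(Text=text))


demo = read_table(FakeTable([['Name', 'Qty'], ['Bolt', '4']]))
print(demo)
```

With a real document you would call `read_table(doc.Tables(1))` instead of passing the fake.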

Hope that this helps.

EDIT:

An example of a function that returns Column index based on its heading:

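def Column_index(header_text):
    # Word appends an end-of-cell marker ('\r\x07') to Range.Text, so strip it before comparing
    for i in range(1, table.Columns.Count + 1):
        if table.Cell(Row=1, Column=i).Range.Text.rstrip('\r\x07') == header_text:
            return i
    return None  # no column has this heading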

Then you can access the cell you want like this, for example:

table.Cell(Row =1, Column = Column_index("The Column Header") ).Range.Text
YusuMishi
  • Thank You Very Much That Worked For Me. I Have One More Question, Is There A Way To Access A Table Cell By Its Column Heading And Row No. ? Thanks Again :) – Aashiq Hussain Apr 30 '12 at 10:07
  • I think Column Headings in Ms Word are regular cells. They should just be the first row of the table. However you can write a function that returns Column index. I'll edit my answer to show you an example. – YusuMishi Apr 30 '12 at 17:42
  • This code does not catch the table that could be in the header.. Do you have a solution for this?? I would really appreciate your help, thank you – Norfeldt May 08 '13 at 13:06
  • Could you pls elaborate! What do you mean by 'catch the table that could be in the header'? – YusuMishi May 08 '13 at 16:05
  • @YusuMishi yes of course - check this out http://stackoverflow.com/questions/16485343/reading-table-contet-in-header-and-footer-in-ms-word-file-using-python – Norfeldt May 10 '13 at 15:06
  • @YusuMishi if you can take a look [here](http://stackoverflow.com/questions/24055315/manage-complex-word-table-with-pywin32-in-python-3-4) I think maybe you can help me. Problem with ms word tables dimensions. – Yann Jun 23 '14 at 08:38
  • Any suggestions on how to create a table and stylize table cells? – Yebach Nov 06 '14 at 08:28
  • @YusuMishi how can I use this to check if any of the cells are void? – feltersnach Oct 31 '18 at 20:37
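On the question about tables in the header: in a .docx package, header and footer content normally lives in separate XML parts (conventionally `word/header1.xml`, `word/footer1.xml` and so on; strictly the names come from the package relationships, so treat the pattern below as an assumption). A stdlib-only sketch that scans those parts as well as the body; `tables_by_part` and `make_sample` are illustrative names, and the sample is a bare zip, not a file Word would open.

```python
import io
import re
import zipfile
import xml.etree.ElementTree as ET

W = '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}'
NS_DECL = 'xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main"'
# Assumed naming convention for the body, header and footer parts.
PART_PATTERN = re.compile(r'word/(document|header\d*|footer\d*)\.xml')


def tables_by_part(docx_file):
    """Map each matching XML part name to the list of tables found in it."""
    result = {}
    with zipfile.ZipFile(docx_file) as z:
        for name in z.namelist():
            if not PART_PATTERN.fullmatch(name):
                continue
            root = ET.fromstring(z.read(name))
            tables = []
            for tbl in root.iter(W + 'tbl'):
                tables.append([[''.join(t.text or '' for t in tc.iter(W + 't'))
                                for tc in tr.findall(W + 'tc')]
                               for tr in tbl.findall(W + 'tr')])
            if tables:
                result[name] = tables
    return result


def make_sample():
    """Hypothetical package with one table in the body and one in a header."""
    cell = '<w:tc><w:p><w:r><w:t>{}</w:t></w:r></w:p></w:tc>'
    body = ('<w:document ' + NS_DECL + '><w:body><w:tbl><w:tr>'
            + cell.format('Body') + '</w:tr></w:tbl></w:body></w:document>')
    header = ('<w:hdr ' + NS_DECL + '><w:tbl><w:tr>'
              + cell.format('Logo') + cell.format('Title')
              + '</w:tr></w:tbl></w:hdr>')
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, 'w') as z:
        z.writestr('word/document.xml', body)
        z.writestr('word/header1.xml', header)
    buf.seek(0)
    return buf


found = tables_by_part(make_sample())
print(found)
```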
19

I found a simple code snippet in the blog post "Reading Table Contents Using Python" by etienne.

The great thing about this is that you don't need any non-standard python libraries installed.

The format of a docx file is described in the Office Open XML specification.

import zipfile
import xml.etree.ElementTree

WORD_NAMESPACE = '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}'
PARA = WORD_NAMESPACE + 'p'
TEXT = WORD_NAMESPACE + 't'
TABLE = WORD_NAMESPACE + 'tbl'
ROW = WORD_NAMESPACE + 'tr'
CELL = WORD_NAMESPACE + 'tc'

with zipfile.ZipFile('<path to docx file>') as docx:
    tree = xml.etree.ElementTree.XML(docx.read('word/document.xml'))

for table in tree.iter(TABLE):
    for row in table.iter(ROW):
        for cell in row.iter(CELL):
            print(''.join(node.text or '' for node in cell.iter(TEXT)))
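One thing to watch with joining every text node of a cell: a cell that holds several paragraphs comes out as one run-together string. A small sketch of a variant that joins the runs within each paragraph and then separates paragraphs with newlines; `cell_text` and `SAMPLE_CELL` are names made up for this illustration, and paragraphs of any nested table inside the cell are included as well.

```python
import xml.etree.ElementTree as ET

W = '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}'


def cell_text(cell):
    """Join the runs of each paragraph, then join paragraphs with newlines."""
    paragraphs = []
    for para in cell.iter(W + 'p'):
        paragraphs.append(''.join(node.text or '' for node in para.iter(W + 't')))
    return '\n'.join(paragraphs)


# Hypothetical cell with two paragraphs; the second one is split over two runs.
SAMPLE_CELL = (
    '<w:tc xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">'
    '<w:p><w:r><w:t>Line one</w:t></w:r></w:p>'
    '<w:p><w:r><w:t>Line </w:t></w:r><w:r><w:t>two</w:t></w:r></w:p>'
    '</w:tc>'
)

demo_text = cell_text(ET.fromstring(SAMPLE_CELL))
print(repr(demo_text))
```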
Mike Robins
  • Thanks! Do you know how to handle merged cells? For example, I have a table with 2 rows and 3 columns but the last row's first two columns are merged. The result of the above code is that the content of the third column is read as the second column, not the third. – Shani Shalgi Feb 04 '19 at 08:51
  • @Shani, it has been a couple of years since I was looking at Excel files. You could unzip the word doc and examine the structure of your merged cells and modify the code above. Alternatively, there is much better support for Microsoft documents in python since I wrote this. You might do better using one of the python modules. I am not in a position to recommend any specifically. – Mike Robins Feb 05 '19 at 03:14
  • Actually the code above was much simpler and covered more cases than the modules I found on the web (e.g. tabula does not read fields if they are used in a table, and therefore whole tables are distorted; I also tried other packages). I will have a look inside the structure. – Shani Shalgi Feb 05 '19 at 09:03
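For the horizontally merged case described in the comments: in the XML, a cell that spans several grid columns carries a `w:gridSpan` element inside its `w:tcPr`. A sketch, under that assumption, that pads each spanned cell with empty strings so later columns keep their positions. `expand_row` and `SAMPLE_ROW` are illustrative names. Vertical merges (marked with `w:vMerge`) are not handled here.

```python
import xml.etree.ElementTree as ET

W = '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}'


def expand_row(tr):
    """Return one entry per grid column, padding cells that span columns."""
    cells = []
    for tc in tr.findall(W + 'tc'):
        text = ''.join(t.text or '' for t in tc.iter(W + 't'))
        span = 1
        props = tc.find(W + 'tcPr')
        if props is not None:
            grid_span = props.find(W + 'gridSpan')
            if grid_span is not None:
                # Namespaced attributes use the same '{uri}name' form as tags.
                span = int(grid_span.get(W + 'val', '1'))
        cells.append(text)
        cells.extend([''] * (span - 1))  # placeholders for the swallowed columns
    return cells


# Hypothetical row: the first cell spans two columns, followed by a normal cell.
SAMPLE_ROW = (
    '<w:tr xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">'
    '<w:tc><w:tcPr><w:gridSpan w:val="2"/></w:tcPr>'
    '<w:p><w:r><w:t>merged</w:t></w:r></w:p></w:tc>'
    '<w:tc><w:p><w:r><w:t>third</w:t></w:r></w:p></w:tc>'
    '</w:tr>'
)

expanded = expand_row(ET.fromstring(SAMPLE_ROW))
print(expanded)
```

Padding with empty strings is one choice; repeating the merged text in every spanned column would work just as well if that suits the data better.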
0

Aashiq Hussain, I ran into the same problem as Shani Shalgi (vertically merged cells) but resolved it with the commented code below. Enjoy!

#source: combination of ChatGPT (regarding stackoverflow's policy and the spirit of the policy, this was for a STARTING POINT ONLY; the code was heavily modified afterwards for quality purposes; I must however give credit to the amazing bot for what it provided!), stackoverflow.com, and God
#function: concatenates all tables (even with vertically-merged cells!) in all Word documents in 1 folder  and  outputs into 1 massive Excel table
#tip: if you just change the dir_path and output_path at the top, rest 'just works'   (it will create a pickle file storing contents in your directory so you can easily access later if needed! it will also open up the 2 created files at the end for your ease!)
#tip.2: if you happen to get an error like 'does not support enumeration' around line 42, that means 1 or more of the tables in 1 or more Word documents (I've littered the code with print statements so you'll know right before failure which one it is) has a table that doesn't behave (I only had 1 out of 100+ tables so I just moved it to a temp Excel manually and pasted it manually into the final massive Excel after this code ran)
#system specs: Windows 10, python 3.10, pandas 1.5.3, pywin32 304
import os,time
import pandas as pd
import win32com.client as win32

# Set the directory path for the Word documents
# dir_path = r"C:\path\to\folder\containing\Word\documents"
dir_path = r"C:\Users\pablodumas\Downloads\All_.Words.20230421"#this contains all Word docs (can contain more, but will only pick up .docx and .doc files to process)
output_path0=r'C:\Users\pablodumas\Downloads\All_.Words.20230421.finalTable1.xlsx'#this is the final Excel to write to and open

# Initialize an empty list to hold the tables from each document
tables_list = []

# Loop through each file in the directory
for filename in os.listdir(dir_path):#typical
    # Check if the file is a Word document
    if filename.endswith(".docx") or filename.endswith(".doc"):#from above, will only pick up .docx and .doc files to process
        # Create a full path to the file
        filepath = os.path.join(dir_path, filename)
        print('filepath')
        print(filepath)
        if os.path.basename(filepath).startswith(r'~'):#omits temporary, hidden files, which love to cause errors, from further processing
            continue
        
        # Initialize a Word application object and open the document
        # word = win32.gencache.EnsureDispatch("Word.Application")#did NOT work as was waiting to respond and raised error due to block ?
        word = win32.DispatchEx("Word.Application")#so opened each Word in unique instance instead
        doc = word.Documents.Open(filepath)
        print('doc')
        print(doc)


        # Loop through each table in the document and convert it to a pandas DataFrame
        for i in range(1, doc.Tables.Count+1):
            tbl = doc.Tables(i)
            data = []
            keys = []
            num_cols = tbl.Columns.Count
            for row_idx, row in enumerate(tbl.Rows):
                # Check if the row contains vertically merged cells
                is_merged = True#we are going to treat EVERY cell as a merged cell (if it's not merged, great, nothing much happens; if it's merged, it will concatenate the contents to result in 1 cell)
                # If the row is merged, split it into multiple rows
                merged_data = []
                for cell_idx, cell in enumerate(row.Cells):
                    try:
                        merged_data.append(cell.Range.Text.strip())
                        if row_idx == 0:#if the row is 0 (usually this is where the column headers are), then treat them as headers by appending them to keys
                            keys.append(cell.Range.Text.strip())
                    except Exception:#narrower than a bare except, so Ctrl+C (KeyboardInterrupt) still stops the script
                        merged_data[-1] += "\n" + cell.Range.Text.strip()#if causes error (which means vertically-merging), then concatenate the string with new lines (isn't try...except so great!)
                # If the merged row has too few columns, add empty cells to the end
                while len(merged_data) < num_cols:
                    merged_data.append("")
                data.append(merged_data)
            tbl_df = pd.DataFrame(data, columns=keys)
            print(tbl_df)
            print(tbl_df.columns)
            print(tbl_df.applymap(lambda x:str(x).replace('\r','').replace('\x07','').replace('\x0B','')))#in my example, there were a lot of no-no characters (e.g. \r,\x07,\x0B which are carriageReturn,bell,verticalTab) that Excel throws error if writing them so replaced in data with ''   (this and the below 'replace' you may have to tweak if you still have left over characters; .csv and VS Code are your friends (write to .csv, copy output to VS Code new file, it should highlight in bright red what are naughty characters!))
            tbl_df=tbl_df.applymap(lambda x:str(x).replace('\r','').replace('\x07','').replace('\x0B',''))
            print(tbl_df.columns.str.replace('\r','').str.replace('\x07','').str.replace('\x0B',''))#same as above, but replacing in columns
            tbl_df.columns=tbl_df.columns.str.replace('\r','').str.replace('\x07','').str.replace('\x0B','')
            tbl_df['sourceFile0']=os.path.basename(filename)#adding which Word doc the particular pandas.DataFrame data came from (so when it ends up in 1 massive Excel, you can tell which came from where)

            # Append the DataFrame to the list
            tables_list.append(tbl_df)


        # Close the document and Word application
        doc.Close()
        word.Quit()

# Concatenate all the tables into a single DataFrame
combined_df = pd.concat(tables_list, ignore_index=True)
pathOfThisFileThatIsRunningRightHere0=os.path.abspath(__file__)
pickleDumpPath0=pathOfThisFileThatIsRunningRightHere0+time.strftime('%Y%m%d')+'.2.pickle'
combined_df.to_pickle(pickleDumpPath0)
print('combined_df')
print(combined_df)
print(combined_df.applymap(lambda x:str(x).encode('ascii','ignore').decode('ascii')))#similar to above, getting rid of non-ascii characters since Excel doesn't like some non-ascii characters / throws an error when writing
combined_df=combined_df.applymap(lambda x:str(x).encode('ascii','ignore').decode('ascii'))
combined_df=combined_df.drop_duplicates()#don't want duplicates in data (especially since the data I was working with placed the headers in the actual data sometimes!)
print('pickleDumpPath0')
print(pickleDumpPath0)

# Write the DataFrame to an Excel file
output_path1=output_path0+'.csv'#test in .csv (since .csv can handle some characters Excel can't)
combined_df.to_csv(output_path1, index=False)
os.startfile(output_path1)#open up .csv
combined_df.to_excel(output_path0, index=False)#final Excel
os.startfile(output_path0)#open up Excel

#py -3.10 "C:\Users\pablodumas\Documents\Code\loopThroughDirectoryWordFilesAndReadFilesTablesInto1TableOutputToExcel.redacted.py"
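The chained `.replace()` calls used above to strip carriage returns, bell characters and vertical tabs can also be written as a single translation table. A small sketch of that alternative; `clean` and `_STRIP` are names invented here, not part of the script above.

```python
# Characters removed by the script above: carriage return, bell, vertical tab.
_STRIP = str.maketrans('', '', '\r\x07\x0b')


def clean(value):
    """Return str(value) with the unwanted control characters removed."""
    return str(value).translate(_STRIP)


cleaned_samples = [clean('Total\r\x07'), clean('a\x0bb'), clean(42)]
print(cleaned_samples)
```

In the pandas version listed in the system specs, this could presumably be applied cell by cell with `tbl_df.applymap(clean)`.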