1

I could really use some help in how to extract data from Word documents using SSIS and inserting the extracted data in SQL. There are 10,000 - 13,000 Word files to process. The files most likely aren't consistent over the years. Any help is greatly appreciated!

Below is the example data from the Word documents that I'm interested in capturing. Note that Date and Job No are in the Header section.

Customer : Test Customer Customer Ref. : 123456 Contact : Test Contact

Part No. : 123456789ABCDEFG Manufacturer : Some Mfg. Package : 123-456 Date Codes : 1234 Lot Number : 123456 Country of Origin : Country Total Incoming Qty : 1 pc XRF Test Result : PASS HCT Result : PASS Solder Test Result : PASS

enter image description here

Hadi
  • 36,233
  • 13
  • 65
  • 124
outdoorsman33
  • 115
  • 11

2 Answers2

1

My approach would be this:

  1. Create a script in Python that extracts your data from the Word files and save them in XML or JSON format
  2. Create SSIS package to load the data from each XML/JSON file to SQL Server
Francesco Mantovani
  • 10,216
  • 13
  • 73
  • 113
  • Manntovani I was thinking of using a language to save the Word files in XML, Excel, etc. I'm not familiar with Python though. Do you know of any examples with C# or VB.NET? Thank you for your approach and comment! – outdoorsman33 Mar 16 '22 at 18:52
  • There are many examples, Google them: https://www.e-iceblue.com/Tutorials/Spire.Doc/Spire.Doc-Program-Guide/How-to-Use-C-/VB.NET-to-Convert-Word-Doc-to-XML.html – Francesco Mantovani Mar 16 '22 at 20:51
  • @Franceso Mantovani Appreciate the example! – outdoorsman33 Mar 17 '22 at 02:41
1

1. Using a script component as a source

To import data from Microsoft Word into SQL Server, you can use a script component as a data source where you can implement a C# script to parse document files using Office Interoperability libraries or any third-party assembly.

2. Extracting XML from DOCX file

DOCX file is composed of several embedded files. Text is mainly stored within an XML file. You can use a script task or Execute Process Task to extract the DOCX file content and use an XML source to read the data.

3. Converting the Word document into a text file

The third approach is to convert the Word document into a text file and use a flat-file connection manager to read the data.

Hadi
  • 36,233
  • 13
  • 65
  • 124