I am currently starting a project where I need to manage contractual requirements for a large-scale engineering and construction contract.
Unfortunately, all of the identified project requirements were delivered via PDF (there are thousands of Reqs...). I’ve since taken these PDFs and converted them to spreadsheets in Excel. I will eventually use .CSV files to import these into our RM Tool.
My problem is that all of the project requirement PDFs were written for ease of readability- not so much for use in spreadsheet form. Every section is written like a numbered list, which is fine, but I do not need to have the requirements decomposed to the level they are.
I need to take the “outline, list-numbered format” of the docs and be able to have the child requirements (sub items in example below), combined (concatenated) into rows based on their sections and list numbering.
Right now, I am doing all of this merging of rows by hand, but I don't see how I can get the task done quick enough.
Here is an example of how the PDFs look from the client:
Section 1-1.1: General
A. The contractor shall do “this”, then “that”.
“This” will cost less than this much money
“That” will cost less than this amount of money
a. If “that” costs more, it should not be added
b. Another option is “this”
B. The contractor shall name “this”...
- The name should use proper grammar
C. The contractor shall complete work before 2022
Section 1-2.1: Materials
A. The contractor shall use these three materials:
Aluminum
Steel
Cement
a. Cement should be gray only
__i. Gray coloring must be this shade
__ii. Gray cement must not lose coloring
b. Cement should be mixed on-site
Section 1-2.2: Material Suppliers
A. Aluminum must be supplied by “ABC, Inc”
B. Steel must be supplied by “DEF, Inc”
C. Cement must be supplied by "GHI, Inc"
Table 1-1: Supplier Contact Info
[TABLE HERE]
Section 1-3.1: Landscaping
...
Here is how the Excel docs look in their main state. Note that we have added attributes to support each requirement item row, so there is not just one or two columns in these sheets:
| [Col A] Requirement List Item | [Col B] Requirement Text |
|-------------------------------|-------------------------------------------------|
| Section 1-1.1 | General |
| A. | The contractor shall do “this”, then “that”. |
| 1 | “This” will cost less than this much money |
| 2 | “That” will cost less than this amount of money |
| a. | If “that” costs more, it should not be added |
| b. | Another option is “this” |
| B. | The contractor shall name “this”... |
| 1 | The name should use proper grammar |
| C. | The contractor shall complete work before 2022 |
| Section 1-2.1 | Materials |
| A. | The contractor shall use these three materials: |
| 1 | Aluminum |
| 2 | Steel |
| 3 | Cement |
| a. | Cement should be gray only |
| i. | Gray coloring must be this shade |
| ii. | Gray cement must not lose coloring |
| b. | Cement should be mixed on-site |
| Section 1-2.2 | Material Suppliers |
| A. | Aluminum must be supplied by “ABC, Inc” |
| B. | Steel must be supplied by “DEF, Inc” |
| C. | Cement must be supplied by "GHI, Inc" |
| Table 1-1 | [Table Here] |
| Section 1-3.1 | Landscaping |
| ... | ... |
As you can see, the current format I have just has each line item in a different row, with no clear indication of whether or not it's a parent or child item. It is easy to tell what items exist within each Section, but we need to determine what items are included for each top level (Level 1 being A, B, C, D items).
Finally, here is what we’re hoping to have for an end result. We want to just have the rows of data sorted by their section number and then their top-level child requirements (Levels A,B,C,D in this example):
Does anyone have an idea on how to create a Python script to handle this task? I have been trying to figure this out for a week and a half, but I can’t seem to find the right functions to use. I have looked up Excel functions/forumlas, and different uses for Concat and Merge/TextJoin.
TLDR: I want to take the Project Requirement data given to us in numbered list format (from PDFs) and organize it in Excel spreadsheets to group/concate/merge rows only based on what the section and top-level sub-item data is (Please see examples above).
Thank you for any help/advice. I am really interested in learning script-writing to make my job easier, I just am at a loss as to where to start with something like this.
Here is the VBA code that is unable to complete the task. It seems to break after 2-3 sections of data:
Sub GroupLists()
Dim startRow As Integer
Dim lastRow As Integer
Dim outputRow As Integer
outputRow = 0
Dim outputStr As String
Dim col_list As Integer
Dim col_list_str As String
Dim col_content As Integer
Dim col_content_str As String
Dim currentList As String
Dim i As Integer
Dim j As Integer
Dim parseSheet As String
Dim outputSheet As String
Dim startList As Boolean
startList = False
Dim indent As Integer
indent = 0
'regex patterns
Dim regexObject As RegExp
Set regexObject = New RegExp
Dim pattern_listHeader As String
Dim pattern_listUpAlpha As String
Dim pattern_listLowAlpha As String
Dim pattern_listNum As String
pattern_listHeader = "[\d]-"
pattern_listUpAlpha = "[A-Z]"
pattern_listLowAlpha = "[a-z]"
'configurable variables
parseSheet = "Input"
outputSheet = "Output"
col_list = 2
col_content = 3
outputRow = 1
startRow = Worksheets("Control").Cells(2, 1).Value
lastRow = Worksheets("Control").Cells(2, 2).Value
For i = startRow To lastRow
col_list_str = Worksheets(parseSheet).Cells(i, col_list).Value
col_content_str = Worksheets(parseSheet).Cells(i, col_content).Value
With regexObject
.Pattern = "^[\d]-"
End With
If regexObject.Test(col_list_str) = True Then
If startList = True Then
'write output to row, then write current row to output row
Worksheets(outputSheet).Cells(outputRow, 1).Value = outputStr
outputRow = outputRow + 1
End If
outputStr = Worksheets(parseSheet).Cells(i, col_content)
'write to row
Worksheets(outputSheet).Cells(outputRow, 1).Value = outputStr
'increment row
outputRow = outputRow + 1
outputStr = ""
startList = False
End If
With regexObject
.Pattern = "^[A-Z]."
End With
If regexObject.Test(col_list_str) = True Then
If startList = True Then
startList = False
Else
startList = True
End If
outputStr = outputStr & col_list_str & col_content_str & vbNewLine
'outputRow = outputRow + 1
End If
If startList = True Then
With regexObject
.Pattern = "^[a-z]."
End With
If regexObject.Test(col_list_str) = True Then
'indent = 1
outputStr = outputStr & col_list_str & col_content_str & vbNewLine
'outputRow = outputRow + 1
End If
With regexObject
.Pattern = "^[0-9]."
End With
If regexObject.Test(col_list_str) = True Then
'indent = indent + 1
outputStr = outputStr & col_list_str & col_content_str & vbNewLine
'outputRow = outputRow + 1
End If
End If
Next i
If startList = True Then
Worksheets(outputSheet).Cells(outputRow, 1).Value = outputStr
End If
End Sub
Typical duplication shown below. In response to comments for @Dy.Lee.