3

I am currently starting a project where I need to manage contractual requirements for a large-scale engineering and construction contract.

Unfortunately, all of the identified project requirements were delivered via PDF (there are thousands of Reqs...). I’ve since taken these PDFs and converted them to spreadsheets in Excel. I will eventually use .CSV files to import these into our RM Tool.

My problem is that all of the project requirement PDFs were written for ease of readability- not so much for use in spreadsheet form. Every section is written like a numbered list, which is fine, but I do not need to have the requirements decomposed to the level they are.

I need to take the “outline, list-numbered format” of the docs and be able to have the child requirements (sub items in example below), combined (concatenated) into rows based on their sections and list numbering.

Right now, I am doing all of this merging of rows by hand, but I don't see how I can get the task done quick enough.

Here is an example of how the PDFs look from the client:


Section 1-1.1: General

A. The contractor shall do “this”, then “that”.

  1. “This” will cost less than this much money

  2. “That” will cost less than this amount of money

    a. If “that” costs more, it should not be added

    b. Another option is “this”

B. The contractor shall name “this”...

  1. The name should use proper grammar

C. The contractor shall complete work before 2022


Section 1-2.1: Materials

A. The contractor shall use these three materials:

  1. Aluminum

  2. Steel

  3. Cement

    a. Cement should be gray only

    __i. Gray coloring must be this shade

    __ii. Gray cement must not lose coloring

    b. Cement should be mixed on-site

Section 1-2.2: Material Suppliers

A. Aluminum must be supplied by “ABC, Inc”

B. Steel must be supplied by “DEF, Inc”

C. Cement must be supplied by "GHI, Inc"


Table 1-1: Supplier Contact Info

[TABLE HERE]


Section 1-3.1: Landscaping

...


Here is how the Excel docs look in their main state. Note that we have added attributes to support each requirement item row, so there is not just one or two columns in these sheets:

CURRENT EXAMPLE IMAGE

| [Col A] Requirement List Item | [Col B] Requirement Text                        |
|-------------------------------|-------------------------------------------------|
| Section 1-1.1                 | General                                         |
| A.                            | The contractor shall do “this”, then “that”.    |
| 1                             | “This” will cost less than this much money      |
| 2                             | “That” will cost less than this amount of money |
| a.                            | If “that” costs more, it should not be added    |
| b.                            | Another option is “this”                        |
| B.                            | The contractor shall name “this”...             |
| 1                             | The name should use proper grammar              |
| C.                            | The contractor shall complete work before 2022  |
| Section 1-2.1                 | Materials                                       |
| A.                            | The contractor shall use these three materials: |
| 1                             | Aluminum                                        |
| 2                             | Steel                                           |
| 3                             | Cement                                          |
| a.                            | Cement should be gray only                      |
| i.                            | Gray coloring must be this shade                |
| ii.                           | Gray cement must not lose coloring              |
| b.                            | Cement should be mixed on-site                  |
| Section 1-2.2                 | Material Suppliers                              |
| A.                            | Aluminum must be supplied by “ABC, Inc”         |
| B.                            | Steel must be supplied by “DEF, Inc”            |
| C.                            | Cement must be supplied by "GHI, Inc"           |
| Table 1-1                     | [Table Here]                                    |
| Section 1-3.1                 | Landscaping                                     |
| ...                           | ...                                             |

As you can see, the current format I have just has each line item in a different row, with no clear indication of whether or not it's a parent or child item. It is easy to tell what items exist within each Section, but we need to determine what items are included for each top level (Level 1 being A, B, C, D items).


Finally, here is what we’re hoping to have for an end result. We want to just have the rows of data sorted by their section number and then their top-level child requirements (Levels A,B,C,D in this example):

FINAL EXAMPLE IMAGE

Does anyone have an idea on how to create a Python script to handle this task? I have been trying to figure this out for a week and a half, but I can’t seem to find the right functions to use. I have looked up Excel functions/forumlas, and different uses for Concat and Merge/TextJoin.

TLDR: I want to take the Project Requirement data given to us in numbered list format (from PDFs) and organize it in Excel spreadsheets to group/concate/merge rows only based on what the section and top-level sub-item data is (Please see examples above).

Thank you for any help/advice. I am really interested in learning script-writing to make my job easier, I just am at a loss as to where to start with something like this.

Here is the VBA code that is unable to complete the task. It seems to break after 2-3 sections of data:

Sub GroupLists()
Dim startRow As Integer
Dim lastRow As Integer
Dim outputRow As Integer
outputRow = 0
Dim outputStr As String
Dim col_list As Integer
Dim col_list_str As String
Dim col_content As Integer
Dim col_content_str As String

Dim currentList As String
Dim i As Integer
Dim j As Integer

Dim parseSheet As String
Dim outputSheet As String

Dim startList As Boolean
startList = False

Dim indent As Integer
indent = 0

'regex patterns
Dim regexObject As RegExp
Set regexObject = New RegExp
Dim pattern_listHeader As String
Dim pattern_listUpAlpha As String
Dim pattern_listLowAlpha As String
Dim pattern_listNum As String
pattern_listHeader = "[\d]-"
pattern_listUpAlpha = "[A-Z]"
pattern_listLowAlpha = "[a-z]"

'configurable variables
parseSheet = "Input"
outputSheet = "Output"

col_list = 2
col_content = 3
outputRow = 1

startRow = Worksheets("Control").Cells(2, 1).Value
lastRow = Worksheets("Control").Cells(2, 2).Value

For i = startRow To lastRow
   
    col_list_str = Worksheets(parseSheet).Cells(i, col_list).Value
    col_content_str = Worksheets(parseSheet).Cells(i, col_content).Value
   
    With regexObject
        .Pattern = "^[\d]-"
    End With
   
    If regexObject.Test(col_list_str) = True Then
        If startList = True Then
            'write output to row, then write current row to output row
            Worksheets(outputSheet).Cells(outputRow, 1).Value = outputStr
            outputRow = outputRow + 1
        End If
       
        outputStr = Worksheets(parseSheet).Cells(i, col_content)
        'write to row
        Worksheets(outputSheet).Cells(outputRow, 1).Value = outputStr
        'increment row
        outputRow = outputRow + 1
        outputStr = ""
        startList = False
    End If
   
    With regexObject
        .Pattern = "^[A-Z]."
    End With
   
    If regexObject.Test(col_list_str) = True Then
        If startList = True Then
            startList = False
        Else
            startList = True
        End If
        outputStr = outputStr & col_list_str & col_content_str & vbNewLine
        'outputRow = outputRow + 1
    End If
   
    If startList = True Then
       
        With regexObject
            .Pattern = "^[a-z]."
        End With
       
        If regexObject.Test(col_list_str) = True Then
            'indent = 1
            outputStr = outputStr & col_list_str & col_content_str & vbNewLine
            'outputRow = outputRow + 1
        End If
       
        With regexObject
            .Pattern = "^[0-9]."
        End With
           
        If regexObject.Test(col_list_str) = True Then
            'indent = indent + 1
            outputStr = outputStr & col_list_str & col_content_str & vbNewLine
            'outputRow = outputRow + 1
        End If
       
       
       
    End If
   
   
   
Next i

If startList = True Then
    Worksheets(outputSheet).Cells(outputRow, 1).Value = outputStr
End If

End Sub

enter image description here


Typical duplication shown below. In response to comments for @Dy.Lee.

Duplicate numbering items in column B

Amber W
  • 33
  • 4
  • 3
    Where does the attribute info come from? The rest isn't too difficult especially if the word "Section" appears in col A for each new section – QHarr Mar 06 '21 at 18:11
  • 1
    Hi, @QHarr, thanks for the help. The attribute data is entered from the Sys Eng team. These attributes will be populated once the rows have been merged. The data will contain items such as Requirement Owners, Product Breakdown data, Compliance data, etc. – Amber W Mar 06 '21 at 18:26
  • Each new section does have a "Section ___" listed. They also have an attribute that can be assigned as "Heading". – Amber W Mar 06 '21 at 18:28
  • 1
    I think you may find existing answers to help you already on StackOverflow though maybe with a different approach from mine. With mine you would work with a dictionary and arrays. Sorry for the number of comments :-) – QHarr Mar 06 '21 at 18:33
  • 1
    Thank you again, this is really helpful info so far! I just attempted the edits to the post you've suggested. My second table didn't turn out so great since I need the data in Col B joined together. – Amber W Mar 06 '21 at 18:42
  • 1
    Just remove the second table. You have the image. – QHarr Mar 06 '21 at 18:54

2 Answers2

3

As I wrote so many comments I decided to post a guide answer and delete my comments i.e. provide some pointers to one possible solution.


Logic:

I would read everything into an array and loop the rows of that. Apply logic that processes rows according to rules e.g. does it start with Section..... is it a Capital letter at start of col 1 of array... is it a number...

A nice idea is probably to use helper functions. E.g.

  • A helper function which grabs the lines upper case after Section down to next upper case (stop before) and returns that text (as an array).
  • Then pass that function return value to another function which handles the alphanumeric based indentation levels (using appropriate Chr$()) etc
  • Another function call to concatenate those values into single string - which will be the value associated with a given key.
  • Then update the dictionary with the key value pair.

This way you can create a dictionary of key A: associated text from last of helper function calls and write out to sheet at end.

You will need additional logic for tables etc. Tables would be added to dictionary as an array with key being value from first column

At the end you can loop and write out to sheet. I think caution is needed with writing out tables as they may take more than one row. In that case, have an additional helper function that calculates last populated row anywhere in sheet (or target column range) and ensure next key:value pair is written out to after the table by adding + 1 to last populated row number.

When you write out the table, you will need to test the dictionary key for it containing 'table', or that its value is an array; you will need to resize the target cell to the size of the array e.g. if dict(key) = results then targetCell.Resize(UBound(results, 1), UBound(results, 2)) = results

Think about how any hyperlinks (where friendly name different from destination)| additional metadata will be passed around.


Supplying sample data:

I think I would suggest you use a markdown table generator to supply the current data (the input data) to save people time having to write out their own data. You can paste from Excel into markdown generator, press generate table, then copy to clipboard that table, use edit to insert into question. Highlight table just pasted in and press Ctrl + K to indent properly.


Regex:

Regex is not required from what I can see. Data resides in separate cells in source.


Tagging:

I think remove the python tag unless you have python code to add.


Resources:

  1. https://bettersolutions.com/vba/strings-characters/builtin-constants.htm

Some useful constants regarding indentation

| VBA.Constants | Chr                   | Comments                                                                                                       |
|---------------|-----------------------|----------------------------------------------------------------------------------------------------------------|
| vbCr          | Chr(13)               | Carriage return character                                                                                      |
| vbLf          | Chr(10)               | Linefeed character                                                                                             |
| vbCrLf        | Chr(13) + Chr(10)     | Carriage return - linefeed combination                                                                         |
| vbNewLine     | Chr(13) + Chr(10)     | New line character                                                                                             |
| vbNullChar    | Chr(0)                | Character having a value of 0.                                                                                 |
| vbNullString  | String having value 0 | Not the same as a zero-length string (""); used for calling external procedures. Cannot be passed to any DLL's |
| vbTab         | Chr(9)                | Tab character                                                                                                  |
| vbBack        | Chr(8)                | Backspace character                                                                                            |
| vbFormFeed    | Chr(12)               | Word VBA Manual - manual page break ?                                                                          |
| vbVerticalTab | Chr(11)               | Word VBA Manual - manual line break (Shift + Enter)                                                            |
  1. Dictionaries -
  2. Passing and returning arrays - http://www.cpearson.com/excel/passingandreturningarrays.htm
  3. Arrays and ranges - http://www.cpearson.com/excel/ArraysAndRanges.aspx
  4. Typed functions - What's the difference between Trim() and Trim$() in VBA? re: Chr$ v Chr.

Alternative approach:

If working with arrays and dictionary (which should be quick) doesn't work for you, the logic regarding how to separate out your chunks/indentations can still apply in a loop over the actual rows in the sheet. You would need to keep track however of where you are writing out to based on last row in destination range (you can use a different last row helper function and pass in the column to use to determine the last populated row in).

QHarr
  • 83,427
  • 12
  • 54
  • 101
1

Try,

Sub setGroup()
    Dim Ws As Worksheet, toWs As Worksheet
    Dim vDB As Variant, vR() As Variant
    Dim vResult(), vRow()
    Dim s As String, s2 As String
    Dim sName As String
    Dim i As Long, j As Long, r As Long
    Dim cnt As Long, n As Long, k As Integer
    Dim st As Long, et As Long
    
    Set Ws = Sheets(1) 'Data Sheet
    Set toWs = Sheets(2) 'Result Sheet
    
    vDB = Ws.Range("a1").CurrentRegion
    r = UBound(vDB, 1)
    
    ReDim vRow(1 To r)
    For i = 2 To r
        s = vDB(i, 1)
        If s Like "[A-Z].*" Or s Like "Section*" Or s Like "Table*" Then
            n = n + 1
            vRow(n) = i
        End If
    Next i
    If vRow(n) < r Then
        ReDim Preserve vRow(1 To n + 1)
        vRow(n + 1) = r
    Else
       ReDim Preserve vRow(1 To n)
    End If
    
    cnt = UBound(vRow)
    
    ReDim vResult(1 To r, 1 To 2)
    n = 0
    sName = vDB(1, 1)
    For j = 1 To cnt - 1
        k = 0
        Erase vR
        st = vRow(j)
        If j = cnt - 1 Then
            et = vRow(j + 1)
        Else
            et = vRow(j + 1) - 1
        End If
  
        For i = st To et
            s = vDB(i, 1)
            s2 = ""
            
            '*** Set spacing for each group
            If s Like "[A-Z].*" Or s Like "Section*" Or s Like "Table*" Then
            Else
                If IsNumeric(s) Then
                    s2 = Space(4) & s & "."
                ElseIf s Like "[a-z].*" And Not (s Like "i*") Then
                    s2 = Space(8) & s
                Else
                    s2 = Space(12) & s
                End If
            End If
                
            'Collect data according to conditions.
            If s Like "Section*" Then
                n = n + 1
                vResult(n, 1) = vDB(i, 1)
                vResult(n, 2) = vDB(i, 2)
            ElseIf s Like "[A-Z].*" Then
                n = n + 1
                k = k + 1
                sName = s
                ReDim Preserve vR(1 To k)
                vR(k) = vDB(i, 2)
            ElseIf s Like "Table*" Then
                n = n + 2
                vResult(n - 1, 1) = vDB(i, 1)
                vResult(n - 1, 2) = "Supplier Contct Infor"
                vResult(n, 2) = vDB(i, 2)
            Else
                k = k + 1
                ReDim Preserve vR(1 To k)
                vR(k) = s2 & " " & vDB(i, 2)
            End If
        Next i
        If k Then
            vResult(n, 1) = sName
            vResult(n, 2) = Join(vR, vbCrLf)
        Else
        End If
    Next j

    With toWs
        .Cells.Clear
        .Range("a1").Resize(n, 2) = vResult
    End With
        
End Sub

Data Image

enter image description here

Result Image

enter image description here

Dy.Lee
  • 7,527
  • 1
  • 12
  • 14
  • Hi Dy.Lee - This works really well with my project data! Thank you so much for your help. I am just seeing one last piece I'm hoping to adjust. When the data rows merge, it seems to create a duplicate of the list numbering data. Any ideas how to prevent the duplication? Ex: C. The following elements of Infrastructure shall be as shown in Figures 1.. 1. the location and alignment of the...including; a. a. number of Tracks; b. b. alignment of Trackway; and... – Amber W Mar 07 '21 at 19:27
  • To summarize the above comment... I am seeing the list numbering data (a, b, c, d; i, ii, iii, iv, etc) being duplicated on the results tab. Is there a way to remove this duplication ? – Amber W Mar 07 '21 at 19:29
  • @AmberW, I didn't understand your question correctly. It would be nice to give an example of the merged cell with an image. – Dy.Lee Mar 07 '21 at 23:09
  • Screenshot of duplication shown at very bottom of post. Thanks! – Amber W Mar 08 '21 at 13:11
  • @AmberW, The image was inserted. In my case it works fine. I don't know if the original data is different from me. – Dy.Lee Mar 08 '21 at 13:49
  • Ahh yes, I've just figured it out! Code works perfectly. My test data (from my project) has a duplicate. I've fixed it. Thank you so much! – Amber W Mar 08 '21 at 15:17
  • @AmberW,Glad it was helpful. – Dy.Lee Mar 08 '21 at 15:20