I have a list of strings which looks like:
string = ['1.7 DELIVERY, STORAGE AND HANDLING \n', ' \n', 'A. Delivery and Acceptance Requirements: \n', '1. Do not deliver items to the site, until all specified submittals have been \n', 'submitted to, and approved by, the Architect. \n', '2. Deliver materials in original packages, containers or bundles bearing brand name \n', 'and identification of manufacturer or supplier. \n', ' \n', 'B. Storage and Handling Requirements: \n', "1. Store and handle materials following manufacturer's recommended procedures, \n", 'and in accordance with material safety data sheets. \n', '2. Protect materials from damage due to moisture, direct sunlight, excessive \n', 'temperatures, surface contamination, corrosion and damage from \n', 'construction operations and other causes. \n', ' \n', 'C. Damaged material: Remove any damaged or contaminated materials from job site \n', 'immediately, including materials in packages containing water marks, or show \n', 'evidence of mold. \n', ' \n']
I want to extract the sections labelled with letters (A-Z) and their corresponding sub-sections labelled with numbers (which can range from 1 to 20). I have written a script that finds the section lines:
import re

regex = r"\b([A-Z]\s*\.\s*)\b"
for index, new_string in enumerate(string):
    match = re.search(regex, new_string)
    if match:
        print(index)  # index of each element that contains a section marker
The problem is that I'm also getting unwanted matches inside a section. For example, the string below starts with section 'A', but the 'B.' in the middle of the line is picked up as a section as well.
"A. General: Notify the Architect B. where conflicts apply between referenced standards and existing materials, and existing methods of construction. \n"
I want the output as a dictionary with the sections as keys and lists of sub-sections as values. I also want to join sections and sub-sections that get carried over to the next string because of the OCR output. The ' \n' elements in the list have no significance: sometimes there are many of them, sometimes none. So I want the regex to match sections by letter and sub-sections by number only.
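One possible way to express "letters for sections, numbers for sub-sections only" is to anchor both patterns at the start of the line, so that a marker in the middle of a sentence is ignored. The sketch below is an assumption about what is wanted, not a definitive fix; the names SECTION_RE and SUBSECTION_RE are made up for illustration.

```python
import re

# Hypothetical pattern names. Both patterns are anchored at the start of the line.
SECTION_RE = re.compile(r"^\s*([A-Z])\s*\.\s+")            # "A. ", "B. ", ...
SUBSECTION_RE = re.compile(r"^\s*([1-9]|1\d|20)\s*\.\s+")  # "1. " up to "20. "

line = ("A. General: Notify the Architect B. where conflicts apply between "
        "referenced standards and existing materials, and existing methods "
        "of construction. \n")

# The unanchored pattern from the question matches the mid-line "B. " as well.
loose = re.findall(r"\b([A-Z]\s*\.\s*)\b", line)

# re.match only tries the beginning of the string, so "B." is never considered.
section_letter = SECTION_RE.match(line).group(1)

# A numbered line matches the sub-section pattern; the "1.7" heading does not,
# because the dot must be followed by whitespace.
sub_number = SUBSECTION_RE.match(
    "2. Deliver materials in original packages, containers or bundles bearing brand name \n"
).group(1)
heading = SUBSECTION_RE.match("1.7 DELIVERY, STORAGE AND HANDLING \n")

print(loose, section_letter, sub_number, heading)
```

A continuation line that happens to begin with something like "B. " would still be treated as a new section, so this only helps when markers inside running text do not fall at the start of an OCR line.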
Example output -
{
'A. Delivery and Acceptance Requirements: ' : ["1. Do not deliver items to the site, until all specified submittals have been submitted to, and approved by, the Architect. \n","2. Deliver materials in original packages, containers or bundles bearing brand name and identification of manufacturer or supplier."],
'B. Storage and Handling Requirements: ' : ["1. Store and handle materials following manufacturer's recommended procedures, and in accordance with material safety data sheets. ", and so on..]
}
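A minimal sketch of how such a dictionary could be built, assuming section and sub-section markers always appear at the start of a list element and every other non-blank element is a continuation of the previous one. The function name parse_sections and the pattern names are hypothetical, and whitespace is normalized, so the keys and values come out without the trailing ' \n' shown in the example output.

```python
import re

SECTION_RE = re.compile(r"^[A-Z]\s*\.\s+")                # "A. ", "B. ", ...
SUBSECTION_RE = re.compile(r"^(?:[1-9]|1\d|20)\s*\.\s+")  # "1. " up to "20. "

def parse_sections(lines):
    """Group numbered sub-sections under lettered sections, joining wrapped lines."""
    sections = {}
    key = None       # heading text of the section currently being read
    items = []       # sub-sections collected for that section
    in_item = False  # True once the current section has a sub-section
    for raw in lines:
        text = " ".join(raw.split())  # collapse whitespace, drop the trailing "\n"
        if not text:
            continue                  # blank ' \n' elements carry no meaning
        if SECTION_RE.match(text):
            if key is not None:
                sections[key] = items
            key, items, in_item = text, [], False
        elif key is None:
            continue                  # text before the first section, e.g. "1.7 ..."
        elif SUBSECTION_RE.match(text):
            items.append(text)
            in_item = True
        elif in_item:
            items[-1] += " " + text   # wrapped sub-section line
        else:
            key += " " + text         # wrapped section heading
    if key is not None:
        sections[key] = items
    return sections

sample = [
    '1.7 DELIVERY, STORAGE AND HANDLING \n', ' \n',
    'A. Delivery and Acceptance Requirements: \n',
    '1. Do not deliver items to the site, until all specified submittals have been \n',
    'submitted to, and approved by, the Architect. \n',
    '2. Deliver materials in original packages, containers or bundles bearing brand name \n',
    'and identification of manufacturer or supplier. \n', ' \n',
    'B. Storage and Handling Requirements: \n',
    "1. Store and handle materials following manufacturer's recommended procedures, \n",
    'and in accordance with material safety data sheets. \n',
]
result = parse_sections(sample)
for heading, subs in result.items():
    print(heading, subs)
```

With this approach a section that has no numbered sub-sections (such as 'C. Damaged material: ...') ends up with its whole text in the key and an empty list as the value; whether that is the desired shape is a design choice.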