I am trying to parse certain paragraphs out of multiple text files and store them in a list. All the text files have a format similar to this:

MODEL NUMBER: A123

MODEL INFORMATION: some info about the model

DESCRIPTION: This will be a description of the Model. It 
could be multiple lines but an empty line at the end of each.

CONCLUSION: Sold a lot really profitable.

I can already pull out the information when it is on one line, but I am having trouble with fields that span multiple lines (like 'DESCRIPTION'). The length of the description is not known, but I know it always ends with an empty line (a line containing only '\n'). This is what I have so far:

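import os

directory = 'Test'  # folder containing the .txt files
DESCRIPTION = []
for filename in os.listdir(directory):
    if filename.endswith('.txt'):
        with open(os.path.join(directory, filename)) as f:
            reading = f.readlines()
            for num, line in enumerate(reading):
                if 'DESCRIPTION:' in line:
                    Start_line = num  # line where the description begins
                if len(line.strip()) == 0:
                    pass  # blank line found; not sure what to do from here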

I don't know if it's the best approach, but what I was trying to do with if len(line.strip()) == 0: is to build a list of the blank-line numbers and then find the first value greater than Start_line. I came across the bisect module for that.
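A minimal sketch of that blank-line idea, assuming the standard-library bisect module and a hard-coded list of lines in place of a real file (the sample lines and the name end_line are illustrative, not from my actual code):

```python
import bisect

# Illustrative stand-in for File.readlines() on one of the text files.
lines = [
    "MODEL NUMBER: A123\n",
    "\n",
    "MODEL INFORMATION: some info about the model\n",
    "\n",
    "DESCRIPTION: This will be a description of the Model. It\n",
    "could be multiple lines but an empty line at the end of each.\n",
    "\n",
    "CONCLUSION: Sold a lot really profitable.\n",
]

# Index of the line that holds the DESCRIPTION tag.
start_line = next(i for i, line in enumerate(lines) if 'DESCRIPTION:' in line)

# Sorted indices of every blank line in the file.
blank_lines = [i for i, line in enumerate(lines) if not line.strip()]

# bisect_right returns the position of the first blank-line index > start_line.
pos = bisect.bisect_right(blank_lines, start_line)
end_line = blank_lines[pos] if pos < len(blank_lines) else len(lines)

# Join the description lines into a single string.
description = ' '.join(line.strip() for line in lines[start_line:end_line])
```

Here description holds the tag line plus its continuation line, joined with a space; appending it to a list once per file would give the result shown below.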

In the end, when I print DESCRIPTION, I would like the data to look like this:

['DESCRIPTION: Description from file 1', 
'DESCRIPTION: Description from file 2', 
'DESCRIPTION: Description from file 3']

Thanks.

USER420
  • The easiest way I can think of is to check whether a line begins with a tag (such as DESCRIPTION:); until you reach another known tag, you can assume every following line belongs to that tag's multiline text. – Sanju Sep 15 '16 at 16:23
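The tag-based idea in the comment above could be sketched roughly as follows. The KNOWN_TAGS tuple and the collect_section helper are hypothetical names introduced for illustration, and the tag list is assumed from the sample file in the question:

```python
# Assumed set of tags, taken from the sample layout in the question.
KNOWN_TAGS = ('MODEL NUMBER:', 'MODEL INFORMATION:', 'DESCRIPTION:', 'CONCLUSION:')

def collect_section(lines, wanted='DESCRIPTION:'):
    """Collect the non-blank lines that belong to the wanted tag."""
    collected = []
    current = None  # tag of the section we are currently inside
    for line in lines:
        stripped = line.strip()
        # Switch sections whenever a line starts with a known tag.
        for tag in KNOWN_TAGS:
            if stripped.startswith(tag):
                current = tag
                break
        # Keep non-blank lines while inside the wanted section.
        if current == wanted and stripped:
            collected.append(stripped)
    return ' '.join(collected)

sample_lines = [
    "MODEL NUMBER: A123\n",
    "\n",
    "MODEL INFORMATION: some info about the model\n",
    "\n",
    "DESCRIPTION: This will be a description of the Model. It\n",
    "could be multiple lines but an empty line at the end of each.\n",
    "\n",
    "CONCLUSION: Sold a lot really profitable.\n",
]
result = collect_section(sample_lines)
```

This version does not rely on the blank line at all; it only stops collecting when the next known tag appears, so it would need the tag list kept up to date.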

1 Answer

Use a regular expression. Think about it this way: there is a pattern that lets you cut any of these files into the pieces you want: "a newline followed by a capital letter".

re.split is your friend.

Take a string

"THE
BEST things 
in life are
free
IS
YET
TO
COME"

As a string:

import re

p = "THE\nBEST things\nin life are\nfree\nIS\nYET\nTO\nCOME"
# Split at every newline that is immediately followed by a capital letter.
c = re.split(r'\n(?=[A-Z])', p)

Which produces list c

['THE', 'BEST things\nin life are\nfree', 'IS', 'YET', 'TO', 'COME']

I think you can take it from there. Applied to your files, this splits each one into a list of strings, with each string being its own section, including any continuation lines. From that list you can find the element that starts with "DESCRIPTION" and store it. It is important to note how the regex is set up: it matches the pattern "newline, then capital letter", but only the newline is consumed by the split. The capital letter sits inside the lookahead (?=...), which is why the \n is outside the parentheses, so the letter stays with the section that follows.
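As a rough sketch of "taking it from there", assuming the file contents have already been read into one string (the sample text and the extract_description helper are illustrative, not part of the original question):

```python
import re

# Sample text mirroring the layout in the question (contents are illustrative).
sample = (
    "MODEL NUMBER: A123\n"
    "\n"
    "MODEL INFORMATION: some info about the model\n"
    "\n"
    "DESCRIPTION: This will be a description of the Model. It\n"
    "could be multiple lines but an empty line at the end of each.\n"
    "\n"
    "CONCLUSION: Sold a lot really profitable.\n"
)

def extract_description(text):
    """Return the DESCRIPTION section as one line, or None if it is absent."""
    # Cut the text at every newline that is followed by a capital letter.
    sections = re.split(r'\n(?=[A-Z])', text)
    for section in sections:
        if section.startswith('DESCRIPTION:'):
            # Collapse internal line breaks and extra spaces into single spaces.
            return ' '.join(section.split())
    return None

description = extract_description(sample)
```

One caveat with this pattern: a continuation line that happens to start with a capital letter would also be treated as a new section, so a stricter pattern may be needed for real data.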

mstorkson