I previously posted a similar question about extracting text in Python with regular expressions, but I now have a separate issue with non-greedy quantifiers, so I am asking again with a slightly different example. I need to extract every portion of a string that lies between two specific matches. Here is an example text:

example = """
    The Bank does offer a hybrid loan. Hybrid loans are loans that start as a 
    fixed rate mortgage but after a set number of years automatically adjust 
    to an adjustable rate mortgage. The Bank offers a three year fixed rate mortgage 
    after which the interest rate will adjust annually. Item 1. Business 3-13 Item 1a. 
    Risk Factors 13-15 Item 1b. Unresolved Staff Comments 15 Item 2. Properties 15-16
    The forward-looking statements are made as of the date of this report,
    and the Company assumes no obligation to update the forward-looking statements 
    or to update the reasons why actual results could differ from those projected 
    in the forward-looking statements. PART 1. ITEM 1. BUSINESS 
    General Farmers & Merchants Bancorp, Inc. (Company) is a bank holding company 
    incorporated under the laws of Ohio in 1985 and elected to become a financial 
    holding company under the Federal Reserve in 2014. Our primary subsidiary, 
    The Farmers & Merchants State Bank (Bank) is a\n community bank operating 
    in Northwest Ohio since 1897.ITEM 2. PROPERTIES Our principal office is located in Archbold, Ohio.
    The Bank operates from the facilities at 307 North Defiance Street. 
    In addition, the Bank owns the property from 200 to 208 Ditto Street, 
    Archbold, Ohio, which it uses for Bank parking and a community mini-park area.
    """

I would like to extract the portions of the text between a start match 'ITEM 1.' and an end match 'ITEM 2.', so the final results should look like this:

final_result_1 = """
    ITEM 1. BUSINESS 
    General Farmers & Merchants Bancorp, Inc. (Company) is a bank holding company 
    incorporated under the laws of Ohio in 1985 and elected to become a financial 
    holding company under the Federal Reserve in 2014. Our primary subsidiary, 
    The Farmers & Merchants State Bank (Bank) is a\n community bank operating 
    in Northwest Ohio since 1897.
    """

final_result_2 = """
    Item 1. Business 3-13 Item 1a. 
    Risk Factors 13-15 Item 1b. Unresolved Staff Comments 15
    """

The final results should be ordered by text length, longest first: 'final_result_1' is the longer of the two portions and 'final_result_2' is the shorter. You could refer to the answers to my previous question. Thank you in advance!

krcoder
  • I would love to help, but this question is pretty confusing. Could you create some shorter sample text and explain a little more what you want for output? – Liam Bohl Jul 03 '17 at 04:06
  • @krcoder, You need to exclude the `ITEM 2` from the text as well right? – hridayns Jul 03 '17 at 04:08
  • @code_byter, That's true, as well as 'Item 2' to be excluded for the 'final_result_2'. – krcoder Jul 03 '17 at 04:11
  • @Liam Bohl, As I mentioned above, the specific issue with the example text is that a non-greedy match is needed, because there are multiple start ('item 1') and end ('item 2') matches throughout the text. To be specific, in the example text above, on the fourth line, there is the first start match starting with 'Item 1. Business 3-13 ...', and on the fifth line, there is the first end match starting with 'Item 2. Properties 15-16 ...'. – krcoder Jul 03 '17 at 04:23
  • @Liam Bohl, And on the ninth line, there is the second start match starting with 'ITEM 1. BUSINESS General ...', and on the fourteenth line, there is the second end match starting with 'ITEM 2. PROPERTIES Our principal'. – krcoder Jul 03 '17 at 04:23
  • @krcoder Please do check my answer now. I think it's what you want – hridayns Jul 03 '17 at 04:31
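The point made in the comments, that the text contains several start and end markers, can be checked with a short sketch. The `sample` string and the `marker` pattern are assumptions made up for illustration; they are not part of the original question.

```python
import re

# Shortened, hypothetical stand-in for the question's text.
sample = (
    "intro. Item 1. Business 3-13 Item 1a. Risk Factors 13-15 "
    "Item 2. Properties 15-16 more text. PART 1. ITEM 1. BUSINESS "
    "General text.ITEM 2. PROPERTIES Our principal office."
)

# Match "item 1." or "item 2." in any case; the escaped dot keeps "Item 1a." out.
marker = re.compile(r"item\s+([12])\.", re.IGNORECASE)

# Collect (kind, offset, matched text) for every marker, in document order.
markers = [
    ("start" if m.group(1) == "1" else "end", m.start(), m.group(0))
    for m in marker.finditer(sample)
]

for kind, offset, text in markers:
    print(f"{kind:5} at {offset:3}: {text!r}")
```

The markers alternate start, end, start, end, so each start has to be paired with the nearest following end rather than with the last one in the text.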

1 Answer


I believe you need to use

import re
example = """
    The forward-looking statements are made as of the date of this report,
    and the Company assumes no obligation to update the forward-looking statements 
    or to update the reasons why actual results could differ from those projected 
    in the forward-looking statements. PART 1. ITEM 1. BUSINESS 
    General Farmers & Merchants Bancorp, Inc. (Company) is a bank holding company 
    incorporated under the laws of Ohio in 1985 and elected to become a financial 
    holding company under the Federal Reserve in 2014. Our primary subsidiary, 
    The Farmers & Merchants State Bank (Bank) is a community bank operating 
    in Northwest Ohio since 1897.ITEM 2. PROPERTIES Our principal office is located in Archbold, Ohio.
    The Bank operates from the facilities at 307 North Defiance Street. 
    In addition, the Bank owns the property from 200 to 208 Ditto Street, 
    Archbold, Ohio, which it uses for Bank parking and a community mini-park area.
"""
matches = re.findall(r'(ITEM\ 1[\s\S]*)ITEM\ 2', example, re.IGNORECASE)
# matches is a list of all captured strings. Sort it by the length of each string.
matches.sort(key=len, reverse=True)
# matches is now ordered from the longest string to the shortest.

EDIT: (Not what OP wants)

import re
example = """
    The forward-looking statements are made as of the date of this report,
    and the Company assumes no obligation to update the forward-looking statements 
    or to update the reasons why actual results could differ from those projected 
    in the forward-looking statements. PART 1. ITEM 1. BUSINESS 
    General Farmers & Merchants Bancorp, Inc. (Company) is a bank holding company 
    incorporated under the laws of Ohio in 1985 and elected to become a financial 
    holding company under the Federal Reserve in 2014. Our primary subsidiary, 
    The Farmers & Merchants State Bank (Bank) is a community bank operating 
    in Northwest Ohio since 1897.ITEM 2. PROPERTIES Our principal office is located in Archbold, Ohio.
    The Bank operates from the facilities at 307 North Defiance Street. 
    In addition, the Bank owns the property from 200 to 208 Ditto Street, 
    Archbold, Ohio, which it uses for Bank parking and a community mini-park area.
"""
pat = re.compile(r'(ITEM\ 1[\s\S]*)ITEM\ 2', re.IGNORECASE)
matches = pat.findall(example)
print(matches)
# matches is a list of all captured strings. Sort it by the length of each string.
matches.sort(key=len, reverse=True)
# matches is now ordered from the longest string to the shortest.
print(matches)

Code tested
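A minimal sketch of why the greedy pattern above falls short once the text has more than one start and end marker. The `sample` string is a made-up stand-in for the real text, and the patterns drop the optional backslash before the space:

```python
import re

# Hypothetical short text with two start markers and two end markers.
sample = "x Item 1. toc Item 2. y ITEM 1. body ITEM 2. z"

# Greedy: [\s\S]* runs to the end of the text and backtracks to the LAST "ITEM 2".
greedy = re.findall(r"(ITEM 1[\s\S]*)ITEM 2", sample, re.IGNORECASE)

# Non-greedy: [\s\S]*? stops at the FIRST "ITEM 2" after each start marker.
lazy = re.findall(r"(ITEM 1[\s\S]*?)ITEM 2", sample, re.IGNORECASE)

print(greedy)  # ['Item 1. toc Item 2. y ITEM 1. body ']
print(lazy)    # ['Item 1. toc ', 'ITEM 1. body ']
```

The greedy version returns a single span that swallows the first end marker and the second start marker, while the non-greedy version returns one span per start/end pair.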

FINAL EDIT:

import re
example = """
    The Bank does offer a hybrid loan. Hybrid loans are loans that start as a 
    fixed rate mortgage but after a set number of years automatically adjust 
    to an adjustable rate mortgage. The Bank offers a three year fixed rate mortgage 
    after which the interest rate will adjust annually. Item 1. Business 3-13 Item 1a. 
    Risk Factors 13-15 Item 1b. Unresolved Staff Comments 15 Item 2. Properties 15-16
    The forward-looking statements are made as of the date of this report,
    and the Company assumes no obligation to update the forward-looking statements 
    or to update the reasons why actual results could differ from those projected 
    in the forward-looking statements. PART 1. ITEM 1. BUSINESS 
    General Farmers & Merchants Bancorp, Inc. (Company) is a bank holding company 
    incorporated under the laws of Ohio in 1985 and elected to become a financial 
    holding company under the Federal Reserve in 2014. Our primary subsidiary, 
    The Farmers & Merchants State Bank (Bank) is a\n community bank operating 
    in Northwest Ohio since 1897.ITEM 2. PROPERTIES Our principal office is located in Archbold, Ohio.
    The Bank operates from the facilities at 307 North Defiance Street. 
    In addition, the Bank owns the property from 200 to 208 Ditto Street, 
    Archbold, Ohio, which it uses for Bank parking and a community mini-park area.
"""
pat = re.compile(r'(ITEM\ 1[\s\S]*?)ITEM\ 2', re.IGNORECASE)
matches = pat.findall(example)
print(matches)
# matches is a list of all captured strings. Sort it by the length of each string.
matches.sort(key=len, reverse=True)
# matches is now ordered from the longest string to the shortest.

#To check if it works:
for match in matches:
    print(match)
    print('\n')

Why don't you try it now? :)
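As a possible refinement (a sketch under my own assumptions, not something the OP asked for): the pattern above also matches "Item 10" or "Item 1a" as a start marker, and it leaves surrounding whitespace in the results. The hypothetical helper `extract_sections` below requires the literal dot after the number, uses a lookahead so the end marker is not consumed, and strips each result. The `sample` text is a shortened stand-in for the real example.

```python
import re

def extract_sections(text, start=r"ITEM\s+1\.", end=r"ITEM\s+2\."):
    """Return each span from a start marker up to the next end marker, longest first."""
    # .*? with DOTALL is non-greedy across newlines; the lookahead keeps the
    # end marker out of the match.
    pattern = re.compile(rf"{start}.*?(?={end})", re.IGNORECASE | re.DOTALL)
    sections = [m.group(0).strip() for m in pattern.finditer(text)]
    return sorted(sections, key=len, reverse=True)

# Hypothetical shortened sample, not the full text from the question.
sample = (
    "rate mortgage. Item 1. Business 3-13 Item 1a. Risk Factors 13-15 "
    "Item 2. Properties 15-16 statements. PART 1. ITEM 1. BUSINESS General "
    "Farmers & Merchants Bancorp, Inc. is a bank holding company.ITEM 2. PROPERTIES"
)

for section in extract_sections(sample):
    print(section)
```

Whether the stricter markers are appropriate depends on how consistently the source documents write "Item 1." and "Item 2.", so treat the default patterns as a starting point.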

hridayns
  • Thank you for the answer, but this is not what I expected. As I mentioned above, the specific issue with the example text is that a non-greedy match is needed, because there are multiple start ('item 1') and end ('item 2') matches throughout the text. – krcoder Jul 03 '17 at 04:17
  • I see. Let me see what I can do. – hridayns Jul 03 '17 at 04:19
  • To be specific, in the example text above, on the fourth line, there is the first start match starting with 'Item 1. Business 3-13 ...', and on the fifth line, there is the first end match starting with 'Item 2. Properties 15-16 ...'. – krcoder Jul 03 '17 at 04:20
  • And on the ninth line, there is the second start match starting with 'ITEM 1. BUSINESS General ...', and on the fourteenth line, there is the second end match starting with 'ITEM 2. PROPERTIES Our principal'. – krcoder Jul 03 '17 at 04:21
  • Yeah. I understood already from the question. I just forgot about the non greedy part while testing it out. – hridayns Jul 03 '17 at 04:23
  • I changed the matching pattern from `(ITEM\ 1[\s\S]*)ITEM\ 2` to `(ITEM\ 1[\s\S]*?)ITEM\ 2`. Notice the `?` added – hridayns Jul 03 '17 at 04:26
  • Thank you so much for your prompt answer! – krcoder Jul 03 '17 at 04:34