
I read a PDF into Python and would like to extract specific paragraphs from it with a regex. To illustrate the case, here is an example.

INTERNATIONAL MONETARY FUND            7\n\x0cBELGIUM\n\n\n\nPOLICY DISCUSSIONS—MAINTAINING THE REFORM\nMOMENTUM\n7.     The current recovery is an opportunity to strengthen the resilience and growth\npotential of the Belgian economy. The government's ability to deal with future shocks will depend\non whether it implements the right policies now while the economy continues to recover.\n\n\uf0b7   First, with public debt above 100 percent of GDP and only starting to come down, Belgium still\n    has a long way to go to rebuild buffers and achieve a more sustainable fiscal position. This will\n    require following through on plans to gradually move toward structural balance.\n\n\uf0b7   Second, with real GDP growth projected at only around 1½ percent for the foreseeable future,\n    further labor and product market reforms are needed to increase productivity growth, raise\n    potential output, and integrate vulnerable groups into the labor market.\n\n\uf0b7   Third, although the financial sector has recovered since the crisis and is generally sound, cyclical\n    vulnerabilities are rising and new challenges are emerging, suggesting the need for vigilance\n    and proactive policies.3\n\n8.      The government agreed last summer on a new package of measures related to\ntaxation, the labor market, and social benefits (Table 2 and Box 1). The most notable reform was\na reduction in Belgium's corporate income tax (CIT) rate from 34 percent to 25 percent, to be\nphased in over the next three years (SMEs will benefit from a reduced rate of 20 percent starting in\n2018). To compensate for the resulting revenue loss, the notional interest rate deduction (NID) was\nmodified to apply only to incremental corporate equity rather than to the total stock, and new anti-\ntax avoidance measures were introduced consistent with Belgium's EU obligations.4 Together, the\nmeasures are designed to enhance Belgium's competitiveness while preserving revenue neutrality.\n\n9.      
Policy discussions focused on the importance of maintaining the reform momentum\nand not yielding to complacency. Achieving the balanced budget goal will require efforts at all\nlevels of government to make spending more efficient and safeguard revenues (Section A).\nA combination of policies and reforms could help raise productivity growth, including increasing\ninvestment in infrastructure and enhancing competition in services (Section B). To fully realize\nBelgium's employment potential, it will be critical to address the severe fragmentation of the labor\nmarket (Section C). To preserve financial stability, the authorities should address vulnerabilities in the\nmortgage market and carefully navigate the transition toward a European Banking Union (Section D).\n\n\n\n\n3\n A comprehensive assessment of Belgium's financial sector took place in 2017 under the Financial Sector\nAssessment Program (FSAP).\n4\n  The NID aims to neutralize the CIT treatment of debt and equity by supplementing the deductibility of interest with\na deduction that is the product of corporate equity and a notional interest rate.\n\n\n8

Each paragraph starts with a number of one or two digits, followed by a dot and three to seven spaces. It ends at the next double newline \n\n that is followed by a number of one or two digits and a dot. Note that this number is also the start of the next paragraph. In the example above, I should find these three paragraphs:

first paragraph:

  7. The current recovery is an opportunity to strengthen the resilience and growth\npotential of the Belgian economy. The government's ability to deal with future shocks will depend\non whether it implements the right policies now while the economy continues to recover.\n\n\uf0b7 First, with public debt above 100 percent of GDP and only starting to come down, Belgium still\n has a long way to go to rebuild buffers and achieve a more sustainable fiscal position. This will\n require following through on plans to gradually move toward structural balance.\n\n\uf0b7 Second, with real GDP growth projected at only around 1½ percent for the foreseeable future,\n further labor and product market reforms are needed to increase productivity growth, raise\n potential output, and integrate vulnerable groups into the labor market.\n\n\uf0b7 Third, although the financial sector has recovered since the crisis and is generally sound, cyclical\n vulnerabilities are rising and new challenges are emerging, suggesting the need for vigilance\n and proactive policies.3\n\n

second paragraph:

  8. The government agreed last summer on a new package of measures related to\ntaxation, the labor market, and social benefits (Table 2 and Box 1). The most notable reform was\na reduction in Belgium's corporate income tax (CIT) rate from 34 percent to 25 percent, to be\nphased in over the next three years (SMEs will benefit from a reduced rate of 20 percent starting in\n2018). To compensate for the resulting revenue loss, the notional interest rate deduction (NID) was\nmodified to apply only to incremental corporate equity rather than to the total stock, and new anti-\ntax avoidance measures were introduced consistent with Belgium's EU obligations.4 Together, the\nmeasures are designed to enhance Belgium's competitiveness while preserving revenue neutrality.\n\n

and finally the third:

  9. Policy discussions focused on the importance of maintaining the reform momentum\nand not yielding to complacency. Achieving the balanced budget goal will require efforts at all\nlevels of government to make spending more efficient and safeguard revenues (Section A).\nA combination of policies and reforms could help raise productivity growth, including increasing\ninvestment in infrastructure and enhancing competition in services (Section B). To fully realize\nBelgium's employment potential, it will be critical to address the severe fragmentation of the labor\nmarket (Section C). To preserve financial stability, the authorities should address vulnerabilities in the\nmortgage market and carefully navigate the transition toward a European Banking Union (Section D).\n\n\n\n\n3\n A comprehensive assessment of Belgium's financial sector took place in 2017 under the Financial Sector\nAssessment Program (FSAP).\n4\n The NID aims to neutralize the CIT treatment of debt and equity by supplementing the deductibility of interest with\na deduction that is the product of corporate equity and a notional interest rate.\n\n

I tried the following regex: r'(?m)[0-99].*[.] {3,7} (.*?) \n\n', with the intent of selecting everything from the start to the end:

  1. (?m)[0-99].*[.] {3,7}: to identify the beginning, on each line separately.
  2. \n\n: to specify the end.

However, it doesn't match anything.

math
  • If you think `[0-99]` matches numbers from `0` to `99`, you are [wrong](https://stackoverflow.com/questions/3148240/why-doesnt-01-12-range-work-as-expected). You may replace that with `\d\d?`. `re.M` (`(?m)`) modifies `^` and `$`, but you do not have them in the pattern. You must have wanted to use `(?s)`. Try `r'(?sm)^\d\d?\. {3,7}(.*?)(?:\n\n|\Z)'`, see [the regex demo](https://regex101.com/r/auWfI0/2). – Wiktor Stribiżew Nov 20 '18 at 13:41
  • Can you provide the raw input? – Edilson Borges Nov 20 '18 at 13:42

1 Answer


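The [0-99] pattern is erroneous since it is a character class that matches any single digit from 0 to 9, not a number from 0 to 99. See [Why doesn't [01-12] range work as expected?](https://stackoverflow.com/questions/3148240/why-doesnt-01-12-range-work-as-expected). The re.M ((?m)) flag only modifies the ^ and $ anchors, but you have neither in the pattern. Also, without re.S ((?s)), . does not match a newline, so (.*?) cannot span a multi-line paragraph.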
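Both points can be checked quickly. The following is only a small sketch; the string `text` is a made-up miniature of the PDF layout, not the real input.

```python
import re

# [0-99] is a character class (the range 0-9 plus a redundant 9),
# so it matches exactly one digit and never a two-digit number.
one_digit = re.fullmatch(r'[0-99]', '7') is not None
two_digits = re.fullmatch(r'[0-99]', '42') is not None
fixed = re.fullmatch(r'\d\d?', '42') is not None
print(one_digit, two_digits, fixed)  # True False True

# Made-up sample: without re.S, "." stops at a newline,
# so (.*?) cannot reach the blank line that ends the paragraph.
text = "7.     first line\nsecond line\n\n8.     next"
no_dotall = re.search(r'^\d\d?\. {3,7}(.*?)\n\n', text, re.M)
with_dotall = re.search(r'^\d\d?\. {3,7}(.*?)\n\n', text, re.M | re.S)
print(no_dotall)                    # None
print(repr(with_dotall.group(1)))   # 'first line\nsecond line'
```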

You may use

r'(?sm)^\d\d?\. {3,7}(.*?)(?=\n\n\d\d?\. |\Z)'

See the regex demo.

Details

  • `(?sm)` - re.DOTALL and re.MULTILINE options enabled
  • `^` - start of a line
  • `\d\d?` - 1 or 2 digits (0 to 99)
  • `\.` - a dot
  • ` {3,7}` - 3 to 7 spaces (replace with `[^\S\r\n]{3,7}` to match any horizontal whitespace)
  • `(.*?)` - Group 1: any 0+ chars, as few as possible
  • `(?=\n\n\d\d?\. |\Z)` - a positive lookahead: a location immediately followed by two newline chars (`\n\n`), then 1 or 2 digits (`\d\d?`), a dot and a space, or (`|`) by the end of the whole string (`\Z`). Because a lookahead does not consume text, the number that ends one paragraph is still available as the start of the next match.
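To see why the terminator is a lookahead rather than a consuming group, here is a small sketch on a made-up three-paragraph string (`s` is an illustration, not the PDF text). With a consuming group, the number that ends one paragraph is swallowed, so the paragraph it starts is skipped.

```python
import re

# Hypothetical miniature input with the same "N.    text\n\n" layout.
s = "7.    alpha\n\n8.    beta\n\n9.    gamma"

consuming = r'(?sm)^\d\d?\. {3,7}(.*?)(?:\n\n\d\d?\. |\Z)'
lookahead = r'(?sm)^\d\d?\. {3,7}(.*?)(?=\n\n\d\d?\. |\Z)'

consumed = re.findall(consuming, s)
looked = re.findall(lookahead, s)

print(consumed)  # ['alpha', 'gamma']  ("8. " was eaten by the first match)
print(looked)    # ['alpha', 'beta', 'gamma']
```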

Python demo:

import re
s="INTERNATIONAL MONETARY FUND            7\n\x0cBELGIUM\n\n\n\nPOLICY DISCUSSIONS—MAINTAINING THE REFORM\nMOMENTUM\n7.     The current recovery is an opportunity to strengthen the resilience and growth\npotential of the Belgian economy. The government's ability to deal with future shocks will depend\non whether it implements the right policies now while the economy continues to recover.\n\n\uf0b7   First, with public debt above 100 percent of GDP and only starting to come down, Belgium still\n    has a long way to go to rebuild buffers and achieve a more sustainable fiscal position. This will\n    require following through on plans to gradually move toward structural balance.\n\n\uf0b7   Second, with real GDP growth projected at only around 1½ percent for the foreseeable future,\n    further labor and product market reforms are needed to increase productivity growth, raise\n    potential output, and integrate vulnerable groups into the labor market.\n\n\uf0b7   Third, although the financial sector has recovered since the crisis and is generally sound, cyclical\n    vulnerabilities are rising and new challenges are emerging, suggesting the need for vigilance\n    and proactive policies.3\n\n8.      The government agreed last summer on a new package of measures related to\ntaxation, the labor market, and social benefits (Table 2 and Box 1). The most notable reform was\na reduction in Belgium's corporate income tax (CIT) rate from 34 percent to 25 percent, to be\nphased in over the next three years (SMEs will benefit from a reduced rate of 20 percent starting in\n2018). To compensate for the resulting revenue loss, the notional interest rate deduction (NID) was\nmodified to apply only to incremental corporate equity rather than to the total stock, and new anti-\ntax avoidance measures were introduced consistent with Belgium's EU obligations.4 Together, the\nmeasures are designed to enhance Belgium's competitiveness while preserving revenue neutrality.\n\n9.      
Policy discussions focused on the importance of maintaining the reform momentum\nand not yielding to complacency. Achieving the balanced budget goal will require efforts at all\nlevels of government to make spending more efficient and safeguard revenues (Section A).\nA combination of policies and reforms could help raise productivity growth, including increasing\ninvestment in infrastructure and enhancing competition in services (Section B). To fully realize\nBelgium's employment potential, it will be critical to address the severe fragmentation of the labor\nmarket (Section C). To preserve financial stability, the authorities should address vulnerabilities in the\nmortgage market and carefully navigate the transition toward a European Banking Union (Section D).\n\n\n\n\n3\n A comprehensive assessment of Belgium's financial sector took place in 2017 under the Financial Sector\nAssessment Program (FSAP).\n4\n  The NID aims to neutralize the CIT treatment of debt and equity by supplementing the deductibility of interest with\na deduction that is the product of corporate equity and a notional interest rate.\n\n\n8"
for r in re.findall(r'(?sm)^\d\d?\. {3,7}(.*?)(?=\n\n\d\d?\. |\Z)', s):
    print(r, "\n---------")

Output:

The current recovery is an opportunity to strengthen the resilience and growth
potential of the Belgian economy. The government's ability to deal with future shocks will depend
on whether it implements the right policies now while the economy continues to recover.

   First, with public debt above 100 percent of GDP and only starting to come down, Belgium still
    has a long way to go to rebuild buffers and achieve a more sustainable fiscal position. This will
    require following through on plans to gradually move toward structural balance.

   Second, with real GDP growth projected at only around 1½ percent for the foreseeable future,
    further labor and product market reforms are needed to increase productivity growth, raise
    potential output, and integrate vulnerable groups into the labor market.

   Third, although the financial sector has recovered since the crisis and is generally sound, cyclical
    vulnerabilities are rising and new challenges are emerging, suggesting the need for vigilance
    and proactive policies.3 
---------
The government agreed last summer on a new package of measures related to
taxation, the labor market, and social benefits (Table 2 and Box 1). The most notable reform was
a reduction in Belgium's corporate income tax (CIT) rate from 34 percent to 25 percent, to be
phased in over the next three years (SMEs will benefit from a reduced rate of 20 percent starting in
2018). To compensate for the resulting revenue loss, the notional interest rate deduction (NID) was
modified to apply only to incremental corporate equity rather than to the total stock, and new anti-
tax avoidance measures were introduced consistent with Belgium's EU obligations.4 Together, the
measures are designed to enhance Belgium's competitiveness while preserving revenue neutrality. 
---------
Policy discussions focused on the importance of maintaining the reform momentum
and not yielding to complacency. Achieving the balanced budget goal will require efforts at all
levels of government to make spending more efficient and safeguard revenues (Section A).
A combination of policies and reforms could help raise productivity growth, including increasing
investment in infrastructure and enhancing competition in services (Section B). To fully realize
Belgium's employment potential, it will be critical to address the severe fragmentation of the labor
market (Section C). To preserve financial stability, the authorities should address vulnerabilities in the
mortgage market and carefully navigate the transition toward a European Banking Union (Section D).




3
 A comprehensive assessment of Belgium's financial sector took place in 2017 under the Financial Sector
Assessment Program (FSAP).
4
  The NID aims to neutralize the CIT treatment of debt and equity by supplementing the deductibility of interest with
a deduction that is the product of corporate equity and a notional interest rate.


8 
---------
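If the paragraph numbers are needed as well, one option is to capture them in a second group and use `re.finditer`. This is only a sketch under assumptions: `sample` is a hypothetical shortened input, and the names `pattern` and `paragraphs` are illustrative.

```python
import re

# Hypothetical shortened input in the same layout as the PDF text.
sample = (
    "HEADER\n"
    "7.     First paragraph,\ncontinued on a second line.\n\n"
    "\uf0b7   A bullet that belongs to paragraph 7.\n\n"
    "8.      Second paragraph.\n\n\n8"
)

# Group 1: the paragraph number, group 2: the paragraph body.
pattern = re.compile(r'^(\d\d?)\. {3,7}(.*?)(?=\n\n\d\d?\. |\Z)', re.S | re.M)

# Map each paragraph number to its body, trimmed of surrounding whitespace.
paragraphs = {int(m.group(1)): m.group(2).strip() for m in pattern.finditer(sample)}

for number, body in paragraphs.items():
    print(number, repr(body))
```

As in the demo above, the last paragraph runs to the end of the string, so trailing material such as a page number stays in its body.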
Wiktor Stribiżew
  • Many thanks for your answer. However, if I paste it there, I don't get any match; see the demo https://regex101.com/ – math Nov 20 '18 at 13:48
  • @math I already did it for you - see [this demo](https://regex101.com/r/auWfI0/2/). And [here is a Python demo](https://rextester.com/PVC70940). – Wiktor Stribiżew Nov 20 '18 at 13:49
  • I've just noticed that my ending condition was not correct. I will change it above. It should be `\n\n` followed by a number of one or two digits and a dot. Otherwise we don't select the complete first paragraph, as it contains sequences like `\n\n\uf0b7`. I tried to change your solution to `r'(?sm)^\d\d?\. {3,7}(.*?)(?:\n\n\d\d?\. |\Z)'`, but then the last paragraph is not selected – math Nov 20 '18 at 14:02
  • Because otherwise it will end at `.\n\n\uf0b7` in the first paragraph, which is not correct – math Nov 20 '18 at 14:12
  • @math `\uf0b7` is a private-use character (Unicode category Co, "Other"), it is not a digit. If you need to match ASCII digits only, use `[0-9]` instead of `\d`. – Wiktor Stribiżew Nov 20 '18 at 14:14
  • many thanks for your patience. Using `r'(?sm)^\d\d?\. {3,7}(.*?)(?:\n\n\d\d?\. |\Z)'` will not select the last paragraph. I think it is because the ending of the second paragraph should also be the starting of the third. Many thanks again. Really appreciate it! – math Nov 20 '18 at 14:18
  • @math I updated the link in the answer with the solution. I do not know if that is what you need, but it now follows the new requirements. – Wiktor Stribiżew Nov 20 '18 at 14:19
  • Unfortunately not in the Python code. Weird that the regex demo works but the Python demo does not. There the first paragraph is truncated. Again, thanks for your patience – math Nov 20 '18 at 14:23
  • @math Here is the same solution as posted above: https://rextester.com/AZJIH30806. What is truncated here? – Wiktor Stribiżew Nov 20 '18 at 14:33