0

I'm trying to get the number 811.00 when it's placed under the word Size.

I know how to get the number when it's near some word, like "Jerusalem" in this case.
But here I'm trying to get the number when it's under the word Size.

Property Size
Jerusalem 811.00
A new property agreement

Thanks, I couldn't find any solution for this.

bobble bubble

2 Answers

3

This can be accomplished with a technique introduced in vertical regex matching. It requires a regex flavor that supports possessive quantifiers and forward references, such as PCRE or Java.

I don't know if it's worth the effort, but it's certainly an interesting task for regex. The biggest challenge I found was keeping the start of the number in the line below within the boundaries of the word above, on both the left and the right. In the following pattern I tried to catch only full numbers and prevent any partial matching.

^(?:.(?=.*\n(\1?+.)))*?(?=Size)(?:\w\B(?=.*\n\1?+(\2?+\D)))*+.*\n\1?+\2?+(?<![\d.])([\d.]+)
regex-part explained:
^(?:.(?=.*\n(\1?+.)))*?(?=Size) captures the substring of the line below, up to the column where the word above starts, into $1
  • the first group grows by one character at each repetition
(?:\w\B(?=.*\n\1?+(\2?+\D)))*+ captures any non-digits in the line below, up to the length of the word above, into $2
  • \B (non-word boundary) prevents stepping past the end of the word
.*\n\1?+\2?+(?<![\d.])([\d.]+) consumes what was captured and captures the number into $3
  • the negative lookbehind prevents matching numbers partially

See this demo at regex101 or a PHP demo at tio.run - The number will be found in the third group.
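As a rough sketch of how the pattern might be used from Java (one of the flavors named above), the following wraps it in a hypothetical helper, `numberBelow`, and reads the third group. The class and method names, the use of `Pattern.quote` to splice in the header word, and the `MULTILINE` flag (so that `^` matches at each line start) are assumptions of this sketch and are not taken from the linked demos.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class VerticalMatch {

    // Hypothetical helper: returns the number located below `header`, if any.
    static Optional<String> numberBelow(String header, String text) {
        String regex =
              // $1: prefix of the next line, as long as the text left of the header
              "^(?:.(?=.*\\n(\\1?+.)))*?(?=" + Pattern.quote(header) + ")"
              // $2: non-digits of the next line located under the header word
            + "(?:\\w\\B(?=.*\\n\\1?+(\\2?+\\D)))*+"
              // skip to the next line, consume $1 and $2, capture the number as $3
            + ".*\\n\\1?+\\2?+(?<![\\d.])([\\d.]+)";

        // Assumption: MULTILINE so that ^ can anchor at the start of any line.
        Matcher m = Pattern.compile(regex, Pattern.MULTILINE).matcher(text);
        return m.find() ? Optional.of(m.group(3)) : Optional.empty();
    }

    public static void main(String[] args) {
        String sample = "Property Size\n"
                      + "Jerusalem 811.00\n"
                      + "A new property agreement\n";
        System.out.println(numberBelow("Size", sample).orElse("(no match)"));
    }
}
```

The sketch assumes lines are separated by a plain `\n`; input with `\r\n` line endings would need to be normalized first.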

It also works in .NET by replacing the possessive quantifiers with atomic groups (C# demo).
In Notepad++, ([\d.]+) can be replaced with \K[\d.]+ to reset the match start so that only the number is matched.


More on how it works can be found in this answer about matching a letter below another.

bobble bubble
2

One solution would be to count the index of 'Size' within the first header row of the output and then use that information to extract the value under 'Size':

(?<=(\w\s){1}?)(\d+.\d+)

In the example you provided, 'Size' is the second attribute in the row, so there is one word and a space preceding the value you want: (\w\s){1}. We also know that the value is a decimal: (\d+.\d+). If there were 3 attributes, you would replace the 1 with a 2, and so on.

Note: this solution assumes that every value under each attribute is a single word.
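For illustration, here is a small Java sketch that applies the pattern above exactly as written and collects every match. The class name, the helper `numbersAfterWord`, and the extra `Haifa 63.50` row are hypothetical additions for this example. The lookbehind only inspects the character and whitespace directly before the number, so the header row itself is never consulted.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LookbehindMatch {

    // Hypothetical helper: collects every number the pattern finds in `text`.
    static List<String> numbersAfterWord(String text) {
        // Pattern copied verbatim from the answer; group 1 sits inside the
        // lookbehind, so the number itself is group 2.
        Matcher m = Pattern.compile("(?<=(\\w\\s){1}?)(\\d+.\\d+)").matcher(text);
        List<String> found = new ArrayList<>();
        while (m.find()) {
            found.add(m.group(2));
        }
        return found;
    }

    public static void main(String[] args) {
        String sample = "Property Size\nJerusalem 811.00\nA new property agreement\n";
        System.out.println(numbersAfterWord(sample));

        // Assumed extra data row (not in the question) to show what gets picked up.
        String twoRows = "Property Size\nJerusalem 811.00\nHaifa 63.50\n";
        System.out.println(numbersAfterWord(twoRows));
    }
}
```

With the assumed second row, both numbers are returned, since each is preceded by a word character and a space.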

njk18
  • I guess you could shorten this a bit ([demo](https://regex101.com/r/h0sDWm/1)). Using `{1}` is afaik always redundant and can be spared (I could not imagine any case at least). – bobble bubble Nov 24 '22 at 01:27
  • @bobblebubble you do notice that your regex(together with the regex of the solution) does not take into account the value under the word `size` but rather only picks the digits which are preceded by a word. eg if you had another row of data with digits, those digits will be picked if preceded by a word and space. – Onyambu Nov 24 '22 at 04:38
  • @onyambu Have you tried [my demo? e.g. with `foo` above](https://regex101.com/r/5d97Iq/1) It is for extracting the number below the specified word above. It's a "vertical" match, which is the challenge :) – bobble bubble Nov 24 '22 at 04:45