
I'm new to the regex world and I've browsed many sites without finding what I'm looking for. I have a file from which I need to fetch an address. The address is left-aligned on the page (there is other text to its right on the same line).

Some information on multiple line (6)
that I don't need and can't paste because
it contains some personal information. 
So imagine a lot of text here...
So imagine a lot of text here...
So imagine a lot of text here...

Sold To                                              Bill To
Some Cie                                             Some Other Cie
1111 chemin some-road                                2222 chemin some-other-road
City-Here QC J0Q 1W0                                 Other City-Here QC J0Q 1W0 
Canada                                               Canada

I need to fetch the text on the 'Sold To' side. I tried to use \r but it returns nothing! I don't know how to fetch the text from the start of the line up to the point where there's a run of spaces. Ex: Some Cie (as soon as there is more than one space, move on to the next line)

Then I have: Sold\sTo(?=\s{2,100}) but it doesn't work, while (?=\s{2, 100}) returns everything!

I saw this: ^((?:\S+\s+){2}\S+).*, which is very close to what I want, but I don't understand the whole thing. I would like to match from 2 to 5 words.

Then I have this: ^([A-Za-z0-9-]*)(?=\s{2,100}) which I thought would match from the beginning of the line until there are two or more spaces. What am I getting wrong?

I need to do this in pure regex (the tool I use doesn't allow flags).

I'm completely lost. Some guidance would be much appreciated.

wjandrea
mr info

1 Answer


You're pretty close on your last attempt. Here's what I came up with:

^.+?(?=[^\S\n]{2,})

Explanation:

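  • ^ - Start of the line (or start of the string when multiline mode is off)
  • .+ - One or more characters (any character except newline)
    • ? - Non-greedy, to give the next part priority, i.e. stop before the run of spaces instead of matching into it
  • (?=...) - Lookahead: what follows must match, but it is not included in the result
  • [^\S\n] - Any whitespace character except newline (this is like \s minus \n)
    • {2,} - Two or more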
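
To illustrate why the non-greedy `?` matters, here is a small sketch comparing the pattern with and without it. The `line` value is an assumption modelled on the "Sold To" row of the example (46 spaces between the columns), not the real file.

```python
import re

# One line modelled on the example: left column, a run of spaces, right column.
line = "Sold To" + " " * 46 + "Bill To"

# Non-greedy: stops at the first position followed by 2+ spaces/tabs.
lazy = re.match(r'^.+?(?=[^\S\n]{2,})', line).group()

# Greedy: grabs as much as it can, then backtracks only until 2 spaces remain,
# so the match still carries most of the padding.
greedy = re.match(r'^.+(?=[^\S\n]{2,})', line).group()

print(repr(lazy))
print(repr(greedy))
```

The greedy version still begins with "Sold To", but it drags along all but the last two padding spaces, which is the "bunch of spaces" the non-greedy form avoids.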

Matches from the example:

Sold To
Some Cie
1111 chemin some-road
City-Here QC J0Q 1W0
Canada
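
The matches above can be reproduced on a whole string with `re.findall` and the multiline flag. This is only a sketch: the `rows` list and the column width of 53 are assumptions that rebuild the sample layout from the question.

```python
import re

# Rebuild the two-column sample; the left column is padded to 53 characters.
rows = [
    ("Sold To", "Bill To"),
    ("Some Cie", "Some Other Cie"),
    ("1111 chemin some-road", "2222 chemin some-other-road"),
    ("City-Here QC J0Q 1W0", "Other City-Here QC J0Q 1W0"),
    ("Canada", "Canada"),
]
sample = "So imagine a lot of text here...\n\n"
sample += "\n".join(f"{left:<53}{right}" for left, right in rows) + "\n"

# MULTILINE makes ^ match at the start of every line, not just the string.
matches = re.findall(r'^.+?(?=[^\S\n]{2,})', sample, flags=re.MULTILINE)

for m in matches:
    print(m)
```

The first line of `sample` produces no match because it never contains two consecutive spaces.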

Try it on Regex101

Simple example in Python:

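import re

# Match from the start of the line up to the first run of 2+ spaces/tabs
pattern = re.compile(r'^.+?(?=[^\S\n]{2,})')

filename = 'input.txt'  # placeholder: replace with the path to your file

with open(filename) as f:
    for line in f:
        # match() only tries the start of each line, so no multiline flag is needed
        m = pattern.match(line)
        if m:
            print(m.group())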
wjandrea
  • It works, but only if the 'Global' and 'multiline' flags are enabled. Is there a way to make it work without specifying those? I'm using a tool that doesn't allow me to specify any flags – mr info Jan 27 '20 at 18:26
  • I found out that (?m) enables multiline. Now I'm pretty sure the issue is coming from my tool. Thanks! – mr info Jan 27 '20 at 18:50
  • @mrinfo Global is not a real flag in Python. Instead you use a different function, like `re.search` vs `re.findall`. Regex101 just uses it for convenience. Also, you don't need multiline if you iterate over the lines. I posted a simple example. – wjandrea Jan 27 '20 at 19:24
  • In fact, it's because I'm using a tool that uses Python; I can't write my own Python code, only a regex in a yml file. To be honest, it's kind of weird... When I use my real file, it fetches far too many lines, so I think I need to add some delimiters in there – mr info Jan 28 '20 at 14:21
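
For the case discussed in these comments, where only the pattern itself can be supplied, a minimal sketch of the inline `(?m)` flag follows. The `text` value is a made-up two-line sample, and how the yml-driven tool actually calls the regex is unknown, so `re.findall` here is only an assumption.

```python
import re

# Made-up two-column sample, shorter than the real file.
text = "Some Cie      Some Other Cie\nCanada        Canada\n"

# Inline flag: (?m) at the very start of the pattern turns on multiline mode,
# so nothing has to be passed through the flags argument.
inline = re.findall(r'(?m)^.+?(?=[^\S\n]{2,})', text)

# Same pattern with the flag passed separately, for comparison.
flagged = re.findall(r'^.+?(?=[^\S\n]{2,})', text, flags=re.MULTILINE)

print(inline)
print(flagged)
```

Note that since Python 3.11 a global inline flag such as `(?m)` is only accepted at the start of the expression.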