
I am trying to write a regular expression that matches the firm name in addresses like this:

Vice President of Compliance
10004 South 152nd St. #A
Omaha

I tried the following expression to match the Vice President of Compliance string, but it does not work. Effectively, I am trying to match the line that precedes the beginning of the address.

\w.+(?=\d+(?=\s+))

Can someone please guide me on this?

Wiktor Stribiżew
pb_ng
  • Whitespace is what follows that line, not digits. Have a look at https://regex101.com/r/EIXX42/1. – Wiktor Stribiżew Jan 27 '17 at 10:31
  • Thanks again, Wiktor. I changed it to match whitespace, but now it matches both lines: https://regex101.com/r/kq6nth/1. Kindly advise. – pb_ng Jan 27 '17 at 10:34
  • You have the `g` global modifier enabled. Match just once. Also, `\s` matches any whitespace. Do you really need that, or do you intend to match line breaks? See https://regex101.com/r/kq6nth/2. – Wiktor Stribiżew Jan 27 '17 at 10:35
  • I think a line break would be more apt in this case, because I would like to match the line preceding the line of the address. Thanks again – pb_ng Jan 27 '17 at 10:39
  • I am afraid you are reinventing the wheel: there are lots of address parsing libraries out there. Look, there is a [Python one here](https://github.com/datamade/usaddress) that parses unstructured US address strings into address components. Besides, you have not specified what the input can look like. Maybe the firm name is always the first line; then why not just split the string on newlines and take the first item? It is difficult to answer the question well without these details. – Wiktor Stribiżew Jan 27 '17 at 10:42
  • I am actually using these expressions with a third-party web scraping tool (WebHarvy), so I cannot use libraries there. Also, not all the records have firm names or titles listed. Some of them just have the addresses. – pb_ng Jan 27 '17 at 10:46

1 Answer

Your \w.+(?=\d+(?=\s+)) pattern matches a word character (anywhere in the input) followed by one or more characters other than line break characters (so \w cannot match the last character on a line), and that text must be followed by one or more digits that are in turn followed by one or more whitespace characters. Since the firm name line is followed by a line break rather than by digits, the pattern does not match the first line, where the firm name is.
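To see where the original pattern actually lands, here is a minimal sketch in Python. It assumes the record uses plain \n line breaks and that Python's `re` module behaves like the tool's regex flavor for this pattern; the variable names are only illustrative.

```python
import re

# Sample record from the question, assuming "\n" line breaks.
text = "Vice President of Compliance\n10004 South 152nd St. #A\nOmaha"

# The pattern from the question, unchanged.
original = re.compile(r"\w.+(?=\d+(?=\s+))")
match = original.search(text)

# The title line contains no digits, so the lookahead never succeeds there.
# The first match is found inside the street line instead: it stops right
# before the last digit of "10004", which is the digit followed by a space.
print(repr(match.group()))  # prints '1000'
```

The match starts on the second line, which is why the title never shows up in the result.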

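If you want to match a line that starts with a word character, continues with one or more other characters, and is followed by a line break and a digit, you may use the pattern below (enable the multiline flag if that line is not the first one in the input, since ^ otherwise matches only at the start of the string):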

^\w.+(?=\r?\n\d)

See the regex demo.
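As a rough illustration of the lookahead pattern above, the following Python sketch applies it to a record with a title line and to an address-only record. The helper name `firm_line` and the sample records are assumptions made for the example, and the input is assumed to use \n line breaks.

```python
import re

# ^ anchors at each line start because of re.MULTILINE; the lookahead
# requires a line break followed by a digit right after the matched line.
line_before_number = re.compile(r"^\w.+(?=\r?\n\d)", re.MULTILINE)

def firm_line(record):
    """Return the line that directly precedes a line starting with a digit."""
    m = line_before_number.search(record)
    return m.group() if m else None

with_title = "Vice President of Compliance\n10004 South 152nd St. #A\nOmaha"
address_only = "10004 South 152nd St. #A\nOmaha"

print(firm_line(with_title))    # prints Vice President of Compliance
print(firm_line(address_only))  # prints None
```

Because the lookahead does not consume the line break or the digit, the match itself contains only the title line. A record that starts directly with the address yields no match at all.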

Since WebHarvy will "extract only those portion(s) of the main text which matches the group(s) specified in the RegEx string" and it seems the regex flavor is ECMAScript 5, you may wrap the part you need in a capturing group:

(?:\n|^)(\w.+)\r?\n\d
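The capturing-group variant can be tried out with the Python sketch below. It only imitates the "return the group" behavior quoted above and is not WebHarvy itself; the helper name `extract_group` and the extra "More text here" line are assumptions for the example, and \n line breaks are assumed.

```python
import re

# No multiline flag here: (?:\n|^) lets the match begin either at the start
# of the input or right after any line break.
grouped = re.compile(r"(?:\n|^)(\w.+)\r?\n\d")

def extract_group(record):
    """Return capture group 1, roughly what a group-based extractor would keep."""
    m = grouped.search(record)
    return m.group(1) if m else None

record = (
    "More text here\n"
    "Vice President of Compliance\n"
    "10004 South 152nd St. #A\n"
    "Omaha"
)

print(extract_group(record))  # prints Vice President of Compliance
print(extract_group("10004 South 152nd St. #A\nOmaha"))  # prints None
```

The whole match also includes the leading line break and the first digit of the address, but only the text inside the parentheses ends up in group 1.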
Wiktor Stribiżew
  • Thanks a lot, Wiktor, for checking out WebHarvy for me. This really helps me understand the nuances of regex better. The expression you provided matches the second line, as shown in the demo. Could you please tell me how I can edit the expression to match only the first line, 'More text here'? – pb_ng Jan 27 '17 at 11:39
  • The first line can be matched with a mere `^.+` regex, or `^.*` (if it can be empty). – Wiktor Stribiżew Jan 27 '17 at 11:51
  • Thanks, Wiktor. Could you please suggest a good resource where I can learn regex? – pb_ng Jan 27 '17 at 11:55
  • I can only suggest doing all lessons at [regexone.com](http://regexone.com/), reading through [regular-expressions.info](http://www.regular-expressions.info), the [regex SO tag description](http://stackoverflow.com/tags/regex/info) (with many other links to great online resources), and the community SO post called [What does the regex mean](http://stackoverflow.com/questions/22937618/reference-what-does-this-regex-mean). Also, [rexegg.com](http://rexegg.com) is worth having a look at. – Wiktor Stribiżew Jan 27 '17 at 11:57
  • Thanks a ton. I am actually following regexone.com now and going through their exercises :) – pb_ng Jan 27 '17 at 12:00