Language being used: PHP

Let's say I have an expression like this:

Ayala NOT ("Ayala Station" OR "Ayala Branch" OR "Joey Ayala")

And I want to extract the following words:
- Ayala
- Ayala Station
- Ayala Branch
- Joey Ayala

I want to retrieve every phrase enclosed in double quotation marks (" ") as well as stand-alone words such as Ayala in the example above, but my experiments have failed so far.

I tried several regexes:

1st attempt:

"([^"]+)" - I'm aware that this regex is the correct one for getting words/phrases inside double quotation mark

2nd attempt:

~\w+(?:-\w+)*~ - this regex will get all words from a given expression or string

3rd attempt:

Combining the two attempts above: "([^"]+)"|~\w+(?:-\w+)*~ - this extracts the quoted phrases from my desired output, but with the two patterns combined, the word Ayala isn't being extracted

Example playground regex101

4th attempt:

I tried "([^"]+)"|\S+, but it also includes special characters such as the parentheses

Am I missing something with the regex?

Suomynona
  • NOT and OR are both standalone words too, why don't they get matched too? Are they just special exceptions? – CertainPerformance Oct 04 '19 at 02:54
  • They are being matched too, sir; I just use this code to omit them: `$arrkeywords = array_map('strtolower', $arrkeywords); $arrkeywords = array_diff($arrkeywords, array("or", "and"));` – Suomynona Oct 04 '19 at 02:55
  • Remove the `~` delimiters from the middle of the pattern; delimiters should only occur at the very start and very end of the pattern: https://regex101.com/r/nP6wM5/8 – CertainPerformance Oct 04 '19 at 02:58
  • excellent Captain! may you kindly give an answer below so I can upvote and mark your comment as the answer? :) – Suomynona Oct 04 '19 at 03:05
  • [Another idea at regex101](https://regex101.com/r/XupiH4/1): use a [branch reset](https://www.regular-expressions.info/branchreset.html) to get the matches in **group 1** and [`(*SKIP)(*F)`](https://stackoverflow.com/questions/24534782/how-do-skip-or-f-work-on-regex) to skip the unwanted parts. The delimiter issue has been pointed out already. Further, I don't think it's a good idea to work with lookarounds and balanced quotes (e.g. [see this](https://regex101.com/r/nP6wM5/14)). – bobble bubble Oct 04 '19 at 09:48

1 Answer

The right side of the alternation should not contain regex delimiters. The delimiters belong only around the entire pattern (directly inside the PHP string quotes). For example:

"([^"]+)"|\w+(?:-\w+)*
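As a minimal sketch of how this pattern might be used from PHP, assuming `~` as the delimiter and `preg_match_all` with `PREG_SET_ORDER` (the variable names are illustrative, not from the question): the quoted text lands in group 1, while standalone words are only in the full match, so the loop picks whichever is present.

```php
<?php
// Input taken from the question's example expression.
$expression = 'Ayala NOT ("Ayala Station" OR "Ayala Branch" OR "Joey Ayala")';

// The ~ delimiters wrap the whole pattern exactly once, at the start and end.
$pattern = '~"([^"]+)"|\w+(?:-\w+)*~';

preg_match_all($pattern, $expression, $matches, PREG_SET_ORDER);

$keywords = [];
foreach ($matches as $m) {
    // Group 1 is present only when the quoted alternative matched;
    // otherwise fall back to the full match (a standalone word).
    $keywords[] = (isset($m[1]) && $m[1] !== '') ? $m[1] : $m[0];
}

print_r($keywords);
```

Note that NOT and OR are still among the results here; they can be filtered out afterwards (for instance with `array_diff`, as in the comments) or excluded in the pattern itself.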

If you want each result to be the full match itself, so that no capture groups are needed, you can use lookarounds for the quotes:

(?<=")\b[^"]+(?=")|\w+(?:-\w+)*

https://regex101.com/r/nP6wM5/10
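A short sketch of the lookaround variant in PHP, again assuming `~` delimiters and illustrative variable names: because the quotes are only asserted and never consumed, every keyword is available directly in `$matches[0]`.

```php
<?php
// Input taken from the question's example expression.
$expression = 'Ayala NOT ("Ayala Station" OR "Ayala Branch" OR "Joey Ayala")';

// Lookbehind/lookahead assert the surrounding quotes without consuming them.
$pattern = '~(?<=")\b[^"]+(?=")|\w+(?:-\w+)*~';

preg_match_all($pattern, $expression, $matches);

// With the default PREG_PATTERN_ORDER, index 0 holds all full matches.
$keywords = $matches[0];

print_r($keywords);
```

One caveat with this approach: since the quotes are not consumed, the pattern cannot tell an opening quote from a closing one. For an input such as `"a"b c"d"`, the text `b c` sitting between two quoted phrases is also returned as if it were a phrase.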

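To exclude NOT and OR in the regex itself, add a negative lookahead for them right before matching the standalone words:

(?<=")\b[^"]+(?=")|\b(?!(?:NOT|OR)\b(?!-))\w+(?:-\w+)*

The \b after the NOT|OR group restricts the exclusion to the whole words, so a word that merely starts with those letters (NOTHING, for example) is still matched, and the (?!-) keeps hyphenated words such as NOT-x.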
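The exclusion can be sketched in PHP as follows. The function name `extractKeywords` is hypothetical, and the pattern here has a `\b` directly after the `NOT|OR` group so that only the whole words NOT and OR are skipped; the match is case-sensitive, so lowercase `not` and `or` would still be returned.

```php
<?php
// Hypothetical helper; the name is illustrative and not from the original post.
function extractKeywords(string $expression): array
{
    // First alternative: text between double quotes (quotes asserted, not consumed).
    // Second alternative: a standalone word, unless it is exactly NOT or OR
    // (hyphenated words such as NOT-x are still allowed through).
    $pattern = '~(?<=")\b[^"]+(?=")|\b(?!(?:NOT|OR)\b(?!-))\w+(?:-\w+)*~';

    preg_match_all($pattern, $expression, $matches);

    return $matches[0];
}

$keywords = extractKeywords('Ayala NOT ("Ayala Station" OR "Ayala Branch" OR "Joey Ayala")');
print_r($keywords);
```

Doing the exclusion in the pattern avoids the separate `strtolower` and `array_diff` pass, at the cost of a denser regex; which is preferable probably depends on how often the operator list changes.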

CertainPerformance