
I want to create a regex pattern which finds whitespace and ignores hyphen-separated words.

The basic rule is to find any run of consecutive whitespace ([\s]+), except whitespace that is part of this pattern:

[\S]+-[\s]+[\S]+ (the pattern whose whitespace I don't want to match)

Any other whitespaces should match.

Matched intervals should include whitespaces only, not other characters.

For example:

abc abc

should match at position 3-4.

abc
def

should match from the end of abc to start of def.

abc-

def

should not match.

abc -

def

should match at 3-4, 5-6.

The searched string is multiline and has many occurrences of whitespace, and I want to find them all in a single search.

I tried many different patterns (with negative lookahead and lookbehind), but none of them worked for all cases.

I am using Python's built-in re module.

It is possible to do this in two searches:

  1. search for all occurrences of [\s]+

  2. search for all occurrences of [\S]+-([\s]+)[\S]+

  3. remove the group matches found in (2) from the matches found in (1)
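The three steps above can be sketched roughly as follows. This is only an illustration: the function name `whitespace_spans_two_pass` is made up here, and step 2 uses a lookahead `(?=\S)` in place of the trailing `[\S]+`, an assumption made so that the word after the break is not consumed and may itself end in a hyphen.

```python
import re

def whitespace_spans_two_pass(text):
    """Return (start, end) spans of whitespace runs, minus hyphen line breaks."""
    # Step 1: every run of whitespace.
    all_runs = {m.span() for m in re.finditer(r"\s+", text)}
    # Step 2: whitespace that directly follows "<non-space>-" and is followed
    # by a non-space character. The lookahead is an assumption of this sketch;
    # the original pattern ends in [\S]+ and would consume the next word.
    hyphen_runs = {m.span(1) for m in re.finditer(r"\S-(\s+)(?=\S)", text)}
    # Step 3: drop the spans found in step 2 from those found in step 1.
    return sorted(all_runs - hyphen_runs)
```

Spans are half-open `(start, end)` offsets, the same convention the examples in the question use.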

Is it possible to do this in a single search?

Montoya
  • Try `r'(?<=[^\s-])\s+(?=\S)|(?<=\S)\s+(?=[^\s-])'` or `r'(?<=[^\s-])\s+(?=[^\s-])'` – Wiktor Stribiżew Mar 03 '20 at 13:20
  • Maybe this is the way, but still not working. https://regex101.com/r/qSZq9P/7 see this example. This should not match, similar to example 3 in the question – Montoya Mar 03 '20 at 13:20
  • See https://regex101.com/r/qSZq9P/9, `r'(?<=[^\s-])\s+(?=[^\s-])'` – Wiktor Stribiżew Mar 03 '20 at 13:21
  • This fixes the problem above, but now a string like "dsa - dsa" should match twice, once at each space, and it does not with the current pattern. As in example (4). – Montoya Mar 03 '20 at 13:23
  • Maybe `(?<!-)\s+|(?<=\s-)\s+` https://regex101.com/r/RqNmpx/1 – The fourth bird Mar 03 '20 at 13:26
  • Try `r'(?<![^-\s]-(?=\s+[^-\s]))\s+'` – Wiktor Stribiżew Mar 03 '20 at 13:27
  • @Thefourthbird matched for: "abc- def" with multiple spaces. – Montoya Mar 03 '20 at 13:28
  • It does not match that, should it? https://regex101.com/r/SZJ4pp/1/ – The fourth bird Mar 03 '20 at 13:30
  • @Thefourthbird see this https://regex101.com/r/qSZq9P/10. It shouldn't match it. – Montoya Mar 03 '20 at 13:31
  • Please see https://ideone.com/VhOatC, not sure what code and real input you have, it might be much simpler if we had more details. – Wiktor Stribiżew Mar 03 '20 at 13:33
  • My inputs are raw text extracted from documents crawled from the web, so many cases may happen. I wish to split the text into words, and keep intact the words which were separated by a hyphen in some document formats like pdf. For example, Hello may appear in the document as Hel-\n\n\n\nlo, so if the regex finds these newline chars, the word Hello will be split in half into two words -> "Hel" and "lo". – Montoya Mar 03 '20 at 13:40
  • So, you want something like `re.findall(r'[^\s-]+(?:-\s+[^\s-]+)*', sample)`? – Wiktor Stribiżew Mar 03 '20 at 13:53
  • This matches all the normal characters well, but a solution that matches all the whitespace is preferred (because of the algorithm that is used after that). @WiktorStribiżew, if we go that way, the only thing that does not work is a hyphen surrounded by spaces, which should also match, e.g. ' - '. – Montoya Mar 03 '20 at 15:44
  • But [the solution I suggested already](https://stackoverflow.com/questions/60508076/regex-find-all-whitespaces-and-ignore-hyphen-separated-words-in-multiline-stri?noredirect=1#comment107044094_60508076) matches them, see https://ideone.com/Dp8b4k – Wiktor Stribiżew Mar 03 '20 at 16:04
  • @WiktorStribiżew Your pattern fails for "abc-\n\n\ndef". This should not match, see https://ideone.com/NpA7HJ – Montoya Mar 03 '20 at 16:14
  • You seem to want to perform some split operation, but this requires a regex that supports lookbehinds of variable length, or a SKIP-FAIL regex, and Python `re` does not support those. If you are willing to follow that path, you must install the PyPi regex module with `pip install regex`. If you want to continue the journey with `re`, you need to change the logic, and use the simple technique that proved useful at all times: match and capture what you need and only match what you do not need. See [this Python demo](https://ideone.com/snD7h1). – Wiktor Stribiżew Mar 03 '20 at 21:51
  • Using the `pip install regex` module is an option. What would be the pattern then? – Montoya Mar 04 '20 at 07:10
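The `re`-only technique mentioned in the comments above (match and capture what you need, only match what you do not need) might look something like the sketch below. It is not the code from the linked demo; the pattern and the names `PATTERN` and `whitespace_spans` are assumptions written for illustration. The first alternative consumes a hyphen line break without capturing it, and the second captures every other whitespace run in group 1.

```python
import re

# Alternative 1: "<non-space>-" + whitespace + (lookahead) non-space.
#                This is the hyphenated break; it is matched but not captured.
# Alternative 2: any other whitespace run, captured in group 1.
PATTERN = re.compile(r"\S-\s+(?=\S)|(\s+)")

def whitespace_spans(text):
    """Return (start, end) spans of whitespace that is not a hyphen break."""
    return [
        m.span(1)
        for m in PATTERN.finditer(text)
        if m.group(1) is not None  # skip matches of alternative 1
    ]
```

Because the hyphen break is consumed by the first alternative, the scan resumes after it, so its whitespace is never offered to the capturing alternative. A hyphen preceded by a space (as in `abc -`) does not satisfy `\S-`, so the whitespace on both sides of it is still captured.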

1 Answer


How about this:

(?<![\s\-])[\s](?!\-\s\n)
  1. (?<![\s\-]) negative lookbehind: the whitespace must not be preceded by whitespace or a hyphen
  2. (?!\-\s\n) negative lookahead: the whitespace must not be followed by a hyphen, a whitespace character, and a newline

Edited:

Try this:

(?<![\s])[\s+](?!\-\s\n)(?!\n{2})

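After reviewing your sample data, I see that some of it contains double newlines, so:

  1. (?<![\s]) negative lookbehind: must not be preceded by whitespace
  2. (?!\-\s\n) negative lookahead: must not be followed by a dash, a whitespace character, and a newline
  3. (?!\n{2}) negative lookahead: must not be followed by a double newline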
dhentris
  • Does not work for: "abc - abc". Two matches should be produced. One before the '-' and one after. – Montoya Mar 03 '20 at 15:37
  • After edit - fails for "abc\n\ndef". It should match any whitespaces, except those specified in the question. – Montoya Mar 08 '20 at 07:39
  • After all, I think it can only be done with a multi-step regex, because in your data the accepted and rejected whitespace occur under nearly the same conditions. Let's wait; maybe somebody has another solution. – dhentris Mar 08 '20 at 09:30