
Is there a regex that matches all the spaces separating key=value pairs while ignoring the spaces inside double-quoted values?

Sample:

key1=value1 key1=value1 spaces="some spaces in text" nested1="key2=value2 key2=value2 key2=value2"  nested2="key2=value2, key2=value2, key2=value2" quoted="his name is \"no body\""

This is what I have come up with so far: `(?<!,) (?=\w+=)`, but of course it doesn't work.
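To see where the attempt goes wrong, here is a small check with Python's built-in `re` module. It is only a sketch: `inside_quotes` is a hypothetical helper that counts the double quotes before a position, which is crude (it ignores escaping) but is enough for the positions matched in this sample.

```python
import re

# Sample line from the question (raw string, so \" stays a backslash plus a quote).
SAMPLE = r'key1=value1 key1=value1 spaces="some spaces in text" nested1="key2=value2 key2=value2 key2=value2"  nested2="key2=value2, key2=value2, key2=value2" quoted="his name is \"no body\""'

# The attempt from the question: a space not preceded by a comma
# and followed by something that looks like "key=".
ATTEMPT = re.compile(r'(?<!,) (?=\w+=)')

def inside_quotes(line, index):
    """Hypothetical helper: True if an odd number of '"' precede index.

    Crude on purpose: it does not understand escaped quotes.
    """
    return line.count('"', 0, index) % 2 == 1

# Start offsets of every space the attempt matches.
hits = [m.start() for m in ATTEMPT.finditer(SAMPLE)]
# The subset of those matches that sit inside a quoted value.
wrong = [i for i in hits if inside_quotes(SAMPLE, i)]

if __name__ == "__main__":
    print(len(hits), "matches,", len(wrong), "of them inside quotes")
```

The spaces inside the `nested1` value are followed by `key2=`, so the lookahead accepts them even though they are not separators; the `(?<!,)` lookbehind only rescues the comma-separated `nested2` value.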

Abhishek Bhagate
milevyo
  • You'd better match them using `(\w+)=(?:"([^"\\]*(?:\\.[^"\\]*)*)"|(\S+))`. See https://regex101.com/r/ShFlKl/1/ – Wiktor Stribiżew Jun 26 '20 at 20:34
  • Thank you @WiktorStribiżew, but I am interested in the spaces (separators), not the data – milevyo Jun 26 '20 at 20:37
  • So you'd like to match the space between spaces="some spaces in text" and nested1="key2=value2 key2=value2"? Sure, that is possible if you're using the right regex engine. Which one would that be? –  Jun 26 '20 at 20:44
  • @Edward Exactly. – milevyo Jun 26 '20 at 20:47
  • Partly modified regex will match whitespaces you need, see https://regex101.com/r/ShFlKl/2 - it will work with PCRE, PyPi regex and PCRE.NET library in .NET. But there can be other workarounds if you explain what you are doing and what the programming language and regex flavor are. – Wiktor Stribiżew Jun 26 '20 at 20:48
  • You can match overlapped: the first half plus the spaces, then write back the first half. Or, if you are using PCRE or Perl, you can ignore the first half with `\K`. I won't dangle a regex in front of you; when you're ready, let me know. –  Jun 26 '20 at 20:48
  • @Edward, you are awesome. Can you post your answer? No matter how it is written, I will accept it. – milevyo Jun 26 '20 at 21:00
  • Sure, will post something –  Jun 26 '20 at 21:05
  • @Edward, I already resolved the question thanks to you. Your comment by itself is good enough for an answer. – milevyo Jun 26 '20 at 21:09
  • When you give an example it's generally helpful to the reader to show the expected result, which here would be a string. Although you've said what you want that might be misinterpreted. I assume that key value pairs are identified by the equals sign, but that should also be clarified. – Cary Swoveland Jun 26 '20 at 22:30

3 Answers

1

[^\s="]+\s*=\s*(?:"[^"\\]*(?:\\[\S\s][^"\\]*)*"|[^\s=]+)\K[ \t]+

PCRE demo

No need to write anything back: this matches only the space delimiters,
which can then be replaced with a new delimiter.


([^\s="]+\s*=\s*(?:"[^"\\]*(?:\\[\S\s][^"\\]*)*"|[^\s=]+))([ \t]+)

Python demo

Write back \1 (the field) or \2 (the spaces) as needed,
or replace \2 with a new delimiter.


Note: the part of the above expressions that matches the field
could benefit from being wrapped in an atomic group (?>...), but that is
not strictly necessary, as the field structure is fairly concise.

There are other options to guarantee integrity as well, such as matching
every character with the help of the \G anchor, if it is available.
Let me know if you need that approach.

There are many ways to go here.
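For readers on Python's built-in `re` module, which has no `\K`, the capturing variant can be tried roughly as follows. This is a sketch, not the answerer's demo: the function name `replace_separators` and the `|` delimiter are made up for illustration. It also makes one adjustment that is not in the answer: the bare-value branch is prefixed with `(?!")`. Without it, when a quoted value is the last thing on the line, the engine can backtrack, read `"his` as a bare value, and match the space after it.

```python
import re

# Sample line from the question (raw string, so \" stays a backslash plus a quote).
SAMPLE = r'key1=value1 key1=value1 spaces="some spaces in text" nested1="key2=value2 key2=value2 key2=value2"  nested2="key2=value2, key2=value2, key2=value2" quoted="his name is \"no body\""'

# Group 1: one key=value field, with either a quoted or a bare value.
# Group 2: the run of spaces/tabs that follows the field.
# Assumption, not in the answer: (?!") in front of the bare-value branch.
FIELD_THEN_GAP = re.compile(
    r'([^\s="]+\s*=\s*(?:"[^"\\]*(?:\\[\S\s][^"\\]*)*"|(?!")[^\s=]+))([ \t]+)'
)

def replace_separators(line, new_delimiter="|"):
    """Keep each field (group 1) and swap the spaces after it for new_delimiter."""
    # A callable replacement avoids backslash handling in the delimiter text.
    return FIELD_THEN_GAP.sub(lambda m: m.group(1) + new_delimiter, line)

if __name__ == "__main__":
    print(replace_separators(SAMPLE))
```

On the sample line this replaces the five separator runs, including the double space after `nested1`, and leaves the spaces inside quoted values alone. It has only been reasoned through against that one sample, so treat it as a starting point.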

  • Better than what I did: `\w+=\"[^\"]+\"\K\s|\w+=[^\"]\S{0,}\K\s`. Many thanks – milevyo Jun 26 '20 at 21:49
  • Nice one, Edward. I have this theory that you initially write your answers in flawless English--no grammatical or spelling errors--because that's the fastest way, then you get to work on the text with your ball-peen hammer... – Cary Swoveland Jun 26 '20 at 22:25
  • Check out this regex to match a complete JSON object: https://regex101.com/r/H8datD/1 or https://regex101.com/r/H8datD/2 –  Jun 27 '20 at 22:26
1

Here is another option:

".*?(?<!\\)"(*SKIP)(*F)| +

See the online demo

Please do let me know if it actually does what is required as I'm unsure. Anyways, here is a breakdown:

  • " - A literal double quote.
  • .*? - Any character except newline, zero or more times, matched lazily.
  • (?<!\\) - A negative lookbehind for \.
  • " - A literal double quote.
  • (*SKIP)(*F) - Discard the quoted text matched so far, force a failure, and resume matching after it, so nothing inside the quotes can match.
  • | - Alternation.
  • ` +` (a space followed by +) - One or more space characters.

If you are using Python, you'll need the PyPI regex module, since the built-in re module does not support (*SKIP)(*F).
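If installing an extra module is not an option, the same idea can be approximated with the built-in `re` module: match a quoted string or a run of spaces, capture only the spaces, and let a replacement callback put quoted strings back unchanged. The sketch below reuses the quoted-string part of the pattern above; the function name `replace_outside_quotes` and the `|` delimiter are made up for illustration.

```python
import re

# Sample line from the question (raw string, so \" stays a backslash plus a quote).
SAMPLE = r'key1=value1 key1=value1 spaces="some spaces in text" nested1="key2=value2 key2=value2 key2=value2"  nested2="key2=value2, key2=value2, key2=value2" quoted="his name is \"no body\""'

# First branch: a whole quoted string (consumed, not captured).
# Second branch: a run of spaces outside quotes, captured in group 1.
QUOTED_OR_SPACES = re.compile(r'".*?(?<!\\)"|( +)')

def replace_outside_quotes(line, new_delimiter="|"):
    """Replace runs of spaces outside double quotes with new_delimiter."""
    def swap(match):
        # Group 1 is set only when the spaces branch matched.
        if match.group(1) is not None:
            return new_delimiter
        # Otherwise a quoted string matched: put it back untouched.
        return match.group(0)
    return QUOTED_OR_SPACES.sub(swap, line)

if __name__ == "__main__":
    print(replace_outside_quotes(SAMPLE))
```

Because the quoted branch comes first and consumes the whole quoted string, the spaces branch never gets a chance to start inside one, which is the job `(*SKIP)(*F)` does in the PCRE version.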

JvdV
  • Good one! (I think, anyway, as I have learned only the basics of how SKIP/FAIL can be used to advantage. It's at the top of my list, however, so I'll be back soon to dig into your answer.) – Cary Swoveland Jun 27 '20 at 00:30
  • Thanks @CarySwoveland. I am also new to this construct but found that the explanation [here](https://stackoverflow.com/a/24535912/9758194) was rather clear and helpful. – JvdV Jun 27 '20 at 10:10
1

You could do that with the following PCRE-compatible regular expression.

\G[^" \n]*(?:(?<!\\)"(?:[^\n"]|(?<=\\)")*(?<!\\)"[^" \n]*)*\K +

Start your engine!

\G           : assert position at the end of the previous match
               or the start of the string for the first match
[^" \n]*     : match 0+ chars other than those in char class 
(?:          : begin non-capture group
  (?<!\\)    : use negative lookbehind to assert next char is not 
               preceded by a backslash 
  "          : match double-quote
  (?:        : begin non-capture group
    [^"\n]   : match a char other than those in char class
    |        : or
    (?<=\\)  : use positive lookbehind to assert next char is 
               preceded by a backslash
    "        : match double-quote
  )          : end non-capture group
  *          : match non-capture group 0+ times
  (?<!\\)    : use negative lookbehind to assert next char is not 
               preceded by a backslash
  "          : match double-quote
  [^" \n]*   : match 0+ chars other than those in char class 
)            : end non-capture group
*            : match non-capture group 0+ times
\K           : forget everything matched so far and reset start of match
\ +          : match 1+ spaces
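Python's built-in `re` module has neither `\G` nor `\K`, but both can be imitated, which may help when experimenting with the expression above. In this sketch (the names `separator_spans` and `replace_spans` are made up for illustration), `pattern.match(line, pos)` stands in for `\G` by anchoring each attempt where the previous match ended, and a capture group around the spaces stands in for `\K`.

```python
import re

# Sample line from the question (raw string, so \" stays a backslash plus a quote).
SAMPLE = r'key1=value1 key1=value1 spaces="some spaces in text" nested1="key2=value2 key2=value2 key2=value2"  nested2="key2=value2, key2=value2, key2=value2" quoted="his name is \"no body\""'

# Group 1: everything the expression places before \K.
# Group 2: the separator spaces, i.e. what remains of the match after \K.
PREFIX_THEN_SPACES = re.compile(
    r'([^" \n]*(?:(?<!\\)"(?:[^\n"]|(?<=\\)")*(?<!\\)"[^" \n]*)*)( +)'
)

def separator_spans(line):
    """Return (start, end) offsets of each run of separator spaces."""
    spans = []
    pos = 0
    while True:
        # match() only succeeds starting exactly at pos, like \G would.
        m = PREFIX_THEN_SPACES.match(line, pos)
        if m is None:
            break
        spans.append(m.span(2))
        pos = m.end()  # the next attempt starts where this match ended
    return spans

def replace_spans(line, spans, new_delimiter="|"):
    """Rebuild line with each span replaced by new_delimiter."""
    pieces, last = [], 0
    for start, end in spans:
        pieces.append(line[last:start])
        pieces.append(new_delimiter)
        last = end
    pieces.append(line[last:])
    return "".join(pieces)

if __name__ == "__main__":
    print(replace_spans(SAMPLE, separator_spans(SAMPLE)))
```

The loop stops at the final `quoted=` field, because nothing after its closing quote can satisfy the trailing ` +`.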
Cary Swoveland