I am trying to parse and clean up some poorly formatted logs, which often have an excess of spaces. So basically I want to replace more than one space with one space. However, there are things that occur within quotes where the extra spaces are not extraneous, and I don't want to replace those. I have found plenty of resources that talk about replacing multiple spaces with one, but getting the negation, to not do it when inside of quotes, is giving me grief. I really wonder sometimes why RegEx logic just messes with my head so much.

EDIT: Examples

Jrn.Size        0 ,   3317 ,   1549

becomes

Jrn.Size 0 , 3317 , 1549

and

Jrn.Directive "GlobalToProj"   , "[File   Name.rvt]"

becomes

Jrn.Directive "GlobalToProj" , "[File   Name.rvt]"

The extra spaces after "GlobalToProj" are replaced, but the extra spaces inside "[File   Name.rvt]" are not.

Gordon
  • Show the input text and the expected output to get help quickly. – RomanPerekhrest Jan 01 '17 at 19:00
  • And please tag the question with the language/framework/platform you're using. – Mathias R. Jessen Jan 01 '17 at 19:01
  • Revised for both comments. – Gordon Jan 01 '17 at 19:12
  • This is probably easier done with lexing/parsing than a regex, especially if escaped quotes can appear inside other quotes. Tracking matching quotes in regexes is a nightmare – Andy Ray Jan 01 '17 at 19:15
  • Well, no chance of escaped quotes, so that's something. And it's reassuring that I am finding a hard thing hard, rather than an easy thing hard. ;) So, I am thinking I use regex to find things enclosed in quotes, replace those quoted strings with tokens, then replace multiple spaces with one, then replace the tokens. Which sounds like a lot of work and the kind of thing that maybe PowerShell has some cmdlets for? Or is it more something that Perl is good at, and I might need to roll my own for PowerShell? – Gordon Jan 01 '17 at 19:21
  • Will it always have a fixed format, with a comma as the delimiter between multiple substrings? – RomanPerekhrest Jan 01 '17 at 19:25
  • Roman, I THINK that's true, but I am dealing with log files of many hundreds of lines, and of course Autodesk won't document the syntax. My guess is because they don't even know themselves. I might need to get the rest of what I am working on working, then run it on some really massive files to get a better sense of which conditions I really need to address. Unfortunately I can see some situations where there might be commas in quotes, meaning there could be extraneous spaces in association with commas... in quotes. Sigh. – Gordon Jan 01 '17 at 19:29
  • Try to -replace with `("[^"]*")|( ){2,}` pattern and replace with `$1$2`, see [this demo](https://regex101.com/r/BEsWVX/1). – Wiktor Stribiżew Jan 01 '17 at 19:31
  • I think it would be easier to divide the problem into 3 steps. First, replace all spaces between quotes with another character that is never used in the source file (if there is one). Then replace multiple spaces with one space, and finally replace the placeholder character from the first step with a space again. – Fabrizio Jan 02 '17 at 10:48
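
The two ideas from the comments above can be sketched in PowerShell roughly as follows. This is only an illustration, not code posted by any of the commenters: the function names `Compress-SpacesByAlternation` and `Compress-SpacesByPlaceholder` are hypothetical, the placeholder character U+001F is an assumption (any character that never occurs in the logs would do), and both sketches assume that double quotes are balanced on each line and never escaped.

```powershell
# Hypothetical helper names; both assume balanced, unescaped double quotes.

# Approach 1 (the alternation pattern from the comments): match either a
# whole quoted string or a run of two or more spaces. A quoted string lands
# in group 1 and is written back unchanged. For a run of spaces, group 2
# keeps only its last iteration (one space), so the run becomes one space.
function Compress-SpacesByAlternation {
    param([string]$Line)
    $Line -replace '("[^"]*")|( ){2,}', '$1$2'
}

# Approach 2 (the three-step placeholder idea from the comments).
# ASSUMPTION: the placeholder (U+001F by default) never occurs in the input.
function Compress-SpacesByPlaceholder {
    param(
        [string]$Line,
        [string]$Placeholder = [char]0x1F
    )
    # Step 1: inside each quoted string, swap every space for the placeholder.
    $masked = [regex]::Replace($Line, '"[^"]*"', {
        param($m)
        $m.Value.Replace(' ', $Placeholder)
    })
    # Step 2: collapse the runs of spaces that remain (all outside quotes).
    $collapsed = $masked -replace ' {2,}', ' '
    # Step 3: turn the placeholder back into spaces.
    $collapsed.Replace($Placeholder, ' ')
}

Compress-SpacesByAlternation 'Jrn.Directive "GlobalToProj"   , "[File   Name.rvt]"'
Compress-SpacesByPlaceholder 'Jrn.Size        0 ,   3317 ,   1549'
```

Both calls should reproduce the expected output given in the question. The alternation version needs only one pass, while the placeholder version is easier to read but depends on finding a character that is guaranteed not to appear in the logs.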

1 Answer

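You can use this ingenious approach: a lookahead tests whether a match is followed by an even or odd number of quotes, which tells us whether we're outside or inside a quoted piece of text. An even number of quotes between the match and the end of the string means the match is outside any quotes, so it is safe to replace: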

PS C:\> 'Jrn.Directive "GlobalToProj"   , "[File   Name.rvt]"' -replace '\s+(?=([^"]*"[^"]*")*[^"]*$)',' '
Jrn.Directive "GlobalToProj" , "[File   Name.rvt]"

The pattern itself:

\s+(?=([^"]*"[^"]*")*[^"]*$)

breaks down to:

\s+         # one or more whitespace characters, followed by
(?=         # positive lookahead group containing
  (         # capture group containing
    [^"]*   # 0 or more non-doublequote characters
    "       # 1 doublequote mark
    [^"]*   # 0 or more non-doublequote characters
    "       # 1 doublequote mark
  )*        # group repeated 0 or more times
  [^"]*     # 0 or more non-doublequote characters
  $         # end of string
)           # end of lookahead
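
As a rough sketch of how the breakdown can be used directly: prepending `(?x)` makes the .NET regex engine ignore unescaped whitespace in the pattern and treat everything from `#` to the end of the line as a comment, so an annotated pattern can be passed to `-replace` as it stands. The variable names `$pattern`, `$lines` and `$cleaned` are made up for this example, and the capture group is swapped for a non-capturing `(?: )` group because its contents are never used.

```powershell
# Annotated pattern; (?x) turns on "ignore pattern whitespace" mode.
$pattern = @'
(?x)
\s+         # one or more whitespace characters, followed by
(?=         # a positive lookahead containing
  (?:       # a non-capturing group containing
    [^"]*"  #   any non-quote characters, then a quote
    [^"]*"  #   any non-quote characters, then a second quote
  )*        # repeated 0 or more times (an even number of quotes)
  [^"]*     # any trailing non-quote characters
  $         # end of string
)
'@

# The two sample lines from the question.
$lines = @(
    'Jrn.Size        0 ,   3317 ,   1549'
    'Jrn.Directive "GlobalToProj"   , "[File   Name.rvt]"'
)

# With an array on the left, -replace is applied to each element separately,
# so every line is cleaned on its own.
$cleaned = $lines -replace $pattern, ' '
$cleaned
```

Working line by line matters here. `\s` also matches line breaks, so running the same replacement over a whole file read as one string would join the lines together. `Get-Content` emits one string per line by default, so piping its output through the same `-replace` expression should behave like the array above.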
Mathias R. Jessen
  • Oh, that is just badass! I have a new one I would like to do, and this breakdown might finally get me past my comprehension block when it comes to RegEx. – Gordon Jan 02 '17 at 20:09