-1

I need to convert an input string containing multiple words into a string array in PowerShell. Words can be separated by multiple spaces and/or line breaks. Each word can be enclosed in single or double quotes. Some words may start with a hash character (#); in that case, any quoting appears after the hash character.

Here is a code sample of a possible input and the expected result:

$inputString = @"
  test1
  #custom1
  #"custom2"           #'custom3'
  #"custom ""four"""   #'custom ''five'''
  test2 "test3" 'test4'
"@

$result = @(
    'test1'
    '#custom1'
    '#"custom2"'
    "#'custom3'"
    '#"custom ""four"""'   
    "#'custom ''five'''"
    'test2' 
    '"test3"' 
    "'test4'"
)

Is there a way to do this with a clever regular expression? Or does anyone have a parser snippet or function to start from?

Carsten
  • It's going to be a nightmare to handle the `''`/`""` escape sequences with regex; you'd be better off writing a parser by hand (read the string one character at a time and decide whether it's a continuation of the previous token or not) – Mathias R. Jessen Oct 27 '21 at 14:29
  • Thank you for the quick feedback. Would it be possible to misuse the ConvertFrom-Csv cmdlet for this? – Carsten Oct 27 '21 at 14:32
  • No, it'll expect the delimiter to be uniform and it'll interpret anything starting with `#` as comment/metadata. You'll need to write your own – Mathias R. Jessen Oct 27 '21 at 14:33
  • Are the hashtags allowed to appear inside a string? If not, replacing them would make this task a lot easier. – marsze Oct 27 '21 at 14:50
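
The hand-written parser suggested in the comments above could look roughly like the following sketch. The function name Split-QuotedWords is hypothetical (it does not come from the thread), and the code assumes that quotes are balanced and that a doubled quote inside a quoted run is an escaped quote. It keeps the quoting in the output, as the expected result in the question does.

```powershell
# Hypothetical helper, not from the original thread: splits on whitespace,
# keeps the quotes, and treats a doubled quote inside a quoted run as an escape.
function Split-QuotedWords {
    param([string] $Text)

    $tokens  = [System.Collections.Generic.List[string]]::new()
    $current = [System.Text.StringBuilder]::new()
    $quote   = $null   # the active quote character, or $null outside quotes

    for ($i = 0; $i -lt $Text.Length; $i++) {
        $ch = $Text[$i]
        if ($null -ne $quote) {
            # Inside a quoted run: every character belongs to the token.
            [void] $current.Append($ch)
            if ($ch -eq $quote) {
                if ($i + 1 -lt $Text.Length -and $Text[$i + 1] -eq $quote) {
                    # Doubled quote: keep both characters and stay inside the run.
                    [void] $current.Append($ch)
                    $i++
                } else {
                    $quote = $null   # closing quote
                }
            }
        }
        elseif ([char]::IsWhiteSpace($ch)) {
            # Whitespace outside quotes ends the current token, if there is one.
            if ($current.Length -gt 0) {
                $tokens.Add($current.ToString())
                [void] $current.Clear()
            }
        }
        else {
            if ($ch -eq '"' -or $ch -eq "'") { $quote = $ch }   # opening quote
            [void] $current.Append($ch)
        }
    }
    if ($current.Length -gt 0) { $tokens.Add($current.ToString()) }
    $tokens.ToArray()
}

$inputString = @"
  test1
  #custom1
  #"custom2"           #'custom3'
  #"custom ""four"""   #'custom ''five'''
  test2 "test3" 'test4'
"@

$words = @(Split-QuotedWords $inputString)
$words
```

An unterminated quote simply swallows the rest of the input into the last token; whether that should raise an error instead depends on the use case.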

1 Answer

2

Assuming you fully control or implicitly trust the input string, you can use the following approach. It relies on Invoke-Expression, which should normally be avoided:

Assumptions made:

  • # only appears at the start of embedded strings.
  • No embedded string contains newlines itself.
$inputString = @"
  test1
  #custom1
  #"custom2"           #'custom3'
  #"custom ""four"""   #'custom ''five'''
  test2 "test3" 'test4'
"@

$embeddedStrings = Invoke-Expression @"
Write-Output $($inputString -replace '\r?\n', ' ' -replace '#', '`#')
"@
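
To see what Invoke-Expression is actually handed here, the generated command line can be built and inspected on its own first. This is only an illustrative sketch; the variable name $commandLine is an assumption and is not part of the solution:

```powershell
$inputString = @"
  test1
  #custom1
  #"custom2"           #'custom3'
  #"custom ""four"""   #'custom ''five'''
  test2 "test3" 'test4'
"@

# Same transformation as in the solution: newlines become spaces and
# every # is prefixed with a backtick so that it does not start a comment.
$commandLine = "Write-Output $($inputString -replace '\r?\n', ' ' -replace '#', '`#')"

# Print the single-line command string without executing it.
$commandLine
```

Because this string is executed as PowerShell code, anything else that ends up in the input (for example a `;` followed by another command) would be executed too, which is why the input must be trusted.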

Caveat: The outer quoting around the individual strings is lost in the process and the embedded, escaped quotes are unescaped; outputting $embeddedStrings yields:

test1
#custom1
#custom2
#custom3
#custom "four"
#custom 'five'
test2
test3
test4

The approach relies on the fact that your embedded strings use PowerShell's quoting and quote-escaping rules. The only problem is the leading # characters, which PowerShell would otherwise treat as the start of a comment; they are therefore escaped as `# above. Replacing the embedded newlines (\r?\n) with spaces turns the input into a single line that can be passed as a list of positional arguments to Write-Output. That command is placed inside a string and evaluated with Invoke-Expression, which makes Write-Output output the parsed arguments one by one; they are captured as an array in the variable $embeddedStrings.
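
If the outer quoting has to be preserved, as in the expected result in the question, a regex-based tokenizer may be an alternative that avoids Invoke-Expression. The following is a sketch, not a tested general solution: the pattern and the variable names $pattern and $tokens are assumptions, and it assumes that quotes only appear at the start of a word or directly after a leading #.

```powershell
$inputString = @"
  test1
  #custom1
  #"custom2"           #'custom3'
  #"custom ""four"""   #'custom ''five'''
  test2 "test3" 'test4'
"@

# Optional leading #, followed by one of:
#   a double-quoted run in which "" is an escaped quote,
#   a single-quoted run in which '' is an escaped quote,
#   or a bare word (a run of non-whitespace characters).
# A single-quoted here-string keeps the pattern free of quote escaping.
$pattern = @'
#?(?:"(?:""|[^"])*"|'(?:''|[^'])*'|\S+)
'@

# Each match is one token, with its quoting left untouched.
$tokens = @([regex]::Matches($inputString, $pattern) | ForEach-Object { $_.Value })
$tokens
```

With the sample input this should yield the nine strings listed as $result in the question, quotes included. Input with unbalanced quotes or with quotes in the middle of a bare word is not handled specially.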

mklement0
  • Those assumptions are fulfilled in the given scenario. – Carsten Oct 27 '21 at 15:36
  • As for down-voting questions, @Carsten: I think that some users frown at a perceived lack of effort on the part of the asker. To me, while an attempted solution as part of the question helps, it isn't a requirement, as long as the question has a clear description of the problem. – mklement0 Oct 30 '21 at 13:18
  • I agree, it's terrible to get a negative score at all for a question. I think the root cause is that for a regex you cannot really provide a sample code snippet here: either it works or it fails. – Carsten Dec 13 '21 at 08:24