
This question originates from the answer at https://stackoverflow.com/a/53750697/856090.

We receive an "input" string.

The input string is split into several "commands" at each + surrounded by whitespace, that is, by the regexp \s+\+\s+. However, a quoted + (\+) must not be treated as a separator.

Each command is then split into several "arguments" on whitespace characters, but quoted whitespace (whitespace preceded by \) does not split and instead becomes part of an argument.

A quoted \ (that is, \\) becomes a regular \ character and does not itself take part in quoting.

My current solution processes the input string character by character, with special handling for \, +, and whitespace characters. This is slow and inelegant. I am asking for an alternative solution (for example, one using regexps).
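For reference, the character-by-character approach described above might look roughly like the sketch below. It is not the code actually in use: the function name `split_pipeline_charwise` is hypothetical, and it assumes that a stand-alone unquoted + separates commands wherever it appears and that a lone trailing \ is kept literally.

```python
def split_pipeline_charwise(text):
    """Hypothetical char-by-char splitter for the rules described above."""
    commands = [[]]  # list of commands, each a list of arguments
    word = []        # (char, was_quoted) pairs of the argument being built

    def flush():
        # Finish the current argument, if there is one.
        if not word:
            return
        if word == [("+", False)]:
            commands.append([])  # stand-alone unquoted "+": start a new command
        else:
            commands[-1].append("".join(ch for ch, _ in word))
        word.clear()

    i = 0
    while i < len(text):
        ch = text[i]
        if ch == "\\" and i + 1 < len(text):
            # Quoted character: take the next character literally.
            word.append((text[i + 1], True))
            i += 2
            continue
        if ch.isspace():
            flush()  # unquoted whitespace ends the current argument
        else:
            word.append((ch, False))  # also reached by a lone trailing backslash
        i += 1
    flush()
    return commands


print(split_pipeline_charwise(r"a \+ b + c\ d"))
# [['a', '+', 'b'], ['c d']]
```

Keeping the quoted/unquoted flag next to each character is what lets \+ be told apart from a separator + after the backslash has been dropped.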

I write in Python 3.


For example,

filter1 + \
chain -t http://www.w3.org/1999/xhtml -n error + \
transformation filter2 --arg x=y

transformation filter3

becomes

[['filter1'],
 ['chain', '-t', 'http://www.w3.org/1999/xhtml', '-n', 'error'],
 ['transformation', 'filter2', '--arg', 'x=y']]

and

a \+ b + c\ d

becomes

 [['a', '+', 'b'], ['c d']]
porton
  • Please give an example of input and expected output (which you wish to get after splitting). – hygull Dec 13 '18 at 01:40
  • @hygull examples added – porton Dec 13 '18 at 01:51
  • If no one solves it, I will try to; I am on mobile right now. Got it. Thank you. – hygull Dec 13 '18 at 01:55
  • Finally, I solved your problem on mobile at rextester. Crazy exciting question. I am writing an answer now. Thank you. – hygull Dec 13 '18 at 02:30
  • What if you will get `[['a', '+', 'b'], ['c', 'd']]` in place of `[['a', '+', 'b'], ['c d']]`. Actually, both contains spaces for separation after 1st split operation, or we will need to pass extra parameters for this kind of operation. I have also tried to obtain 2nd result, I got but it failed for 1st, so I guessed that we may need extra parameters for that. So I think, if you wish or if my suggested O/P is okay then I will edit or I will try other methods to solve. Thank you. – hygull Dec 13 '18 at 03:15
  • @hygull I don't understand wording of your comment. If you mean that `[['a', '+', 'b'], ['c', 'd']]` in place of `[['a', '+', 'b'], ['c d']]` is OK, then no, it is not OK – porton Dec 13 '18 at 03:21
  • Okay, so let me look for more options, thanks for the update. – hygull Dec 13 '18 at 03:28
  • @porton, I have updated my code based on your provided input set, please check. Thank you. – hygull Dec 13 '18 at 19:19

2 Answers


Here is an answer to your problem.

The function get_splitted_strings_for() takes one parameter, the string s, splits it in two passes (first into commands, then into arguments), and returns the result as a 2D list.

import re

def get_splitted_strings_for(s):
    splits = []
    # Split into commands on " + ", also swallowing a "\" that directly
    # follows the "+" (a line continuation joined onto one line).
    splits1 = re.split(r"\s*\+\s+\\\s*|\s+\+\s+", s)

    for split in splits1:
        if r"\+" in split:
            # Quoted "+": drop the backslashes, then split on whitespace.
            split = split.replace("\\", "")
            splits.append(split.split())
        elif "\\" in split:
            # Any other backslash: keep the whole command as one argument.
            splits.append([split.replace("\\", "")])
        else:
            arr = re.split(r"\s+", split.replace("\\", ''))
            splits.append(arr)

    return splits

# Raw strings keep the backslashes literal and avoid invalid-escape warnings.
s = r"filter1 + \ chain -t http://www.w3.org/1999/xhtml -n error + \ transformation filter2 --arg x=y"
print(get_splitted_strings_for(s))

# [['filter1'], ['chain', '-t', 'http://www.w3.org/1999/xhtml', '-n', 'error'], ['transformation', 'filter2', '--arg', 'x=y']]

print()  # New line

s2 = r"a \+ b + c\ d"
print(get_splitted_strings_for(s2))
# [['a', '+', 'b'], ['c d']]
hygull
  • Wrong! It should be `[['a', '+', 'b'], ['c d']]` – porton Dec 13 '18 at 02:35
  • Okay sorry for that, let me fix that. Thank you. – hygull Dec 13 '18 at 02:37
  • I tried to get 2nd output in several ways but I got spaces in the substrings. So I suggest if `[['a', '+', 'b'], ['c', 'd']]` will help you then it will be better otherwise it introduces extra parameters in function's argument list which again needs more input from your side. Currently, I have updated my answer for this only. Thank you. – hygull Dec 13 '18 at 03:26
  • I do not understand you: "help you then it will be better otherwise it introduces extra parameters". What does that mean? – porton Dec 13 '18 at 03:33
  • I think that is not required, let me try in other ways. Thank you. – hygull Dec 13 '18 at 04:13
  • Based on your provided input set, I have re-tried to look and updated my code. Now it works for both of your input set. Please check. Thank you. – hygull Dec 13 '18 at 19:17
  • Your code parses `r'a\\ \+ b + c\ d'` wrongly. It should output `[['a\\', '+', 'b'], ['c d']]` – porton Dec 14 '18 at 07:13

I wrote my own version of the routine:

import re


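def split_pipeline(s):
    res = [['']]  # list of commands; each command is a list of arguments
    # Token alternatives, tried in order: \\, \+, \<whitespace>,
    # a "+" surrounded by whitespace, other whitespace, ordinary text.
    r = r'\\\\|\\\+|\\\s|\s+\+\s+|\s+|[^\s\\]+'
    for m in re.finditer(r, s, re.M|re.S):
        if m[0][0] == '\\':
            # Quoted character: drop the backslash, keep the character.
            res[-1][-1] += m[0][1:]
        elif re.match(r'^\s+\+\s+$', m[0], re.M|re.S):
            res.append([''])  # unquoted "+": start a new command
        elif re.match(r'^\s+$', m[0], re.M | re.S):
            res[-1].append('')  # unquoted whitespace: start a new argument
        else:
            res[-1][-1] += m[0]  # ordinary text extends the current argument
    return res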

print(split_pipeline(r'a\\ \+  b + c\ d'))
# [['a\\', '+', 'b'], ['c d']]
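A possible variant of the same tokenizer idea is sketched below, using named groups and `Match.lastgroup` so that the matched text does not have to be re-tested with a second regexp. The names `TOKEN` and `split_pipeline_named` and the group names are hypothetical. It also assumes that leading or trailing whitespace should not produce empty arguments, which differs from the routine above (there, `'  a'` yields `[['', 'a']]`).

```python
import re

# One alternative per token kind; the name of the group that matched
# tells us which kind of token was found.
TOKEN = re.compile(r"""
    \\(?P<escaped>.)        # backslash plus any character: take it literally
  | (?P<plus>\s+\+\s+)      # plus sign surrounded by whitespace: command separator
  | (?P<space>\s+)          # other whitespace: argument separator
  | (?P<text>[^\s\\]+)      # run of ordinary characters
""", re.VERBOSE | re.DOTALL)


def split_pipeline_named(s):
    commands = [[]]
    current = None  # argument being built, or None between arguments
    for m in TOKEN.finditer(s):
        kind = m.lastgroup
        if kind in ("plus", "space"):
            # A separator closes the argument in progress, if any.
            if current is not None:
                commands[-1].append(current)
                current = None
            if kind == "plus":
                commands.append([])
        else:
            # "escaped" or "text": extend the current argument.
            current = (current or "") + m.group(kind)
    if current is not None:
        commands[-1].append(current)
    return commands


print(split_pipeline_named(r'a\\ \+  b + c\ d'))
# [['a\\', '+', 'b'], ['c d']]
```

As in the routine above, a lone backslash at the very end of the input matches no alternative and is silently skipped by `finditer`.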
porton