I ultimately want to split a string on a certain character. I tried a regex, but it appeared to start escaping `\`, so I want to avoid that with another approach (all my attempts at unescaping the string failed). So I want to get all positions of a character `char` in a string that are not within quotes, so I can split the string accordingly.

For example, given the phrase `hello-world:la\test`, I want to get back 11 if `char` is `:`, as that is the only `:` in the string and it is at index 11. `re` does split it, but I get `['hello-world,lat\\test']`.

EDIT: @BoarGules made me realize that re didn't actually change anything, but it's just how Python displays slashes.
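
To illustrate the point in the edit, here is a minimal sketch (the variable names `s` and `parts` are only for illustration). It splits the example string and compares how the list display and `print` show the same single backslash:

```python
import re

# Raw string literal: the backslash is one literal character, not an escape.
s = r'hello-world:la\test'

parts = re.split(':', s)

# Printing a list shows each element's repr, which doubles the backslash.
print(parts)

# Printing the string itself shows the single backslash that is really there.
print(parts[1])

# The length confirms it: l, a, \, t, e, s, t is 7 characters.
print(len(parts[1]))
```

The doubled backslash only appears in the repr; the data holds a single backslash.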

DrownedSuccess
  • Please post a [MCVE] of your problem. We can likely help with the regex, but it's a lot easier to fix a problem with a [MCVE] than solve your problem from scratch with a fairly vague problem description. – ShadowRanger Apr 07 '22 at 13:43
  • https://stackoverflow.com/questions/3475251/split-a-string-by-a-delimiter-in-python or https://stackoverflow.com/questions/37484624/split-string-at-delimiter-in-python or https://stackoverflow.com/questions/67032664/python-split-string-without-losing-split-character probably answers your question. – Marijn Apr 07 '22 at 13:43
  • @ShadowRanger Added one. – DrownedSuccess Apr 07 '22 at 13:46
  • @DrownedSuccess: You added an example input and output, but not the code you tried. Please provide that non-working code, as text, in the body of the question, and we can try to help you with it. – ShadowRanger Apr 07 '22 at 14:06
  • Also, side-note: Are you by any chance trying to parse lines from a pseudo-CSV format (using `:` as the field delimiter instead of `,`)? If so, don't reinvent the wheel, just use the `csv` module (it can customize the delimiter or the whole dialect as needed for just about any text format with arbitrary delimiters and quoting rules). – ShadowRanger Apr 07 '22 at 14:42
  • 1
    You are mistaken if you believe that `['hello-world,lat\\test']` is not correct, it is because you think that the \\ that you see is in the data you get back. It isn't. That is simply the visual representation of the single backslash that is really there. – BoarGules Apr 07 '22 at 14:52
  • @BoarGules This. This was actually my main problem, and my original solution worked perfectly. – DrownedSuccess Apr 07 '22 at 16:13

3 Answers

0

Here's a function that works:

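import re

def split_by_char(string, char=':'):
    # Match runs of non-delimiter, non-quote characters or whole quoted strings; re.escape keeps char literal.
    PATTERN = re.compile(rf'''((?:[^{re.escape(char)}"']|"[^"]*"|'[^']*')+)''')
    return [string[m.span()[0]:m.span()[1]] for m in PATTERN.finditer(string)]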
DrownedSuccess
  • Two notes: 1) There's no real benefit to precompiling if you have to do it every time (you could just invoke `re.finditer(stringpat, string)`). 2) That listcomp is a really elaborate (read: verbose and inefficient) way to get the exact same result as just `return PATTERN.findall(string)`, or, if you really want `finditer`, `return [m[0] for m in PATTERN.finditer(string)]`. – ShadowRanger Apr 07 '22 at 14:11
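
A sketch of what the two notes in the comment above might look like when applied together, assuming the same pattern as the answer. The use of `re.escape` is an addition here, so that delimiters such as `]` or `^` stay literal inside the character class:

```python
import re

def split_by_char(string, char=':'):
    # Each match is either a run of characters that are neither the
    # delimiter nor a quote, or a complete quoted string (quotes kept).
    pattern = rf'''(?:[^{re.escape(char)}"']|"[^"]*"|'[^']*')+'''
    # With no capturing group, findall returns the whole of each match.
    return re.findall(pattern, string)

print(split_by_char(r'hello-world:la\test'))
print(split_by_char('a:"b:c":d'))
```

Note that this approach keeps the quote characters in the result and silently drops empty fields (`'a::b'` yields two pieces, not three), which may or may not be what you want.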
0
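string = r'hello-world:la\test'  # raw string, so \t stays a backslash plus t instead of a tab

char = ':'

print(string.find(char))

Prints

11

char_index = string.find(char)

string[:char_index]

Returns

'hello-world'

string[char_index+1:]

Returns

'la\\test'

(The doubled backslash is only how the repr displays the single backslash in the string.)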
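
`str.find` only reports the first occurrence and does not know about quotes. If you need every position of the character that lies outside quotes, as the question describes, a small scanner along these lines might do. This is a sketch: the function name `positions_outside_quotes` is hypothetical, and it assumes both `"` and `'` act as quote characters with no escaping inside them.

```python
def positions_outside_quotes(string, char=':'):
    positions = []
    open_quote = None  # the quote character we are currently inside, if any
    for i, c in enumerate(string):
        if open_quote:
            # Inside quotes: only the matching quote character ends the span.
            if c == open_quote:
                open_quote = None
        elif c in '"\'':
            open_quote = c
        elif c == char:
            positions.append(i)
    return positions

print(positions_outside_quotes(r'hello-world:la\test'))
print(positions_outside_quotes('hello-world":"la'))
```

The returned indices can then be used with slicing, as above, to cut the string into pieces.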
gremur
  • While the example they gave is poor, from the description, I think the OP needed it to *not* find the character in question if it was found inside internal quotes, thus the need for a regex. So `'hello-world:la\test'` should split, but `'hello-world":"la\test'` should not. – ShadowRanger Apr 07 '22 at 14:12
0

Solution for the case you're likely encountering (a pseudo-CSV format you're hand-rolling a parser for; even if that's not your situation, it's a likely one for people who find this question later):

Just use the csv module.

import csv
import io

test_strings = ['field1:field2:field3', 'field1:"field2:with:embedded:colons":field3']

for s in test_strings:
    for row in csv.reader(io.StringIO(s), delimiter=':'):
        print(row)

which outputs:

['field1', 'field2', 'field3']
['field1', 'field2:with:embedded:colons', 'field3']

correctly ignoring the colons within the quoted field, requiring no kludgy, hard-to-verify hand-written regexes.
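
For a single string, such as the one in the question, the same idea can be wrapped in a small helper. This is a sketch: `split_fields` is a hypothetical name, and it relies on `csv.reader` accepting any iterable of strings, so a one-element list works in place of `io.StringIO`.

```python
import csv

def split_fields(line, delimiter=':'):
    # csv.reader yields one row per input line; we pass exactly one line.
    return next(csv.reader([line], delimiter=delimiter))

print(split_fields(r'hello-world:la\test'))
print(split_fields('field1:"field2:with:embedded:colons":field3'))
```

With the default dialect only `"` is treated as a quote character, the quotes themselves are removed from the field, and backslashes are left untouched. Pass `quotechar` if your format quotes with a different character.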

ShadowRanger