Python split() without removing the delimiter

Question

This code almost does what I need it to:

for line in all_lines:
    s = line.split('>')

Except it removes all the '>' delimiters.

So,

<html><head>

Turns into

['<html', '<head', '']

Is there a way to use the split() method but keep the delimiter, instead of removing it?

With these results:

['<html>','<head>']

This doesn't really answer your question, but if you're trying to parse HTML in Python, I highly recommend [Beautiful Soup](http://www.crummy.com/software/BeautifulSoup/). — Michael Mior, Oct 23 '11 at 12:33
See also [In Python, how do I split a string and keep the separators?](http://stackoverflow.com/questions/2136556/in-python-how-do-i-split-a-string-and-keep-the-separators). — outis, Oct 23 '11 at 12:44
This question should be reopened. The duplicate one is regex-specific. — orestisf, Apr 26 '20 at 05:06
@orestisf Also, the "duplicate" one answers a different problem. `['', '', '']` is different from `['', '']`. I know it's been a few months but I just voted to reopen. If you do too, someone else may take it over the finish line? — user1717828, Oct 26 '20 at 00:37
re.split(r"(?<=>(?!$))", '') directly gives the answer. This way it can be handled by playing with regex look-arounds — dgor, Dec 31 '20 at 07:46
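
A minimal sketch of the look-around idea from the comment above, assuming Python 3.7 or later (where `re.split` can split on a pattern that matches an empty string). The helper name `split_after` is made up for illustration, and the pattern is a variant of the one in the comment, not a copy of it:

```python
import re

def split_after(text, delimiter=">"):
    # (?<=...) matches the empty position right after a delimiter;
    # (?!\Z) skips the very end of the string, so no trailing '' is produced.
    pattern = r"(?<=" + re.escape(delimiter) + r")(?!\Z)"
    return re.split(pattern, text)

print(split_after("<html><head>"))  # ['<html>', '<head>']
print(split_after("a>>b"))          # ['a>', '>', 'b']
```

Because the split happens on empty positions only, no characters are consumed and the pieces join back to the original string.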

score 78 · Accepted Answer · edited Feb 15 '17 at 04:02


d = ">"
for line in all_lines:
    # Split on the delimiter, drop empty pieces, then re-append the delimiter to each piece.
    s = [e + d for e in line.split(d) if e]

edited Feb 15 '17 at 04:02

Wes Modes


answered Oct 23 '11 at 12:38

P.Melch


That works perfectly... but I don't fully understand what's going on. – some1 Oct 23 '11 at 12:43
@some1 it basically iterates over the results of the split and adds the delimiter back in. "s is a list, where each element in that list is e + d, where e are the elements in the result of line.split(d), but only if e isn't empty" – JHixson Jun 26 '14 at 17:04
This adds a delimiter to all elements of the resulting list, including a single-element list with no delimiter... What if you _only_ wanted the delimiter appended to the first of the split elements? – The Pied Pipes Jun 29 '14 at 00:01
Very old post but for the record: `if e` is enough, `!=""` can be omitted. – mikuszefski Feb 01 '17 at 14:37
this is sloppy. what if the string is "a.b." or ".a.b." and split on "." – thang Sep 19 '17 at 17:28
And what if you're using regex to split the string? – J-Cake Apr 20 '18 at 13:27
@kasheemlew do you mean what if d wasn't ">" but [">", "<"]? So instead of a string it was a list of strings? – P.Melch Oct 26 '19 at 08:54
Fails with `"a..b"` as well – orestisf Apr 25 '20 at 17:26
I've posted an answer here https://stackoverflow.com/a/61436083/3430986 that should cover all edge cases – orestisf Apr 26 '20 at 05:09
This will add a delimiter at the end of the string even if it doesn't exist at the end of the original string. – Vijayendra May 05 '21 at 11:29
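
The edge cases raised in the comments above can be checked with a short sketch. The function name `accepted_split` and the sample strings are chosen here for illustration; the comprehension itself is the one from the answer:

```python
def accepted_split(text, d):
    # Same comprehension as in the answer, wrapped in a function.
    return [e + d for e in text.split(d) if e]

for sample in ["<html><head>", "a.b.", ".a.b.", "a..b", "abc"]:
    d = ">" if ">" in sample else "."
    pieces = accepted_split(sample, d)
    # A lossless split should join back to the original string.
    print(repr(sample), pieces, "".join(pieces) == sample)

# '<html><head>' ['<html>', '<head>'] True
# 'a.b.' ['a.', 'b.'] True
# '.a.b.' ['a.', 'b.'] False
# 'a..b' ['a.', 'b.'] False
# 'abc' ['abc.'] False
```

The comprehension round-trips only when every piece is non-empty and the string ends with the delimiter, which happens to hold for the input in the question.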

score 38 · Answer 2 · answered Oct 23 '11 at 14:54


If you are parsing HTML with splits, you are most likely doing it wrong, except if you are writing a one-shot script aimed at a fixed and secure content file. If it is supposed to work on any HTML input, how will you handle something like <a title='growth > 8%' href='#something'>?

Anyway, the following works for me:

>>> import re
>>> re.split('(<[^>]*>)', '<body><table><tr><td>')[1::2]
['<body>', '<table>', '<tr>', '<td>']

answered Oct 23 '11 at 14:54

gb.


If you are not sure whether the string in question will end with the delimiter in question, it looks like you can do: `re.split("(.*\n?)", "my\nstr\ning")[1::2]` – Seth Robertson Oct 11 '18 at 17:34
If you want to parse HTML, go to https://automatetheboringstuff.com/2e/chapter12/ and read this chapter. It has everything you need to know about parsing HTML and web scraping. If this link ever breaks, look into using the requests, beautifulsoup, and selenium libraries. – zicameau Feb 11 '22 at 23:39
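
To illustrate the caveat about arbitrary HTML, here is a small sketch that runs the regex from this answer on the problematic tag and compares it with the standard library's `html.parser`. The class name `StartTagCollector` is a hypothetical helper, and this is one possible approach, not the only one:

```python
import re
from html.parser import HTMLParser

class StartTagCollector(HTMLParser):
    """Hypothetical helper: records each start tag with its attributes."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append((tag, dict(attrs)))

html = "<a title='growth > 8%' href='#something'>"

# The regex stops at the first '>', which sits inside the quoted attribute.
regex_tags = re.split('(<[^>]*>)', html)[1::2]
print(regex_tags)  # ["<a title='growth >"]

# The parser understands quoting, so the tag comes through whole.
collector = StartTagCollector()
collector.feed(html)
collector.close()
print(collector.tags)  # [('a', {'title': 'growth > 8%', 'href': '#something'})]
```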

score 21 · Answer 3 · answered Oct 23 '11 at 12:45


How about this:

import re
s = '<html><head>'
# Each match is a run of non-'>' characters followed by one '>'.
re.findall('[^>]+>', s)  # ['<html>', '<head>']

answered Oct 23 '11 at 12:45

Óscar López
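
One thing to be aware of with the `findall` pattern above: it silently drops a lone delimiter and any text after the last delimiter. A possible variant that keeps every character is sketched below; the name `findall_keep` is invented for this example, and it assumes a single-character delimiter because the delimiter is placed inside a character class:

```python
import re

def findall_keep(text, delimiter=">"):
    d = re.escape(delimiter)
    # First alternative: a possibly empty run of non-delimiters plus the delimiter.
    # Second alternative: leftover text at the very end with no delimiter after it.
    pattern = rf"[^{d}]*{d}|[^{d}]+\Z"
    return re.findall(pattern, text)

print(re.findall('[^>]+>', '<a>>b>c'))  # ['<a>', 'b>']
print(findall_keep('<a>>b>c'))          # ['<a>', '>', 'b>', 'c']
print(findall_keep('<html><head>'))     # ['<html>', '<head>']
```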


score 1 · Answer 4 · answered Oct 23 '11 at 12:33


Just split it, then for each element in the array/list (apart from the last one) add a trailing ">" to it.

answered Oct 23 '11 at 12:33

orangething


What about the case of ">>" it would just become ">" – paulm Mar 21 '16 at 09:06
@paulm no, because splitting two `>`s like in `">>body".split('>')` creates an empty element in the middle (`['', '', 'body']`), so each `>` is restored. If you instead want two `>`s to result in just a single `>` after processing, you could first remove those empty strings. – yyny Sep 28 '18 at 08:25
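
The approach described in this answer (split, then re-append the delimiter to every element except the last) might look like the sketch below. The function name `split_keep` is made up here, and dropping an empty final piece is a design choice of this sketch rather than something the answer specifies:

```python
def split_keep(text, delimiter):
    """Split text after each delimiter, keeping the delimiter on the piece it ends."""
    pieces = text.split(delimiter)
    # Every piece except the last was followed by a delimiter in the original text.
    result = [piece + delimiter for piece in pieces[:-1]]
    # The last piece has no delimiter after it; keep it only if it is non-empty.
    if pieces[-1]:
        result.append(pieces[-1])
    return result

print(split_keep('<html><head>', '>'))  # ['<html>', '<head>']
print(split_keep('a..b', '.'))          # ['a.', '.', 'b']
print(split_keep('.a.b.', '.'))         # ['.', 'a.', 'b.']
print(split_keep('abc', '.'))           # ['abc']
```

Because empty pieces in the middle are kept rather than filtered out, `">>"` stays as two separate `'>'` elements, and joining the result always gives back the original string.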
