130

This code almost does what I need it to:

for line in all_lines:
    s = line.split('>')

Except it removes all the '>' delimiters.

So,

<html><head>

Turns into

['<html','<head']

Is there a way to use the split() method but keep the delimiter, instead of removing it?

I'd like to get this result instead:

['<html>','<head>']
some1
  • 22
    This doesn't really answer your question, but if you're trying to parse HTML in Python, I highly recommend [Beautiful Soup](http://www.crummy.com/software/BeautifulSoup/). – Michael Mior Oct 23 '11 at 12:33
  • 2
    See also [In Python, how do I split a string and keep the separators?](http://stackoverflow.com/questions/2136556/in-python-how-do-i-split-a-string-and-keep-the-separators). – outis Oct 23 '11 at 12:44
  • 10
    This question should be reopened. The duplicate one is regex-specific. – orestisf Apr 26 '20 at 05:06
  • 2
    @orestisf Also, the "duplicate" one answers a different problem. `['', '', '']` is different from `['', '']`. I know it's been a few months but I just voted to reopen. If you do too, someone else may take it over the finish line? – user1717828 Oct 26 '20 at 00:37
  • 1
    `re.split(r"(?<=>(?!$))", '<html><head>')` directly gives the answer. This way it can be handled by playing with regex look-arounds. – dgor Dec 31 '20 at 07:46

4 Answers

78
d = ">"
for line in all_lines:
    s = [e + d for e in line.split(d) if e]
Wes Modes
P.Melch
  • 12
    That works perfectly... but I don't fully understand what's going on. – some1 Oct 23 '11 at 12:43
  • 7
    @some1 it basically iterates over the results of the split and adds the delimiter back in. "s is a list, where each element in that list is e + d, where e are the elements in the result of line.split(d), but only if e isn't empty" – JHixson Jun 26 '14 at 17:04
  • 14
    This adds a delimiter to all elements of the resulting list, including a single-element list with no delimiter... What if you _only_ wanted the delimiter appended to the first of the split elements? – The Pied Pipes Jun 29 '14 at 00:01
  • Very old post but for the record: `if e` is enough, `!=""` can be omitted. – mikuszefski Feb 01 '17 at 14:37
  • 19
    this is sloppy. what if the string is "a.b." or ".a.b." and split on "." – thang Sep 19 '17 at 17:28
  • And what if you're using regex to split the string? – J-Cake Apr 20 '18 at 13:27
  • @kasheemlew Do you mean: what if d wasn't ">" but [">", "<"]? So instead of a string it was a list of strings? – P.Melch Oct 26 '19 at 08:54
  • Fails with `"a..b"` as well – orestisf Apr 25 '20 at 17:26
  • I've posted an answer here https://stackoverflow.com/a/61436083/3430986 that should cover all edge cases – orestisf Apr 26 '20 at 05:09
  • 3
    This will add a delimiter at the end of the string even if it doesn't exist at the end of the original string. – Vijayendra May 05 '21 at 11:29
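The edge cases raised in these comments can be reproduced with a short check. This is only a sketch: `split_append` is a hypothetical name that wraps the list comprehension from the answer so it can be called on different inputs.

```python
def split_append(line, d):
    """The comprehension from the answer, wrapped in a function (hypothetical name)."""
    return [e + d for e in line.split(d) if e]

# The input from the question works as intended.
print(split_append('<html><head>', '>'))   # ['<html>', '<head>']

# No trailing delimiter in the input: one is appended anyway.
print(split_append('a.b', '.'))            # ['a.', 'b.']

# Leading and repeated delimiters are silently dropped.
print(split_append('.a.b.', '.'))          # ['a.', 'b.']
print(split_append('a..b', '.'))           # ['a.', 'b.']
```

In the last three cases, joining the pieces no longer gives back the original string, which is the behaviour the commenters object to.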
38

If you are parsing HTML with splits, you are most likely doing it wrong, except if you are writing a one-shot script aimed at a fixed and secure content file. If it is supposed to work on any HTML input, how will you handle something like <a title='growth > 8%' href='#something'>?

Anyway, the following works for me:

>>> import re
>>> re.split('(<[^>]*>)', '<body><table><tr><td>')[1::2]
['<body>', '<table>', '<tr>', '<td>']
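As a sketch of why the `[1::2]` slice works: when the pattern passed to `re.split` contains a capturing group, the matched delimiters are returned in the list too. Here the "delimiter" is a whole tag, so the text between tags lands at the even indices and the tags at the odd ones. The variable names below are illustrative only.

```python
import re

parts = re.split('(<[^>]*>)', '<body><table><tr><td>')
print(parts)
# ['', '<body>', '', '<table>', '', '<tr>', '', '<td>', '']

# Odd indices hold the captured tags, even indices the text between them.
tags = parts[1::2]
between = parts[0::2]
print(tags)     # ['<body>', '<table>', '<tr>', '<td>']
print(between)  # ['', '', '', '', '']
```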
gb.
  • If you are not sure whether the string in question will end with the delimiter, it looks like you can do: `re.split("(.*\n?)", "my\nstr\ning")[1::2]` – Seth Robertson Oct 11 '18 at 17:34
  • If you want to parse HTML, read https://automatetheboringstuff.com/2e/chapter12/. That chapter has everything you need to know about parsing HTML and web scraping. If the link ever breaks, look into the requests, beautifulsoup, and selenium libraries. – zicameau Feb 11 '22 at 23:39
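For the `growth > 8%` case mentioned in this answer, a real parser avoids the problem. The following is a minimal sketch using the standard library's `html.parser` rather than any of the third-party libraries named above; `TagCollector` and `start_tags` are hypothetical names, and only start tags are collected.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Collects the raw source text of every start tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # get_starttag_text() returns the text of the most recently opened start tag.
        self.tags.append(self.get_starttag_text())


def start_tags(markup):
    collector = TagCollector()
    collector.feed(markup)
    collector.close()
    return collector.tags


print(start_tags('<html><head>'))
print(start_tags("<a title='growth > 8%' href='#something'>"))
```

Because the parser understands quoted attribute values, the `>` inside `title` should not end the tag, which a plain split on `>` cannot guarantee.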
21

How about this:

import re
s = '<html><head>'
re.findall('[^>]+>', s)
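One thing to be aware of with this pattern: `[^>]+>` only matches text that is followed by a `>`, so trailing text with no delimiter is dropped, and so is a `>` with nothing in front of it. A hedged variant, not part of the original answer, adds a second alternative for the leftover text and allows an empty prefix:

```python
import re

s = '<html><head>'
print(re.findall('[^>]+>', s))         # ['<html>', '<head>']

# Trailing text and a bare '>' are lost with the original pattern.
print(re.findall('[^>]+>', 'a>>b'))    # ['a>']

# Variant (an assumption, not the answerer's code): keep every character.
keep_all = '[^>]*>|[^>]+'
print(re.findall(keep_all, 'a>>b'))    # ['a>', '>', 'b']
print(re.findall(keep_all, s))         # ['<html>', '<head>']
```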
Óscar López
1

Just split it, then for each element in the array/list (apart from the last one) add a trailing ">" to it.
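A minimal sketch of that idea, assuming a single string delimiter; `split_keep` is a hypothetical name. Every piece except the last was followed by the delimiter in the original string, so only those pieces get it back.

```python
def split_keep(line, d):
    """Split `line` on `d`, keeping `d` at the end of each piece it terminated."""
    parts = line.split(d)
    # Every piece except the last was followed by the delimiter in the original.
    result = [p + d for p in parts[:-1]]
    # The last piece is whatever came after the final delimiter; skip it if empty.
    if parts[-1]:
        result.append(parts[-1])
    return result


print(split_keep('<html><head>', '>'))  # ['<html>', '<head>']
print(split_keep('a.b', '.'))           # ['a.', 'b']
print(split_keep('a>>b', '>'))          # ['a>', '>', 'b']
```

With this approach, joining the returned pieces always reproduces the input, including for repeated or leading delimiters.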

orangething
  • 3
    What about the case of ">>" it would just become ">" – paulm Mar 21 '16 at 09:06
  • @paulm No, because splitting a string that contains two consecutive `>`s creates an empty element in the middle, so each `>` still gets added back. If you want two `>`s to result in just a single `>` after processing, you could first remove those empty strings. – yyny Sep 28 '18 at 08:25