Python: Split between two characters

Question

Let's say I have a ton of HTML with no newlines. I want to get each element into a list.

input = "<head><title>Example Title</title></head>"

a_list = ["<head>", "<title>Example Title</title>", "</head>"]

Something like that: splitting between each `>` and `<`.

But in Python, I don't know of a way to do that. I can only split at that string, which removes it from the output. I want to keep it, and split between the two angle brackets.

How can this be done?

Edit: Preferably, this would be done without adding the characters back in to the ends of each list item.

@Carcigenicate BS4 isn't an option. This was an example not what I'm actually doing. It wasn't the question, the question is in the title. I need to split between two characters, I don't care about the example HTML. it does consistently show the split between the adjacent `><` characters, and that's what I was going for. — Jacob Birkett, Aug 15 '17 at 19:32
@spikespaz See the second answer of https://stackoverflow.com/questions/7866128/python-split-without-removing-the-delimiter. — Carcigenicate, Aug 15 '17 at 19:35

score 4 · Answer 1 · answered Aug 15 '17 at 19:45

# initial input
a = "<head><title>Example Title</title></head>"

# split list
b = a.split('><')

# remove extra character from first and last elements
# because the split only removes >< pairs.
b[0] = b[0][1:]
b[-1] = b[-1][:-1]

# rebuild each element with its surrounding angle brackets
a_list = ['<{}>'.format(item) for item in b]

This builds the list from the question in a_list. Tested on Python 2.7.2, but it should work on Python 3 as well.

score 3 · Answer 2 · answered Aug 15 '17 at 19:38

You can try this:

import re
a = "<head><title>Example Title</title></head>"

data = re.split("><", a)

new_data = [data[0] + ">"] + ["<" + i + ">" for i in data[1:-1]] + ["<" + data[-1]]

Output:

['<head>', '<title>Example Title</title>', '</head>']

Ajax1234

score 2 · Answer 3 · answered Aug 15 '17 at 20:16

The shortest approach, using the re.findall() function on an extended example:

import re

# extended html string
s = "<head><title>Example Title</title></head><body>hello, <b>Python</b></body>"
# match either a tag pair wrapping plain text, or a single tag
result = re.findall(r'(<[^>]+>[^<>]+</[^>]+>|<[^>]+>)', s)
print(result)

The output:

['<head>', '<title>Example Title</title>', '</head>', '<body>', '<b>Python</b>', '</body>']
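
Note that this pattern only returns tags and tag-wrapped text, so the bare text `hello, ` that follows `<body>` is missing from the output above. A hedged variant, assuming loose text runs should also be kept as their own items, adds a third alternative to the pattern (the variable name `pattern` is just for illustration):

```python
import re

s = "<head><title>Example Title</title></head><body>hello, <b>Python</b></body>"

# Alternatives, tried in this order at each position:
#   1. an opening tag, plain text, and a closing tag, e.g. <b>Python</b>
#   2. a single tag
#   3. a run of text that sits outside any tag
pattern = r'<[^>]+>[^<>]+</[^>]+>|<[^>]+>|[^<>]+'
result = re.findall(pattern, s)
print(result)
# ['<head>', '<title>Example Title</title>', '</head>', '<body>', 'hello, ', '<b>Python</b>', '</body>']
```

Like the original pattern, this is only a sketch for simple strings and is not a general HTML parser.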

score 1 · Accepted Answer · answered Aug 15 '17 at 21:48

Based on the answers by other people, I made this.

It isn't as clean as I had wanted, but it seems to work. I had originally wanted to not re-add the characters after split.

Here, I got rid of one extra argument by combining the two characters into a string. Anyways,

def split_between(string, chars):
    # compare by value with !=; "is not" tests identity and is unreliable for ints
    if len(chars) != 2:
        raise ValueError("Argument chars must contain two characters.")

    result_list = [chars[1] + line + chars[0] for line in string.split(chars)]

    result_list[0] = result_list[0][1:]
    result_list[-1] = result_list[-1][:-1]

    return result_list

Credit goes to @cforeman and @Ajax1234.
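
On the original wish to avoid re-adding the characters: since Python 3.7, re.split() can split on zero-width matches, so a lookbehind/lookahead pair can mark the gap between the two characters without consuming either of them. A minimal sketch, assuming Python 3.7 or newer (the name `split_between_re` is hypothetical and not taken from any answer here):

```python
import re

def split_between_re(string, chars):
    """Split string at the gap between chars[0] and chars[1], keeping both."""
    if len(chars) != 2:
        raise ValueError("Argument chars must contain two characters.")
    # (?<=X) requires X just before the split point, (?=Y) requires Y just after.
    # Neither assertion consumes text, so nothing has to be added back afterwards.
    pattern = "(?<={})(?={})".format(re.escape(chars[0]), re.escape(chars[1]))
    return re.split(pattern, string)

print(split_between_re("<head><title>Example Title</title></head>", "><"))
# ['<head>', '<title>Example Title</title>', '</head>']
```

Versions before 3.7 do not split on empty matches, so there the re-adding approach above is still needed.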

score 0 · Answer 5 · answered Aug 15 '17 at 19:59

Or even simpler, this:

input = "<head><title>Example Title</title></head>"
print(['<'+elem if elem[0]!='<' else elem for elem in [elem+'>' if elem[-1]!='>' else elem for elem in input.split('><') ]])
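
One caveat: the comprehension indexes elem[0] and elem[-1], so it raises IndexError on an empty piece, for example when the input starts with `><`. A sketch of the same goal without re-adding any characters, using only str.find() and slicing so it does not depend on the Python version (the function name `split_between_slices` is made up for illustration):

```python
def split_between_slices(text, pair):
    """Cut text between the two characters of pair, e.g. '><'."""
    parts = []
    start = 0
    while True:
        idx = text.find(pair, start)
        if idx == -1:
            # No more pairs: the rest of the text is the last item.
            parts.append(text[start:])
            return parts
        # Cut right after the first character of the pair, so both are kept.
        parts.append(text[start:idx + 1])
        start = idx + 1

html = "<head><title>Example Title</title></head>"
print(split_between_slices(html, "><"))
# ['<head>', '<title>Example Title</title>', '</head>']
```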

whackamadoodle3000
