1

Let's say I have a ton of HTML with no newlines. I want to get each element into a list.

input = "<head><title>Example Title</title></head>"

a_list = ["<head>", "<title>Example Title</title>", "</head>"]

Something like this, splitting between each `>` and `<`.

In Python, I don't know of a way to do that. I can only split on that string, which removes it from the output. I want to keep both characters and split between the two comparison operators (the `>` and the `<`).

How can this be done?

Edit: Preferably, this would be done without adding the characters back in to the ends of each list item.

Ajax1234
Jacob Birkett
  • Please post your desired output from `a_list`. – Ajax1234 Aug 15 '17 at 19:30
  • 1
    @Carcigenicate BS4 isn't an option. This was an example, not what I'm actually doing. It wasn't the question; the question is in the title. I need to split between two characters, and I don't care about the example HTML. It does consistently show the split between the adjacent `><` characters, and that's what I was going for. – Jacob Birkett Aug 15 '17 at 19:32
  • @Ajax1234 The example list is the output I need. – Jacob Birkett Aug 15 '17 at 19:34
  • @spikespaz See the second answer of https://stackoverflow.com/questions/7866128/python-split-without-removing-the-delimiter. – Carcigenicate Aug 15 '17 at 19:35

5 Answers

4
# initial input
a = "<head><title>Example Title</title></head>"

# split list
b = a.split('><')

# remove extra character from first and last elements
# because the split only removes >< pairs.
b[0] = b[0][1:]
b[-1] = b[-1][:-1]

# initialize new list
a_list = []

# fill new list with formatted elements
for piece in b:
    a_list.append('<{}>'.format(piece))

This produces the requested list in Python 2.7.2, and it should work in Python 3 as well.
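To make the trimming step easier to follow, here is a small trace of the intermediate values for the same input. It is only a sketch of the steps above; the name `pieces` is introduced here for illustration and is not from the answer.

```python
a = "<head><title>Example Title</title></head>"

# str.split consumes the whole '><' delimiter, so inner pieces lose both brackets
pieces = a.split('><')
print(pieces)  # ['<head', 'title>Example Title</title', '/head>']

# only the outermost '<' and '>' survive the split; strip them so every piece is bare
pieces[0] = pieces[0][1:]
pieces[-1] = pieces[-1][:-1]
print(pieces)  # ['head', 'title>Example Title</title', '/head']

# now every piece can be wrapped in the same way
a_list = ['<{}>'.format(p) for p in pieces]
print(a_list)  # ['<head>', '<title>Example Title</title>', '</head>']
```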

cforeman
3

You can try this:

import re
a = "<head><title>Example Title</title></head>"

data = re.split("><", a)

new_data = [data[0]+">"]+["<" + i+">" for i in data[1:-1]] + ["<"+data[-1]]

Output:

['<head>', '<title>Example Title</title>', '</head>']
Ajax1234
2

The shortest approach uses the re.findall() function, shown here on an extended example:

import re

# extended html string
s = "<head><title>Example Title</title></head><body>hello, <b>Python</b></body>"
result = re.findall(r'(<[^>]+>[^<>]+</[^>]+>|<[^>]+>)', s)
print(result)

The output:

['<head>', '<title>Example Title</title>', '</head>', '<body>', '<b>Python</b>', '</body>']
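One thing worth checking with this pattern: text that sits directly between two different tags, such as `hello, ` in the extended string, matches neither alternative and is left out of the result. A small check, reusing the same string and pattern from this answer (the names `pattern` and `rebuilt` are added here for illustration), makes that visible:

```python
import re

s = "<head><title>Example Title</title></head><body>hello, <b>Python</b></body>"
pattern = r'(<[^>]+>[^<>]+</[^>]+>|<[^>]+>)'
result = re.findall(pattern, s)

# If nothing were dropped, joining the pieces would rebuild the input exactly.
rebuilt = ''.join(result)
print(rebuilt == s)           # False
print(len(s) - len(rebuilt))  # 7, the length of 'hello, '
```

Whether that matters depends on the input; for strings made up only of adjacent tags, as in the question, nothing is lost.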
RomanPerekhrest
1

Based on the answers by other people, I made this.

It isn't as clean as I had wanted, but it seems to work. I had originally wanted to not re-add the characters after split.

Here, I got rid of one extra argument by combining the two characters into a string. Anyways,

def split_between(string, chars):
    # compare by value with !=; "is not" tests object identity and is unreliable for ints
    if len(chars) != 2: raise IndexError("Argument chars must contain two characters.")

    result_list = [chars[1] + line + chars[0] for line in string.split(chars)]

    result_list[0] = result_list[0][1:]
    result_list[-1] = result_list[-1][:-1]

    return result_list

Credit goes to @cforeman and @Ajax1234.
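For the original goal of not re-adding the characters after the split, one possible alternative is a zero-width regular expression: a lookbehind for the first character and a lookahead for the second, so the split point falls between them and nothing is consumed. This is a sketch rather than a drop-in replacement. The name `split_between_keep` is hypothetical, it raises `ValueError` instead of `IndexError`, and it assumes Python 3.7 or later, where `re.split` accepts patterns that can match an empty string.

```python
import re

def split_between_keep(string, chars):
    """Split wherever chars[0] is immediately followed by chars[1], keeping both."""
    if len(chars) != 2:
        raise ValueError("Argument chars must contain two characters.")
    # (?<=X) requires X just before the split point and (?=Y) requires Y just after;
    # neither consumes any text, so nothing has to be added back afterwards.
    pattern = "(?<={})(?={})".format(re.escape(chars[0]), re.escape(chars[1]))
    return re.split(pattern, string)

print(split_between_keep("<head><title>Example Title</title></head>", "><"))
# ['<head>', '<title>Example Title</title>', '</head>']
```

`re.escape` is used so that characters with a special meaning in regular expressions, such as `.` or `|`, can also be passed in `chars`.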

Jacob Birkett
0

Or even simpler, this:

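html = "<head><title>Example Title</title></head>"  # renamed from input to avoid shadowing the builtin
parts = [p if p.endswith('>') else p + '>' for p in html.split('><')]  # restore trailing '>'
print([p if p.startswith('<') else '<' + p for p in parts])  # restore leading '<'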
whackamadoodle3000