Python/Regex: I'm looking for the most elegant way to split an HTML string into an array of strings, where the delimiter is a script tag and the delimiters are kept in the result. So for:

  <p> paragraph one </p>
  <script src="https://something.com/script.js"></script> 
  <p> paragraph two </p>
  <p> paragraph three </p>
  <script src="https://something.com/script.js"/>
  <p> paragraph four </p>

I would get the following array of strings:

[
  '<p> paragraph one </p>',
  '<script src="https://something.com/script.js"></script>',
  '<p> paragraph two </p><p> paragraph three </p>',
  '<script src="https://something.com/script.js"/>',
  '<p> paragraph four </p>'
]

I would appreciate a pointer in the right direction.

JasonGenX

2 Answers

If you don't want to install external packages, this regex, combined with a split on newlines, should do the job, provided each script tag sits on its own line:

import re

# s holds the HTML string from the question
data = re.sub(r'</p>\n.*?<p>', '</p><p>', s).split('\n')

for line in data:
    print(line)

Outputs:

  <p> paragraph one </p>
  <script src="https://something.com/script.js"></script>
  <p> paragraph two </p><p> paragraph three </p>
  <script src="https://something.com/script.js"/>
  <p> paragraph four </p>
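
For a self-contained run, here is a minimal sketch. It assumes the question's HTML is stored in s with one tag per line and no indentation (both the variable contents and that layout are assumptions). Note that this approach merges adjacent paragraph lines and relies on the line breaks; it does not match the script tags themselves.

```python
import re

# Assumed input: the question's HTML, one tag per line, no indentation.
s = "\n".join([
    '<p> paragraph one </p>',
    '<script src="https://something.com/script.js"></script>',
    '<p> paragraph two </p>',
    '<p> paragraph three </p>',
    '<script src="https://something.com/script.js"/>',
    '<p> paragraph four </p>',
])

# Join a closing </p> with a <p> that opens on the next line,
# then split what is left on newlines.
data = re.sub(r'</p>\n.*?<p>', '</p><p>', s).split('\n')

for line in data:
    print(line)
```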
Ronald

As Ronald shows, you can manipulate HTML with regex to some degree, even if it is usually not a good idea. But you wanted the script tags to be the delimiters, right? And you wanted the delimiters to be included in the output.

Matching both styles, the paired <script ...></script> form and the self-closing <script .../> form, as alternatives (|) inside a capturing group should do the trick, since re.split then keeps the delimiters.
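
To see why the group matters, here is a small sketch on a made-up string (the <br/> delimiter and the variable names are only for illustration): re.split drops the delimiter by default, but keeps it when the pattern is wrapped in a capturing group.

```python
import re

# Without a capturing group, the delimiter is dropped from the result.
without_group = re.split(r'<br/>', 'one<br/>two')

# With a capturing group, the matched delimiter is kept as its own item.
with_group = re.split(r'(<br/>)', 'one<br/>two')

print(without_group)  # ['one', 'two']
print(with_group)     # ['one', '<br/>', 'two']
```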

Full code (python3):

import re

text = '''
<p> paragraph one </p>
<script src="https://something.com/script.js"></script>
<p> paragraph two </p>
<p> paragraph three </p>
<script src="https://something.com/script.js"/>
<p> paragraph four </p>
'''

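# Raw string avoids invalid-escape warnings; .*? is non-greedy, so each
# match stops at the nearest closing delimiter instead of the last one.
regex = r'(<script.*?</script>|<script.*?/>)'
m = re.split(regex, text.replace("\n", ""))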
print(m)

outputs:

['<p> paragraph one </p>', '<script src="https://something.com/script.js"></script>', '<p> paragraph two </p><p> paragraph three </p>', '<script src="https://something.com/script.js"/>', '<p> paragraph four </p>']
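
If the input may be indented, contain several paired script tags, or start or end with a script tag, a slightly more defensive variant could look like the sketch below. The names SCRIPT_TAG and split_on_script_tags are hypothetical, and this is still a regex approximation, not a real HTML parser.

```python
import re

# Self-closing form first, so it is not swallowed by the paired form.
SCRIPT_TAG = re.compile(
    r'(<script\b[^>]*/>|<script\b[^>]*>.*?</script>)',
    re.IGNORECASE | re.DOTALL,
)

def split_on_script_tags(html):
    """Split html on script tags, keeping the tags as separate items."""
    # Drop whitespace between tags (stands in for the newline removal above).
    compact = re.sub(r'>\s+<', '><', html.strip())
    # re.split yields empty strings around leading, trailing, or adjacent
    # delimiters; filter those out.
    return [part for part in SCRIPT_TAG.split(compact) if part]

sample = '''
  <p> paragraph one </p>
  <script src="https://something.com/script.js"></script>
  <p> paragraph two </p>
  <p> paragraph three </p>
  <script src="https://something.com/script.js"/>
  <p> paragraph four </p>
'''

for part in split_on_script_tags(sample):
    print(part)
```

Putting the self-closing alternative first is a design choice: otherwise a self-closing tag followed later by a paired one could be matched all the way to that later </script>.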
toftis