
I have a dataframe column containing html.

I'm trying to extract everything between the list tags (<li> and </li>) and return it in a new column named 'output'.

I'd like to include the <li> and </li> tags too as part of the output.

Minimal Reproducible Example:

import pandas as pd

data = {
    'ID': ['1', '2'],
    'Description': ['blah blah <li>Point 1</li>blah blah<li>Point 2</li>blah blah blah', 'blah<li>Point1</li>blah<li>Point 2</li>']}

df = pd.DataFrame(data)
df['new'] = df.Description.apply(lambda st: st[st.find("<li>")+1:st.find("</li>")])
print(df)

Desired Output

  ID                                        Description                            output
0  1  blah blah <li>Point 1</li>blah blah<li>Point 2...  <li>Point 1</li><li>Point 2</li>
1  2            blah<li>Point1</li>blah<li>Point 2</li>   <li>Point1</li><li>Point 2</li>

What I've tried: Although there appear to be a lot of solutions around ('extracting substrings between two strings'), none of them comes close.

For example, this question (and many other results) only returns the first instance: "Extract substring between two characters in pandas". I need to extract all instances.

Also I haven't seen any that will preserve the <li> tags.

Lee Roy
  • Depending on the variability of your inputs, you may want to look into an HTML parser and iterate through each record. If there is a fixed number of `<li>` tags, you could use a regex like this: https://regex101.com/r/0gyxMk/1 – Simeon Oct 11 '22 at 15:20
  • Thanks very much for taking the time to answer. It looks like I've got some reading to do! Appreciate the help. – Lee Roy Oct 11 '22 at 15:50
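
One possible way to get every `<li>...</li>` element, tags included, is a non-greedy regex combined with `re.findall`, which returns all matches rather than only the first. This is a minimal sketch, not a full HTML parser: it assumes the `<li>` tags have no attributes and are not nested. The names `LI_PATTERN` and `extract_li_items` are hypothetical and chosen only for illustration.

```python
import re

# Non-greedy pattern: each match is one complete <li>...</li> element, tags included.
# Assumption: the <li> tags carry no attributes and are never nested.
LI_PATTERN = re.compile(r"<li>.*?</li>", flags=re.DOTALL)


def extract_li_items(html: str) -> str:
    """Return every <li>...</li> element found in `html`, concatenated in order."""
    return "".join(LI_PATTERN.findall(html))


# The two sample strings from the question's 'Description' column.
descriptions = [
    "blah blah <li>Point 1</li>blah blah<li>Point 2</li>blah blah blah",
    "blah<li>Point1</li>blah<li>Point 2</li>",
]

results = [extract_li_items(d) for d in descriptions]
print(results)
```

With the DataFrame from the question, `df['output'] = df['Description'].apply(extract_li_items)` should fill the new column. A pandas-only variant along the same lines would be `df['Description'].str.findall(r'<li>.*?</li>').str.join('')`. If the real HTML is messier (attributes, nesting, malformed tags), an HTML parser as suggested in the comment above is likely the safer choice.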