I'm trying to extract information from a large webpage using Beautiful Soup 4. The information I want is contained within one particular div, which I can extract without problem:

passage = soup.find("div", class_="desired_div")

I then want to add tags before the extracted part of the tree (for example, wrap the extracted div in another div) in preparation for outputting the extracted info as another HTML file.

With BS4, how do I insert tags before the extracted portion of the parse tree, or wrap the extracted portion of the parse tree? BS4 seems to only allow me to operate on the children of the extracted div (as per the documentation), but I want to insert before or wrap the extracted div.

Westerley

2 Answers

BeautifulSoup is primarily intended for extracting content from an HTML file rather than for building HTML elements. There is, however, another library, Karrigell, that can be used to achieve what you are trying to do.

EDIT: BeautifulSoup 4.2.1 supports creating new tags and adding them to the HTML. BeautifulSoup.new_tag() creates a new HTML tag, and insert_before() and insert_after() allow you to add it before or after a given element.
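
As a rough illustration of the calls mentioned in the edit, here is a minimal sketch. The sample HTML, the `h1`/`p` tags and their text are made up for the example, and it assumes the built-in "html.parser" backend; only `new_tag()`, `insert_before()` and `insert_after()` come from the answer itself.

```python
from bs4 import BeautifulSoup

# Hypothetical sample page, standing in for the large webpage in the question.
html = '<html><body><p>intro</p><div class="desired_div">content</div></body></html>'
soup = BeautifulSoup(html, "html.parser")

passage = soup.find("div", class_="desired_div")

# Create a new tag and place it immediately before the div.
header = soup.new_tag("h1")
header.string = "Extracted passage"
passage.insert_before(header)

# Create another new tag and place it immediately after the div.
footer = soup.new_tag("p")
footer.string = "End of passage"
passage.insert_after(footer)

result = str(soup.body)
print(result)
```

Note that this only works while the div still has a parent in the tree; a tag that has already been detached with extract() has nothing to insert before or after.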

Lucas Siqueira
shaktimaan
  • I can certainly see not writing an HTML page from scratch using BS. However, when the page being created just adds or wraps a few tags around the HTML code I've extracted using BS, and BS HAS the capability, surely it's a bit overkill to use Karrigell? – Westerley Mar 07 '14 at 06:06
  • Yup, just checked, and BS has the routines `.new_tag()`, `insert_before()` and `insert_after()` to make that happen. Edited my answer to reflect that. – shaktimaan Mar 07 '14 at 23:21
In case anyone else is looking for a solution, here's what I ended up doing.

First, find the div of interest:

tag = soup.find("div", class_="desired_div")

Next, wrap another 'placeholder' div around the div of interest:

newtag = soup.new_tag("div")
newtag['class'] = "placeholder"
tag.wrap(newtag)

Then extract the placeholder div:

passage = soup.find("div", class_="placeholder").extract()

The div of interest is now a child of the extracted portion of the parse tree, so it has a parent and is ready to have tags inserted before it or wrapped around it.
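
Put together, the steps above might look like the sketch below. The sample HTML and the `h2` heading are invented for the example, and it assumes the "html.parser" backend; the wrap-then-extract sequence is the one described in this answer.

```python
from bs4 import BeautifulSoup

# Hypothetical sample page; the real one would be the large webpage.
html = '<html><body><p>other</p><div class="desired_div"><p>content</p></div></body></html>'
soup = BeautifulSoup(html, "html.parser")

# Step 1: find the div of interest.
tag = soup.find("div", class_="desired_div")

# Step 2: wrap a placeholder div around it while it is still in the tree.
wrapper = soup.new_tag("div")
wrapper["class"] = "placeholder"
tag.wrap(wrapper)

# Step 3: extract the placeholder, which carries the div of interest with it.
passage = soup.find("div", class_="placeholder").extract()

# The div of interest now has a parent, so a tag can be inserted before it.
heading = soup.new_tag("h2")
heading.string = "Title"
tag.insert_before(heading)

output = str(passage)
print(output)
```

Since wrap() returns the new wrapper, steps 2 and 3 could probably be shortened to `passage = tag.wrap(wrapper).extract()`, which avoids the second find().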

I'm certainly open to a better solution, but this does seem to work.

Westerley