
I have managed to get only the h2 and h3 tags printed out, but I want every element from the first h2 to the second h2 tag (the data in those tags relates only to English), like in this picture. Then I would like to check that data for categories like noun and verb and, if they exist, print them out. I got stuck really hard here. This is what I've written so far:

import requests
from bs4 import BeautifulSoup

url = 'https://en.wiktionary.org/wiki/dog'
r = requests.get(url, headers={'User-agent': 'Mozilla/5.0 (Windows NT 6.2; WOW64)'})
data = r.text
soup = BeautifulSoup(data, 'html.parser')

content = soup.find_all('span', {'class': 'mw-headline'})
for item in content:
    print(item.text)
Eso Teric
  • what about [wikimedia/pywikibot-wiktionary](https://github.com/wikimedia/pywikibot-wiktionary/blob/master/wiktionarypage.py) from github – rebeling Oct 10 '15 at 20:18
  • Uhmm, no. It is said in the project that I have to do it with the bs – Eso Teric Oct 10 '15 at 22:59
  • Possible duplicate of [Has anyone parsed Wiktionary?](http://stackoverflow.com/questions/3364279/has-anyone-parsed-wiktionary) – Nemo Feb 13 '16 at 19:04

1 Answer


You might want to use lxml.etree for this, because it lets you use XPath expressions, which are perfect for this sort of thing. bs4 and etree are often used together in the same application: bs4 for the things bs4 makes easy, and etree for the things that benefit from XPath.

Here's an example of how to select the elements you want using etree and xpath. You can tweak it to pull the data you want from each element.

import requests
from lxml import etree

url = 'https://en.wiktionary.org/wiki/dog'
r = requests.get(url)
h = etree.HTMLParser()

tree = etree.fromstring(r.text, h)

# Select children of the div that have exactly one preceding h2 sibling,
# i.e. everything between the 'English' h2 and the next h2.
xp = "//div[h2[span[@id='English']]]/*[count(preceding-sibling::h2)=1]"
elements = tree.xpath(xp)

for e in elements:
    inner = e.xpath("span[@class='mw-headline']")
    for i in inner:
        print(i.text)
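Building on the answer above, here is a sketch of how the same XPath expression can be extended to check for categories like Noun and Verb, as the question asks. The HTML snippet is a hypothetical stand-in for the 2015-era Wiktionary layout so the example is self-contained; in a real run you would feed it r.text from the requests call above.

```python
from lxml import etree

# Hypothetical snippet mirroring the page structure (not fetched live).
html = """
<div>
  <h2><span id="English" class="mw-headline">English</span></h2>
  <h3><span class="mw-headline">Noun</span></h3>
  <p>dog (plural dogs)</p>
  <h3><span class="mw-headline">Verb</span></h3>
  <p>dog (third-person singular simple present dogs)</p>
  <h2><span id="Spanish" class="mw-headline">Spanish</span></h2>
  <h3><span class="mw-headline">Noun</span></h3>
</div>
"""

tree = etree.fromstring(html, etree.HTMLParser())

# Same expression as in the answer: children of the div with exactly one
# preceding h2 sibling, i.e. the English section.
xp = "//div[h2[span[@id='English']]]/*[count(preceding-sibling::h2)=1]"

wanted = {'Noun', 'Verb'}
found = []
for e in tree.xpath(xp):
    for span in e.xpath("span[@class='mw-headline']"):
        if span.text in wanted:
            found.append(span.text)

print(found)
```

Note that the Spanish Noun heading is excluded because it has two preceding h2 siblings, so only the English categories are collected.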

Getting started with XPath can be a high hurdle, but it's well worth the effort for all the problems it solves once you wrap your head around it. There is a plugin for Firebug called "FirePath" that lets you inspect an element, get one possible XPath expression for it, and try arbitrary XPath expressions against the page you are visiting. It's a big help for learning and debugging. https://addons.mozilla.org/en-US/firefox/addon/firepath/

tlastowka
  • Many thanks for this, it really works, but my project says I have to do it with BeautifulSoup. Is that even possible? I am starting to think that Wiktionary has some terribly organised tags – Eso Teric Oct 10 '15 at 21:09
  • I'm sure it is possible using some combination of bs4 and the base functionality of Python. You just have to figure out a strategy to navigate their schema; you can walk it like any tree of nested lists and hashes. Since you've already figured out how to navigate to the right general area of the doc in bs4, you could recursively scan through the entire tree looking for whatever element you want. That's what I used to do before I learned XPath :) – tlastowka Oct 10 '15 at 22:12
  • Hi, I know this isn't subject related, but I listened to your advice and started learning lxml. However, I can't solve this one: I don't know how to extract text that's inside a tag but after another tag, like this:
  • Some text here TEXT_I_NEED
  • – Eso Teric Oct 13 '15 at 23:51
  • Assuming I'm reading it right in the comment window, with just what you have there, a few things would work. The two quickest examples that come to mind are //b/text() and //li[@id='list']/b[1]/text(). These read "the text of every b element in the document" and "the text of the first b in every li with attribute id='list' in the document." I'd recommend loading the document in Firebug; it will give you at least one working example you can tweak. If it's in a bigger document, of course, those might change. – tlastowka Oct 14 '15 at 01:53
  • Always remember that you can also break things up or filter with normal Python, and you can switch back and forth, navigating some parts using XPath and other parts using other navigator functions in lxml and bs4. For example, you could do //b to get all the b elements, then loop over the elements in the resulting Python list, printing e.text from each one. – tlastowka Oct 14 '15 at 01:57
  • Actually, I realize now I'm reading your XML wrong. Try //li[@id='list']/text(). If that doesn't work, try //li[@id='list'] and then grab .text from the resulting element object. If it's more complicated than that, post a new question with formatted XML/HTML. – tlastowka Oct 14 '15 at 02:01
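On the bs4-only constraint raised in the comments: a minimal sketch of the sibling-walking strategy tlastowka describes, using only BeautifulSoup. The HTML snippet is a hypothetical stand-in for the page layout (assuming the 2015-era markup with headline spans inside h2/h3 tags); in practice you would pass r.text from the question's requests call instead.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet mirroring the page structure (not fetched live).
html = """
<div>
  <h2><span id="English" class="mw-headline">English</span></h2>
  <h3><span class="mw-headline">Noun</span></h3>
  <p>dog (plural dogs)</p>
  <h3><span class="mw-headline">Verb</span></h3>
  <p>dog (third-person singular simple present dogs)</p>
  <h2><span id="Spanish" class="mw-headline">Spanish</span></h2>
</div>
"""

soup = BeautifulSoup(html, 'html.parser')

# Locate the h2 containing the 'English' headline, then collect siblings
# until the next h2 closes the section.
start = soup.find('span', id='English').find_parent('h2')
section = []
for sib in start.find_next_siblings():
    if sib.name == 'h2':
        break
    section.append(sib)

# Check the collected section for categories like Noun and Verb.
categories = []
for tag in section:
    headline = tag.find('span', class_='mw-headline')
    if headline and headline.text in ('Noun', 'Verb'):
        categories.append(headline.text)

print(categories)
```

The break on the second h2 is what bounds the English section, mirroring the count(preceding-sibling::h2)=1 predicate in the XPath version.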