
I used a Python package named "pdfminer" to convert a PDF file to an HTML file, and I want to scrape useful information from it. How can I use XPath and Beautiful Soup on a local HTML file? I know how to use them on a web page given its URL, like this:

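import requests
from bs4 import BeautifulSoup
from lxml import html

# get tree: fetch the page and parse it with lxml (supports XPath)
def get_tree(url):
    r = requests.get(url)
    tree = html.fromstring(r.content)
    return tree

# get soup: fetch the page and parse it with Beautiful Soup
def get_soup(url):
    r = requests.get(url)
    data = r.text
    # name the parser explicitly so the result does not depend on what is installed
    soup = BeautifulSoup(data, "html.parser")
    return soup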

Could anyone give me an example of how to use XPath and Beautiful Soup when only an HTML file is given? Thanks.

  • You cannot use `xpath` with the `BeautifulSoup` html parser. Consider using [`CSS Selectors`](http://www.crummy.com/software/BeautifulSoup/bs4/doc/#css-selectors) instead. Aside from that, your question is too broad. Please try to make it more specific. – alecxe Nov 12 '14 at 22:22
  • @alecxe I think my question is clear and specific. I just found a way to use BeautifulSoup, but I have no idea about xpath. There must be some way to use xpath. – f4fc2791e4473eb2ba41b5ddb445b2 Nov 12 '14 at 22:28
  • By specific I mean, it would be much better if you provided the HTML code and noted the data you are trying to get from it. CSS selectors are very powerful and can be fully used in place of an xpath. – alecxe Nov 12 '14 at 22:35
  • And, besides, if you are asking about just `xpath` and `BeautifulSoup`, this is a duplicate of http://stackoverflow.com/questions/11465555/can-we-use-xpath-with-beautifulsoup. – alecxe Nov 12 '14 at 22:40
  • Sure, elementtree supports xpath, but what's the point of using both html parsers, BeautifulSoup and lxml? – alecxe Nov 13 '14 at 03:30

1 Answer


Eventually, I found the solution by digging into the API docs and googling. Here is how to get a soup or a tree, ready for Beautiful Soup or XPath queries, when the only input is an HTML file:

from bs4 import BeautifulSoup
from lxml import etree
with open("output.html") as f:  # read once and close the file
    doc = f.read()
soup = BeautifulSoup(doc, "html.parser")
tree = etree.HTML(doc)

Then you can query soup or tree to scrape the content you need from the HTML file.
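As a rough illustration of that last step, here is a minimal sketch that pulls the same text out once with XPath (lxml) and once with a CSS selector (Beautiful Soup). The `SAMPLE_HTML` string and the `span_texts` helper are made up for this example; real pdfminer output will have a different structure, so the expressions `//span/text()` and `div > span` are assumptions you would need to adapt.

```python
from bs4 import BeautifulSoup
from lxml import etree

# Hypothetical stand-in for the contents of output.html;
# actual pdfminer output will look different.
SAMPLE_HTML = """
<html><body>
<div><span style="font-size:12px">Invoice 1234</span></div>
<div><span style="font-size:9px">Total: 56.00</span></div>
</body></html>
"""

def span_texts(doc):
    """Return the text of every <span>: first via XPath, then via a CSS selector."""
    # lxml: parse the string into a tree and run an XPath query on it
    tree = etree.HTML(doc)
    via_xpath = [text.strip() for text in tree.xpath("//span/text()")]

    # Beautiful Soup: parse the same string and select with CSS instead
    soup = BeautifulSoup(doc, "html.parser")
    via_css = [span.get_text(strip=True) for span in soup.select("div > span")]

    return via_xpath, via_css

xpath_texts, css_texts = span_texts(SAMPLE_HTML)
print(xpath_texts)
print(css_texts)
```

To run it against a real file, read the file into a string first (as in the snippet above) and pass that string to `span_texts` in place of `SAMPLE_HTML`.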