Using XPath and Beautiful Soup on an HTML file converted from PDF

Question

I used a Python package named "pdfminer" to convert a PDF file to an HTML file, and I want to scrape useful information from it. How can I use XPath and Beautiful Soup on a local HTML file? I already know how to use them on a web page when I am given a link, like this:

import requests
from bs4 import BeautifulSoup
from lxml import html

# get tree
def get_tree(url):
    r = requests.get(url)
    tree = html.fromstring(r.content)
    return tree

# get soup
def get_soup(url):
    r = requests.get(url)
    data = r.text
    soup = BeautifulSoup(data, "lxml")
    return soup

Could anyone give me an example of how to use XPath and Beautiful Soup when only an HTML file is given? Thanks.

You cannot use an `xpath` using `BeautifulSoup` html parser. Consider using [`CSS Selectors`](http://www.crummy.com/software/BeautifulSoup/bs4/doc/#css-selectors) instead. Aside from that, your question is too broad. Please try to make it more specific. — alecxe, Nov 12 '14 at 22:22
@alecxe I think my question is clear and specific. I just found a way to use BeautifulSoup, but no idea on xpath. There must be some way to use xpath. — f4fc2791e4473eb2ba41b5ddb445b2, Nov 12 '14 at 22:28
By specific I mean, it would be much better if you provide an HTML code and note the data you are trying to get from it. CSS selectors are very powerful and can be fully used in place of an xpath. — alecxe, Nov 12 '14 at 22:35
And, besides, if you are asking about just `xpath` and `BeautifulSoup` - this is a duplicate of http://stackoverflow.com/questions/11465555/can-we-use-xpath-with-beautifulsoup. — alecxe, Nov 12 '14 at 22:40
Sure, elementtree supports xpath, but what's the point to use both html parsers: BeautifulSoup and lxml? — alecxe, Nov 13 '14 at 03:30

score 0 · Accepted Answer · answered Nov 13 '14 at 03:13

Eventually, I found the solution by digging into the API docs and googling. Here's how to get a soup or a tree when the only input is an HTML file, before using Beautiful Soup or XPath on it:

from bs4 import BeautifulSoup
from lxml import etree
with open("output.html") as f:
    doc = f.read()  # read once and close the file
soup = BeautifulSoup(doc, "lxml")
tree = etree.HTML(doc)

Then you can work with soup or tree to scrape the content you need from the HTML file.
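
As a rough illustration of that last step, here is a minimal sketch that loads a local file both ways and pulls out the same text with an XPath expression and with a CSS selector. The sample markup, the helper names `load_tree` and `load_soup`, and the temporary `output.html` are assumptions made up for this example; real pdfminer output will look different, so the selectors would need adjusting.

```python
import os
import tempfile

from bs4 import BeautifulSoup
from lxml import html

# Hypothetical stand-in for a converted file; not actual pdfminer output.
SAMPLE = """<html><body>
<div><span style="font-size:12px">Invoice 42</span></div>
<div><span style="font-size:9px">Total: 10.00</span></div>
</body></html>"""


def load_tree(path):
    # lxml.html.parse takes a filename; getroot() returns the <html> element
    return html.parse(path).getroot()


def load_soup(path):
    # BeautifulSoup reads the whole file object while constructing the soup
    with open(path, encoding="utf-8") as f:
        return BeautifulSoup(f, "lxml")


# Write the sample to a temporary file so the example is self-contained
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "output.html")
    with open(path, "w", encoding="utf-8") as f:
        f.write(SAMPLE)
    tree = load_tree(path)
    soup = load_soup(path)

# XPath on the lxml tree: text of every <span> directly inside a <div>
xpath_texts = tree.xpath("//div/span/text()")

# Equivalent CSS selector on the soup
css_texts = [span.get_text() for span in soup.select("div > span")]

print(xpath_texts)
print(css_texts)
```

Both lists should contain the same two strings here. In practice one library is usually enough: lxml if you want XPath, Beautiful Soup if CSS selectors or its search methods cover what you need.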
