In my code I'm trying to get the first line of text from a webpage into a Python variable. At the moment I'm using urlopen to fetch the whole page for each link I want to read. How do I read only the first line of text on the webpage?

My code:

import urllib2
import numpy as np

line_number = 10
id = np.arange(1, 5)  # ids 1 through 4
for n in id:
    link = urllib2.urlopen("http://www.cv.edu/id={}".format(n))
    l = link.read()  # the whole page as one string

I want to extract the words "Old car" from the following HTML code of the webpage:

<html>
    <head>
        <link rel="stylesheet">
        <style>
            .norm { font-family: arial; font-size: 8.5pt; color: #000000; text-decoration : none; }
            .norm:Visited { font-family: arial; font-size: 8.5pt; color: #000000; text-decoration : none; }
            .norm:Hover { font-family: arial; font-size: 8.5pt; color : #000000; text-decoration : underline; }
        </style>
    </head>
    <body>
<b>Old car</b><br>
<sup>13</sup>CO <font color="red">v = 0</font><br>
ID: 02910<br>
<p>
<p><b>CDS</b></p>
h_user
  • Will the first line always be inside a `<b>` tag? – Anand S Kumar Jul 06 '15 at 14:24
  • Like Anand is alluding to, if the first line is always in a `<b>` tag then you can use Python's built-in regex library, `re`, to grab whatever is between the `<b>` tags – amza Jul 06 '15 at 14:33
  • It's not very clear what you need. Do you want to extract the word "old car" from this web page or you want to know how to extract the first line of words on any webpage? – Joe T. Boka Jul 06 '15 at 15:00
  • Yes, it will always be in a `<b>` tag, but there may be other things later on in the webpage inside a `<b>` tag that I don't want. In this example the words I need to extract are "Old car", but on other webpages the exact words will be different while being in the same location in the HTML code each time. – h_user Jul 06 '15 at 15:16
  • What do you mean by "same location"? Because you can always expand your regex to take anything after `\n`, for example. If your regex matches everything between the `<b>` tags, you could also select only the first element of the list it returns. You can also use [string.find](https://docs.python.org/2/library/string.html#string.find), which returns the **first** index of where the `<b>` and `</b>` tags are; then just take the string between the two results. – amza Jul 06 '15 at 15:39
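
As a rough sketch of the two suggestions in these comments (a regex via `re`, and plain string searching with `str.find`), assuming the wanted text always sits in the first `<b>` element as in the sample page. The function names and the `SAMPLE` string are illustrative and not part of the original post:

```python
import re

# Shortened copy of the sample page from the question (illustrative only).
SAMPLE = """<html>
    <body>
<b>Old car</b><br>
<sup>13</sup>CO <font color="red">v = 0</font><br>
ID: 02910<br>
<p>
<p><b>CDS</b></p>"""


def first_bold_regex(markup):
    """Return the text of the first <b>...</b> pair, or None if there is none."""
    # Non-greedy match so we stop at the first closing tag, not the last one.
    match = re.search(r"<b>(.*?)</b>", markup, re.IGNORECASE | re.DOTALL)
    return match.group(1).strip() if match else None


def first_bold_find(markup):
    """Same result using str.find, which returns the first index or -1."""
    start = markup.find("<b>")
    if start == -1:
        return None
    start += len("<b>")
    end = markup.find("</b>", start)
    if end == -1:
        return None
    return markup[start:end].strip()


first_line = first_bold_regex(SAMPLE)
```

Both helpers only look at the first match, so the later `<b>CDS</b>` is ignored. Neither copes with attributes on the tag or nested markup, which is where a real HTML parser becomes the safer choice.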

2 Answers

If you are going to do this on many different webpages that might be written differently, you might find that BeautifulSoup is helpful.

http://www.crummy.com/software/BeautifulSoup/bs4/doc/

As the end of the Quick Start section shows, you can extract all the text from the page and then take whichever line you are interested in.
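
A minimal sketch of that idea, assuming the `bs4` package is installed and using the sample HTML from the question. The helper names are made up for illustration, and `"html.parser"` is just the parser bundled with Python; other parsers may build a slightly different tree:

```python
from bs4 import BeautifulSoup

# Sample page from the question, trimmed to the relevant parts.
SAMPLE_HTML = """<html>
    <head>
        <link rel="stylesheet">
        <style>
            .norm { font-family: arial; font-size: 8.5pt; }
        </style>
    </head>
    <body>
<b>Old car</b><br>
<sup>13</sup>CO <font color="red">v = 0</font><br>
ID: 02910<br>
<p>
<p><b>CDS</b></p>
</body>
</html>"""


def first_visible_line(markup):
    """Return the first non-empty piece of text inside <body>, or None."""
    soup = BeautifulSoup(markup, "html.parser")
    if soup.body is None:
        return None
    # stripped_strings yields each text node with surrounding whitespace removed.
    for text in soup.body.stripped_strings:
        return text
    return None


def first_bold_text(markup):
    """Return the text of the first <b> element, or None if there is none."""
    soup = BeautifulSoup(markup, "html.parser")
    first_b = soup.find("b")
    return first_b.get_text(strip=True) if first_b is not None else None


title = first_visible_line(SAMPLE_HTML)
```

`first_visible_line` matches the "take the first line of text" reading of the question, while `first_bold_text` relies on the text always being in the first `<b>` tag, as the asker confirmed in the comments.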

Keep in mind that this will only work for text that is present in the HTML itself. Some webpages use JavaScript extensively, and requests/BeautifulSoup will not be able to read content that is generated by JavaScript.

Using Requests and BeautifulSoup - Python returns tag with no text

See also an issue I have had in the past, which was clarified by user avi: Want to pull a journal title from an RCSB Page using python & BeautifulSoup

Community

Use XPath. It's exactly what we need.

XPath, the XML Path Language, is a query language for selecting nodes from an XML document.

The lxml Python library will help us with this. It is one of many options; libxml2, ElementTree, and PyXML are some of the others.

Using XPath

Something like the following, based on your existing code, will work:

import urllib2
import numpy as np
from lxml import html

line_number = 10
id = np.arange(1, 5)  # ids 1 through 4
for n in id:
    link = urllib2.urlopen("http://www.cv.edu/id={}".format(n))
    l = link.read()
    tree = html.fromstring(l)  # parse the page into an element tree
    print tree.xpath("//b/text()")[0]  # text of the first <b> element

The XPath query //b/text() basically says "get the text from the <b> elements on a page". The tree.xpath function call returns a list, and we select the first item using [0]. Easy.
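
Since tree.xpath returns a list, indexing with [0] raises an IndexError on a page that has no <b> element at all. A hedged sketch of a guard, assuming lxml is installed; the helper name `first_b_text_xpath` is made up for illustration:

```python
from lxml import html


def first_b_text_xpath(markup):
    """Return the text of the first <b> element, or None if the page has none."""
    tree = html.fromstring(markup)
    matches = tree.xpath("//b/text()")  # list of text nodes, possibly empty
    return matches[0].strip() if matches else None


with_bold = first_b_text_xpath(
    "<html><body><b>Old car</b><br><p><b>CDS</b></p></body></html>"
)
without_bold = first_b_text_xpath("<html><body><p>no bold here</p></body></html>")
```

Returning None lets the calling loop skip or log a page instead of stopping on the first one that does not follow the expected layout.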

An aside about Requests

The Requests library is a widely used, higher-level alternative to urllib2 for fetching webpages in code. It may save you some headaches later.

The complete program might look like this:

from lxml import html
import requests

for nn in range(1, 5):  # ids 1 through 4, matching np.arange(1, 5) in the question
    page = requests.get("http://www.cv.edu/id=%d" % nn)
    tree = html.fromstring(page.text)
    print tree.xpath("//b/text()")[0]

Caveats

The URLs didn't work for me, so you might have to tinker a bit. The concept is sound, though.

Reading from the webpages aside, you can use the following to test the XPath:

from lxml import html

tree = html.fromstring("""<html>
    <head>
        <link rel="stylesheet">
    </head>
    <body>
<b>Old car</b><br>
<sup>13</sup>CO <font color="red">v = 0</font><br>
ID: 02910<br>
<p>
<p><b>CDS</b></p>""")

print tree.xpath("//b/text()")[0] # "Old car"
Ezra