extract text from website source code

Question

I want to extract information from a website link:

http://www.website.com

There is a string that appears a few times, "STRING TO CAPTURE", but I want to capture only its first occurrence. It will be inside the following structure:

<td width="10%" bgcolor="#FFFFFF"><font class="bodytext9">1-Jun-2013</font></td>
<td width="4%" bgcolor="#FFFFFF" align=center><font class="bodytext9">Sat</font></td>
<td width="4%" bgcolor="#FFFFFF" align="center"><font class="bodytext9">TIME</font></td>
<td width="15%" bgcolor="#FFFFFF" align="center"><a class="black_9" href="link1">Some Text here</a></td>
<td width="5%" bgcolor="#FFFFFF" align="center"><font class="bodytext9"><img src="img/colors/pink.gif"></font></td>
<td width="5%" bgcolor="#FFFFFF" align="center"></td>
<td width="5%" bgcolor="#FFFFFF" align="center"><font class="bodytext9">Another Text</font></td>
<td width="5%" bgcolor="#FFFFFF" align="center"></td>
<td width="5%" bgcolor="#FFFFFF" align="center"><font class="bodytext9"><img src="img/colors/white.gif"></font></td>
<td width="15%" bgcolor="#FFFFFF" align="center"><a class="black_9" href="link2">Here is also Text</a></td>
<td width="15%" bgcolor="#FFFFFF" align="center"><a href="LINKtoWeb" class=list><u>STRING TO CAPTURE</u></a></td>
<td width="4%" bgcolor="#FFFFFF" align="center"><a target="_new" href="AnotherLink"><img src="img/img2.gif" border="0"></a></td>
</tr>

This is a fixed format: between the <tr> tags there are 12 lines, each starting with <td> and containing other tags. I want to extract the text in each line, e.g.

1-Jun-2013
Sat
TIME
Some Text here
...
STRING TO CAPTURE

I also want to extract the link on the line containing "STRING TO CAPTURE", which is:

LINKtoWeb

In my opinion, Python could work well for this task, but I am too new to Python to get it working, and I hope the Python experts here can show me how. I have no idea where to start; I searched around and found that this could be a solution:

use YAML;
my $data = Load(http://www.website.com);
say $data->{"<tr>"}->{"<td>"}->{"STRING TO CAPTURE"};

But I don't know how to deal with all the text in these 12 lines.

I need to run this process on my server when the website is loaded. Can the tools you suggest be used for that purpose, and what are the steps? – user1314404, May 30 '13 at 06:21

score 1 · Accepted Answer · edited May 23 '17 at 10:25

Download and install BeautifulSoup, then collect the visible text nodes of the page:

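import re
import urllib

import BeautifulSoup  # BeautifulSoup 3, running on Python 2

# Download the page source and parse it
html = urllib.urlopen('http://www.website.com').read()
soup = BeautifulSoup.BeautifulSoup(html)
texts = soup.findAll(text=True)  # every text node in the document

def get_stuff(element):
    # Skip text that sits inside tags which are never rendered
    if element.parent.name in ['style', 'script', '[document]', 'head', 'title']:
        return False
    # Skip HTML comments (re.DOTALL so multi-line comments match too)
    elif re.match('<!--.*-->', str(element), re.DOTALL):
        return False
    return True

visible_texts = filter(get_stuff, texts)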

source - BeautifulSoup Grab Visible Webpage Text
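
The snippet above returns every visible text node on the page. To get only the row the question asks about, something along the following lines might work. It is a minimal sketch, not the answer author's code: it assumes Python 3 and the newer `bs4` package (installed as `beautifulsoup4`) rather than the Python 2 / BeautifulSoup 3 setup above, and it runs on a shortened copy of the question's sample row instead of the downloaded page. The helper name `first_row_with` is made up for illustration.

```python
from bs4 import BeautifulSoup

# Shortened copy of the row from the question; in practice `html` would be
# the downloaded page source. The wrapping <table><tr> is an assumption.
html = """
<table><tr>
<td width="10%" bgcolor="#FFFFFF"><font class="bodytext9">1-Jun-2013</font></td>
<td width="4%" bgcolor="#FFFFFF" align=center><font class="bodytext9">Sat</font></td>
<td width="15%" bgcolor="#FFFFFF" align="center"><a class="black_9" href="link1">Some Text here</a></td>
<td width="5%" bgcolor="#FFFFFF" align="center"></td>
<td width="15%" bgcolor="#FFFFFF" align="center"><a href="LINKtoWeb" class=list><u>STRING TO CAPTURE</u></a></td>
</tr></table>
"""

def first_row_with(html, target):
    """Return (cell_texts, href) for the first row whose link text is `target`."""
    soup = BeautifulSoup(html, "html.parser")
    # find_all walks the document in order, so the first hit is the first occurrence
    for anchor in soup.find_all("a", href=True):
        if anchor.get_text(strip=True) == target:
            row = anchor.find_parent("tr")
            if row is None:
                continue
            texts = [td.get_text(strip=True) for td in row.find_all("td")]
            return texts, anchor["href"]
    return None, None

cells, link = first_row_with(html, "STRING TO CAPTURE")
print(cells)  # one entry per <td>; empty cells give an empty string
print(link)
```

Matching on the link text and then walking up to the enclosing `<tr>` avoids depending on the exact number or order of cells in the row.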

edited May 23 '17 at 10:25 by Community · answered May 30 '13 at 06:25 by Srikar Appalaraju

So for my server and website, do I need to install it somewhere? – user1314404 May 30 '13 at 06:42
Install this Python package on the machine where you will be running the crawling Python script. `import BeautifulSoup` should work without error... – Srikar Appalaraju May 30 '13 at 07:07
``from bs4 import BeautifulSoup``: BeautifulSoup is provided through a package called bs4, which also provides other functionality, including ``UnicodeDammit``. – Balthazar Rouberol May 30 '13 at 07:20
My server supports Python 2.7 (it is installed on all their servers). Is that OK for BeautifulSoup to run? Where do I need to copy BeautifulSoup so that I can use "import BeautifulSoup" in my code? Sorry for my stupidity – user1314404 May 30 '13 at 07:26
Yes, it should work. Please follow these instructions for installing - http://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-beautiful-soup – Srikar Appalaraju May 30 '13 at 08:05
This doesn't show where to put it on the server. I spent some time reading it and searching around, but found no answer... – user1314404 May 30 '13 at 09:37
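
If installing a third-party package on the server turns out to be impractical, the same extraction can be approximated with only the standard library. The sketch below swaps BeautifulSoup for Python 3's built-in `html.parser.HTMLParser`, so it is a different technique from the answer's. The class name `RowCollector`, the second sample row, and the `OtherLink` value are made up for illustration, and the sketch assumes every row is closed with `</tr>` as in the question's sample.

```python
from html.parser import HTMLParser

class RowCollector(HTMLParser):
    """Collect, for each table row, a list of (cell_text, href) pairs."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows
        self._row = []     # cells of the row being read
        self._text = None  # list of text pieces, or None when outside a <td>
        self._href = None  # href of a link seen in the current cell

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self._text, self._href = [], None
        elif tag == "a" and self._text is not None:
            self._href = dict(attrs).get("href")

    def handle_data(self, data):
        if self._text is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._text is not None:
            self._row.append(("".join(self._text).strip(), self._href))
            self._text = None
        elif tag == "tr":
            if self._row:
                self.rows.append(self._row)
            self._row = []

# Two shortened rows; the second one is hypothetical.
sample = """
<tr>
<td><font class="bodytext9">1-Jun-2013</font></td>
<td align="center"><a href="LINKtoWeb" class=list><u>STRING TO CAPTURE</u></a></td>
</tr>
<tr>
<td><font class="bodytext9">2-Jun-2013</font></td>
<td align="center"><a href="OtherLink" class=list><u>STRING TO CAPTURE</u></a></td>
</tr>
"""

target = "STRING TO CAPTURE"
parser = RowCollector()
parser.feed(sample)

# Rows are stored in document order, so the first match is the first occurrence
first = next((row for row in parser.rows
              if any(text == target for text, _ in row)), None)
texts = [text for text, _ in first]
link = next(href for text, href in first if text == target)
print(texts)
print(link)
```

This needs nothing beyond a stock Python 3 installation, at the cost of more code and less tolerance for malformed HTML than BeautifulSoup offers.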
