I'm very new at this; it's my first time writing any kind of web-related script. I'm trying to create a script that submits a variable URL in a browser and then reads data from a specific DOM element of the resulting page.

Basically, I have a huge list of words, and I want to automate the process of going to URLs that end in each word (e.g., if my list were ['apple', 'banana', 'carrot'] and my base URL were www.example.com, I would want to go to www.example.com/apple, www.example.com/banana, and www.example.com/carrot). On each page, I already know the specific DOM element I want to read data from, and I want that data returned to me.

How would I go about doing this? Any pointers in the right direction would be great! Thanks in advance :)

  • In what language? Also, the idea of Stack Overflow is that you do research yourself and try making it work, and when you run into problems, ask those as questions. We don't really like "I need X" "questions". – Jasper Jul 31 '14 at 07:30
  • You can take a look at bash scripting and curl or wget to get web pages content. Then you can use regexp for retrieving dom elements ... It's a proposition :) – Ko2r Jul 31 '14 at 07:31
  • @Ko2r ‘use regexp for retrieving dom elements’ sounds like a [bad idea](http://stackoverflow.com/a/1732454/418066)! – Biffen Jul 31 '14 at 07:34
  • @Biffen That's true but in some cases it can be sufficient ... Maybe python html parser can be a good solution ! – Ko2r Jul 31 '14 at 07:41

1 Answer

I'd suggest using Python: fetch the HTML pages with the urllib2 library (urllib.request in Python 3), then parse them with the lxml library. Extracting the content of a specific, known DOM element is then as simple as:

import lxml.html
import urllib2

# Fetch the page and read the raw HTML
response = urllib2.urlopen('http://example.com/abc/123')
html_text = response.read()

# Parse the HTML into an element tree
parsed = lxml.html.document_fromstring(html_text)

# xpath() returns a list of matching elements, so take the first one
result = parsed.xpath('/html/body/some/element/path')
print result[0].text

For other types of data extraction (attributes, etc) see the LXML documentation; it's pretty easy to use.
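
The snippet above handles a single URL; to cover a whole word list like the one in the question, the same approach can be wrapped in a loop. Here is a minimal sketch in Python 3, assuming lxml is installed. BASE_URL, ELEMENT_XPATH, and the helper names build_url, extract_text, and scrape are made up for illustration and are not part of any library, so swap in the real site and element path.

```python
from urllib.parse import quote
from urllib.request import urlopen

import lxml.html

# Hypothetical values for illustration; replace with the real site and element.
BASE_URL = "http://www.example.com"
ELEMENT_XPATH = '//*[@id="result"]'


def build_url(base_url, word):
    """Append one word to the base URL, percent-encoding unsafe characters."""
    return base_url.rstrip("/") + "/" + quote(word)


def extract_text(html_text, xpath):
    """Return the text of the first element matching xpath, or None."""
    parsed = lxml.html.document_fromstring(html_text)
    matches = parsed.xpath(xpath)  # xpath() returns a list of matches
    if not matches:
        return None
    return matches[0].text_content().strip()


def scrape(words, base_url=BASE_URL, xpath=ELEMENT_XPATH):
    """Fetch each word's page and map the word to the extracted text."""
    results = {}
    for word in words:
        with urlopen(build_url(base_url, word)) as response:
            results[word] = extract_text(response.read(), xpath)
    return results

# Example call (performs real HTTP requests):
# scrape(["apple", "banana", "carrot"])
```

Keeping the URL building and the parsing in separate functions means each can be checked against a saved page before making any requests. Note that urlopen raises an exception for missing pages (e.g. a 404), so a real run over a huge list would probably want a try/except around the fetch and a short pause between requests.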

rmunn