
There are many posts here that ask how to do automated searches on Google. I chose to use BeautifulSoup and read many of the questions asked about it here, but I couldn't find a direct answer to my question, even though the task seems fairly commonplace. My code below is pretty self-explanatory; the bracketed sections are where I ran into trouble. (EDIT: By "ran into trouble" I mean that I couldn't figure out how to implement my pseudocode for those portions, and after reading the documentation and searching online for a similar question with code, I still didn't know how to do it.) If it helps, I think my problem is probably similar to that of anyone doing an automated search on PubMed to find specific articles of interest. Thanks very much.

#Find Description

from BeautifulSoup import BeautifulSoup
import csv
import urllib
import urllib2

input_csv = "Company.csv"
output_csv = "output.csv"

def main():
    with open(input_csv, "rb") as infile:
        input_fields = ("Name",) #note the trailing comma: ("Name") is a plain string, not a tuple
        reader = csv.DictReader(infile, fieldnames = input_fields)
        with open(output_csv, "wb") as outfile:
            output_fields = ("Name", "Description")
            writer = csv.DictWriter(outfile, fieldnames = output_fields)
            writer.writerow(dict((h,h) for h in output_fields))
            next(reader) #skip the header row of the input file
            for row in reader:
                search_term = row["Name"]
                url = "http://google.com/search?q=%s" % urllib.quote_plus(search_term)

                #STEP ONE: Enter "search term" into Google Search
                #req = urllib2.Request(url, None, {'User-Agent':'Google Chrome'} )
                #res = urllib2.urlopen(req)
                #dat = res.read()
                #res.close()
                #BeautifulSoup(dat)


                #STEP TWO: Find Description
                #if there is a wikipedia page for the entity:
                    #return first sentence of wikipedia page
                #if other site:
                    #return all sentences that have the keyword "keyword" in them

                #STEP THREE: Return Description as "google_search" variable

                first_row["Company_Description"] = google_search
                writer.writerow(first_row)
                first_row = next_row

if __name__ == "__main__":
    main()
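
In case it helps anyone reading this, here is a rough sketch of what I think STEPS ONE through THREE might look like. Both helper names (google_result_links and sentences_with_keyword) are hypothetical names of my own, the h3/"r" selector is an assumption based on Google's result markup at the time (it changes often), and scraping results this way is against Google's terms of service, so a search API would be the supported route:

#hedged sketch of STEPS ONE-THREE, assuming BeautifulSoup 3 and
#Google's result markup as of late 2012
import urllib
import urllib2
from BeautifulSoup import BeautifulSoup

def google_result_links(search_term):
    #STEP ONE: fetch the results page and parse out the result links
    url = "http://google.com/search?q=%s" % urllib.quote_plus(search_term)
    req = urllib2.Request(url, None, {'User-Agent': 'Google Chrome'})
    res = urllib2.urlopen(req)
    dat = res.read()
    res.close()
    soup = BeautifulSoup(dat)
    #each organic result title was an <h3 class="r"> wrapping an <a> at the
    #time of writing; this is an assumption, not a stable interface, and the
    #hrefs may be Google redirect URLs ("/url?q=...") that need unwrapping
    links = []
    for h3 in soup.findAll('h3', {'class': 'r'}):
        a = h3.find('a', href=True)
        if a is not None:
            links.append(a['href'])
    return links

def sentences_with_keyword(page_url, keyword):
    #STEP TWO for non-Wikipedia sites: crude split on ". " and keep the
    #sentences that mention the keyword
    req = urllib2.Request(page_url, None, {'User-Agent': 'Google Chrome'})
    dat = urllib2.urlopen(req).read()
    soup = BeautifulSoup(dat)
    text = ' '.join(soup.findAll(text=True))
    return [s.strip() for s in text.split('. ') if keyword in s]

With these, google_search in STEP THREE could be set to the wiki() output from the ADDENDUM below when one of the links points at en.wikipedia.org, and to the sentences_with_keyword output otherwise.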

ADDENDUM

For anyone working on this or looking at it, I came up with a suboptimal solution that I'm still finishing up, but I thought I would post it in case it helps anyone else who comes to this page. Basically, rather than dealing with the issue of deciding which webpage to select, I just added an initial step that runs all the searches against Wikipedia directly. It's not what I want, but at least it will make it easier to get a subset of the entities. The code is in two files (Wikipedia.py and wiki_test.py):

#Wikipedia.py

from BeautifulSoup import BeautifulSoup
import csv
import urllib
import urllib2
import wiki_test


input_csv = "Name.csv"
output_csv = "WIKIPEDIA.csv"

def main():
    with open(input_csv, "rb") as infile:
        input_fields = ("A", "C", "E", "M", "O", "N", "P", "Y")
        reader = csv.DictReader(infile, fieldnames = input_fields)
        with open(output_csv, "wb") as outfile:
            output_fields = ("A", "C", "E", "M", "O", "N", "P", "Y", "Description")
            writer = csv.DictWriter(outfile, fieldnames = output_fields)
            writer.writerow(dict((h,h) for h in output_fields))
            next(reader) #skip the header row of the input file
            for row in reader:
                print(row)
                print(row["A"])
                search_term = row["A"]
                #print(search_term)
                result = wiki_test.wiki(search_term)
                first_row["Description"] = result
                writer.writerow(first_row)
                first_row = next_row

if __name__ == "__main__":
    main()

And a helper module (wiki_test.py), based on the post Extract the first paragraph from a Wikipedia article (Python):

import urllib
import urllib2
from BeautifulSoup import BeautifulSoup

def wiki(article):
    article = urllib.quote(article)
    opener = urllib2.build_opener()
    opener.addheaders = [('User-agent', 'Google Chrome')] #wikipedia needs this
    resource = opener.open("http://en.wikipedia.org/wiki/" + article)
    #try:
    #    urllib2.urlopen(resource)
    #except urllib2.HTTPError, e:
    #    print(e)
    data = resource.read()
    resource.close()
    soup = BeautifulSoup(data)
    #return the paragraph text (rather than printing it) so that
    #Wikipedia.py actually gets a value back for the Description column
    first_para = soup.find('div', id="bodyContent").p
    return ''.join(first_para.findAll(text=True))

I just need to fix it to handle HTTP 404 errors (i.e. no page found), and then this code will work for anyone wanting to find the basic company information that is available on Wikipedia. Again, I'd rather have something that works on a Google search and finds the relevant site and the relevant section of the site mentioning "keyword", but at least this current program gets us something.
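
A minimal sketch of that 404 handling, wrapping the opener.open() call in wiki() in a try/except (the "NOT FOUND" fallback string is just a placeholder of my own choosing):

import urllib
import urllib2
from BeautifulSoup import BeautifulSoup

def wiki(article):
    article = urllib.quote(article)
    opener = urllib2.build_opener()
    opener.addheaders = [('User-agent', 'Google Chrome')] #wikipedia needs this
    try:
        resource = opener.open("http://en.wikipedia.org/wiki/" + article)
    except urllib2.HTTPError, e:
        #no Wikipedia page for this entity; return a marker instead of crashing
        return "NOT FOUND (HTTP %d)" % e.code
    data = resource.read()
    resource.close()
    soup = BeautifulSoup(data)
    first_para = soup.find('div', id="bodyContent").p
    return ''.join(first_para.findAll(text=True))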

  • Watch this: http://www.youtube.com/watch?v=52wxGESwQSA&feature=player_detailpage#t=3080s . Also, what does "ran into trouble" mean? Do you get an error message, and what kind of error is it? Adding this information will save people's time and improve your chance of getting an answer. – root Oct 11 '12 at 19:39
  • Thanks man. The video is super helpful, I'll post with the solution once/if I get it after I finish watching it. – user7186 Oct 11 '12 at 21:09
  • it's ideal if you want to get started with web scraping :) – root Oct 11 '12 at 21:12

0 Answers