
I'm trying to write multiple rows into a CSV file using Python, and I've been piecing this code together for a while. My goal is to scrape the Oxford English Dictionary website (oed.com) for the words created in a given year and write the year and its words to a CSV file. I want each row to start with the year I'm searching for, followed by all of that year's words across the same row. Then I want to repeat this for multiple years.

Here's my code so far:

import requests
import re 
import urllib2
import os
import csv

year_search = 1550
subject_search = ['Law'] 

path = '/Applications/Python 3.5/Economic'
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor())
urllib2.install_opener(opener)

user_agent = 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
header = {'User-Agent':user_agent}
request = urllib2.Request('http://www.oed.com/', None, header)
f = opener.open(request)  
data = f.read()
f.close()
print 'database first access was successful'

resultPath = os.path.join(path, 'OED_table.csv')
htmlPath = os.path.join(path, 'OED.html')
outputw = open(resultPath, 'w')
outputh = open(htmlPath, 'w')
request = urllib2.Request('http://www.oed.com/search?browseType=sortAlpha&case-insensitive=true&dateFilter='+str(year_search)+'&nearDistance=1&ordered=false&page=1&pageSize=100&scope=ENTRY&sort=entry&subjectClass='+str(subject_search)+'&type=dictionarysearch', None, header)
page = opener.open(request)
urlpage = page.read()
outputh.write(urlpage)
new_word = re.findall(r'<span class=\"hwSect\"><span class=\"hw\">(.*?)</span>', urlpage)
print str(new_word)
outputw.write(str(new_word))
page.close()
outputw.close()

This prints the list of words that were found for the year 1550. I then tried to make the code write to a CSV file on my computer, which it does, but there are two things I can't get right:

  1. I want to be able to insert multiple rows into this and
  2. I want to have the year show up in the first spot

Next part of my code:

with open('OED_table.csv', 'w') as csvfile:
    fieldnames = ['year_search']
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)

    writer.writeheader()
    writer.writerow({'year_search': new_word})

I was using the csv module's online documentation as a reference for the second part of the code.

And just to clarify, I included the first part of the code in order to give perspective.

martineau
Kainesplain
  • Ok, I've probably spent more time on this than I should trying to understand where the dictionary was coming from (a Python dictionary, not OED) and what needed to be written. As far as I can tell, your expected output is just a list of `1550 | accomplice` as a row, i.e. just a year in column A and a word in column B, for every word in 1550? – roganjosh Oct 09 '16 at 15:55
  • Yes, I wanted it to be written with the year followed by the words, such that they're all in line. Sorry if that didn't make sense. – Kainesplain Oct 09 '16 at 15:58
  • And do you want to do this for all years in a range? If I understand your request properly, it would be easier to build that into an answer. A lot of your code is unnecessary and you're [using regex to parse html](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454). However, it appears to work in this case, so I'll formulate an answer now trying to use your approach – roganjosh Oct 09 '16 at 16:00
  • Also, do you want to do this for a range of years as I might as well factor that in, rather than you doing one year at a time – roganjosh Oct 09 '16 at 16:01
  • Yeah, I wanted to be able to do it for years in a range of 1550 through something like 1900, but I didn't know how to do that and figured I could input the years separately. – Kainesplain Oct 09 '16 at 16:02
  • OMG, it's probably a good job that I asked then :P – roganjosh Oct 09 '16 at 16:03
  • Yeah, I'm not a computer student : / . Just trying to make my thesis paper research easier to gather. – Kainesplain Oct 09 '16 at 16:11
  • You should probably use the [Python 2 documentation](https://docs.python.org/2/library/csv.html#module-csv) for the `csv` module as a reference. – martineau Oct 09 '16 at 16:23
  • I don't understand what's happened. When I wrote my first comments, I was getting a list fine. Now I appear to have been blocked or something as I only get one word... from running the same code – roganjosh Oct 09 '16 at 16:39
  • @martineau am I going mad? I was able to replicate the list just fine before to know that "accomplice" was a word. Now when I run the code (you just need to remove `os.path.join(path, ` for `resultPath` and `htmlPath`) I just get the word of the day. – roganjosh Oct 09 '16 at 16:46
  • @roganjosh: No you're not crazy. I, too, was getting multiple results for a while but now only one, `['leggiero']`. – martineau Oct 09 '16 at 16:48
  • @martineau thanks for the confirmation, I've spent ages debugging thinking I did something silly. OP: I don't think this is possible without an account, they appear to require a login after so many requests from the same IP – roganjosh Oct 09 '16 at 16:52

1 Answer


You really shouldn't parse HTML with a regex. That said, here's how to modify your code to produce a CSV file of all the words found.

Note: for unknown reasons, the list of result words varies in length from one execution to the next.

import csv
import os
import re
import requests
import urllib2

year_search = 1550
subject_search = ['Law']

path = '/Applications/Python 3.5/Economic'
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor())
urllib2.install_opener(opener)

user_agent = 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
header = {'User-Agent':user_agent}

# commented out because not used
#request = urllib2.Request('http://www.oed.com/', None, header)
#f = opener.open(request)
#data = f.read()
#f.close()
#print 'database first access was successful'

resultPath = os.path.join(path, 'OED_table.csv')
htmlPath = os.path.join(path, 'OED.html')
request = urllib2.Request(
    'http://www.oed.com/search?browseType=sortAlpha&case-insensitive=true&dateFilter='
    + str(year_search)
    + '&nearDistance=1&ordered=false&page=1&pageSize=100&scope=ENTRY&sort=entry&subjectClass='
    + str(subject_search)
    + '&type=dictionarysearch', None, header)
page = opener.open(request)

with open(resultPath, 'wb') as outputw, open(htmlPath, 'w') as outputh:
    urlpage = page.read()
    outputh.write(urlpage)

    new_words = re.findall(
        r'<span class=\"hwSect\"><span class=\"hw\">(.*?)</span>', urlpage)
    print new_words
    csv_writer = csv.writer(outputw)
    for word in new_words:
        csv_writer.writerow([year_search, word])

Here are the contents of the OED_table.csv file when it works:

1550,above bounden
1550,accomplice
1550,baton
1550,civilist
1550,garnishment
1550,heredity
1550,maritime
1550,municipal
1550,nil
1550,nuncupate
1550,perjuriously
1550,rank
1550,semi-
1550,torture
1550,unplace
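
As an aside on the regex caveat at the top of this answer, here is a minimal sketch of pulling out the same headwords with the standard library's HTML parser instead. It is written for Python 3 (the code above is Python 2), the `HeadwordParser` class and `extract_headwords` function are hypothetical names made up for this illustration, and it assumes the markup looks like the `<span class="hw">` elements that the regex targets. The sample string is hand-written, not a real OED response.

```python
from html.parser import HTMLParser


class HeadwordParser(HTMLParser):
    """Collect the text of every <span class="hw"> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._depth = 0    # greater than 0 while inside a headword span
        self._chunks = []  # text pieces of the headword being read

    def handle_starttag(self, tag, attrs):
        if tag != 'span':
            return
        if self._depth:
            # A span nested inside a headword span: track it so the
            # matching end tag does not close the headword too early.
            self._depth += 1
        elif 'hw' in (dict(attrs).get('class') or '').split():
            self._depth = 1
            self._chunks = []

    def handle_endtag(self, tag):
        if tag == 'span' and self._depth:
            self._depth -= 1
            if self._depth == 0:
                self.words.append(''.join(self._chunks).strip())

    def handle_data(self, data):
        if self._depth:
            self._chunks.append(data)


def extract_headwords(html):
    """Return the headwords found in an HTML string."""
    parser = HeadwordParser()
    parser.feed(html)
    parser.close()
    return parser.words


# Hand-written sample that mimics the markup the regex above matches.
sample = ('<span class="hwSect"><span class="hw">accomplice</span></span>'
          '<span class="hwSect"><span class="hw">baton</span></span>')
print(extract_headwords(sample))
```

Unlike the regex, this does not depend on the exact spacing or attribute order of the tags, though it still breaks if the site changes its class names.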
martineau
  • "leggiero" appears to be the word of the day. If you load the url in a browser, you're met with a login screen. While I don't doubt this is a decent approach written by you, I think OP will hit a roadblock after just a few requests. I don't think they allow scraping at all. – roganjosh Oct 09 '16 at 16:56
  • @roganjosh: All part of the reason I started my answer with a caveat. – martineau Oct 09 '16 at 17:03
  • True, the only reason I commented is because we both get the same word and OP needs to abandon this approach unless there is a login mechanism that is accessible (I didn't check to see if it was a paid subscription). We both ended up pulling a word from the login screen. Upvote anyway since you technically answer the question about writing to csv :) – roganjosh Oct 09 '16 at 17:06
  • @roganjosh: Thanks. If nothing else, the OP can see how to write multiple rows into a CSV file, regardless of the source of the data for them. I too was wondering how it was possible to do queries like this without some sort of OED account and related authorization. – martineau Oct 09 '16 at 17:11
  • @roganjosh , Yes , I'm seeing the word leggiero appear multiple times as well and I think it has to do with needing an account. However, this does answer my question and gives me a better idea as to how to proceed. Thank you very much. – Kainesplain Oct 09 '16 at 17:35
  • @martineau, if I wanted to make it so all of a year's words were on the same line, how would I go about doing that? – Kainesplain Oct 09 '16 at 18:01
  • Kainesplain: You could write them all as one row (without the year) by removing the `for word in new_words:` and making a single call to `csv_writer.writerow(new_words)`. You might need to make it conditional by using `if new_words: csv_writer.writerow(new_words)`. If you want to add the year at the beginning, use `csv_writer.writerow([year_search] + new_words)`. – martineau Oct 09 '16 at 18:30
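
The one-row-per-year layout described in the last comment can be sketched as follows. This is a minimal illustration in Python 3, not the answerer's code: `write_year_rows` and `words_by_year` are hypothetical names, the 1550 words are taken from the sample output above, and the 1551 entries are placeholders rather than real results. It assumes the words for each year have already been collected, so it does not touch the network.

```python
import csv
import io


def write_year_rows(out, words_by_year):
    """Write one CSV row per year: the year first, then that year's words."""
    writer = csv.writer(out)
    for year in sorted(words_by_year):
        words = words_by_year[year]
        if words:  # skip years for which nothing was found
            writer.writerow([year] + list(words))


# Hypothetical input: 1550 uses words from the sample output above,
# 1551 uses placeholders, and 1552 simulates a year with no results.
words_by_year = {
    1550: ['accomplice', 'baton', 'civilist'],
    1551: ['placeholder_a', 'placeholder_b'],
    1552: [],
}

buffer = io.StringIO()  # stands in for a real file in this sketch
write_year_rows(buffer, words_by_year)
print(buffer.getvalue())
```

To write to disk in Python 3, open the file with `open('OED_table.csv', 'w', newline='')` and pass it in place of `buffer`; in Python 2, as in the answer's code, the equivalent is opening the file in `'wb'` mode.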