
I'm trying to scrape data from a table on a website. I can pull the page source in, but in my program I get this error: TypeError: replace_with() takes exactly 2 arguments (3 given)

import urllib2
import bs4
import re

page_content = ""
for i in range(1,11):
    page = urllib2.urlopen("http://b3.caspio.com/dp.asp?appSession=627360156923294&RecordID=&PageID=2&PrevPageID=2&CPIpage="+str(i)+"&CPISortType=&CPIorderBy=")
    page_content += page.read()

soup =  bs4.BeautifulSoup(page_content)
tables = soup.find_all('tr')




file = open('crime_data.csv', 'w+')

for i in tables:
    i = i.replace_with('</td>' , (',')) # this is where I get the error
    i = re.sub(r'<.?td[^>]*>','',i)
    file.write(i + '\n')

Why is it giving me that error?

Also, in essence, I'm trying to take the data from the table and put it into a csv file. Any and all help would be greatly appreciated!

Josh

1 Answer


That replace_with function does not do what it appears you want it to. The linked docs state that PageElement.replace_with() removes a tag or string from the tree and replaces it with the tag or string of your choice. It takes a single replacement argument (plus the implicit self), not a find/replace pair, which is why calling it with two arguments gives you "takes exactly 2 arguments (3 given)".
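For illustration, here is a minimal made-up sketch (the HTML is invented, not from your page) of what replace_with actually does:

import bs4

soup = bs4.BeautifulSoup("<tr><td>Arson</td></tr>", "html.parser")
td = soup.td
td.replace_with("Arson,")          # one argument: the whole <td> element is swapped for this string
# td.replace_with('</td>', ',')    # two arguments: the TypeError from your traceback
print(soup)                        # <tr>Arson,</tr>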

From your code it looks more like you want to replace the whole end tag </td> with a , in an effort to get some sort of comma-separated data.

Perhaps you should instead just use the get_text method on your table rows and let it join the cell text with commas:

for i in tables:
    file.write(i.get_text(',').strip() + '\n')
file.close() ####### <----- VERY IMPORTANT TO CLOSE FILES

Note

I tested your code out and you are not really scraping what you are after. I played around with it and came up with this:

import urllib2
import bs4

def scrape_crimes(html,write_headers):
    soup = bs4.BeautifulSoup(html)                                # make the soup
    table = soup.find_all('table',class_=('cbResultSetTable',))   # search for the exact table you want, there are multiple nested tables on the pages you are scraping
    if len(table) > 0:                                            # if the table is found
        table = table[0]                                          # set the table to the first result
    else:
        return                                                    # no table found, no use scraping
    with open('crime_data.csv', 'a') as f:                        # opens file to append content
        trs = table.find_all('tr')                                # get all the rows in the table
        if write_headers:                                         # if we request that headers are written
            for th in trs[0].find_all('th'):                      # write each header followed by a comma
                f.write(th.get_text(strip=True).encode('utf-8')+',')  # ensure data is writable by calling encode
            f.write('\n')                                         # write a newline
        for tr in trs:                                            # for each table row in the table
            tds = tr.find_all('td')                               # get all the td elements
            if len(tds) > 0:                                      # if there are td elements (not true for header rows)
                for td in tds:                                    # for each td element
                    f.write(td.get_text(strip=True).encode('utf-8')+',') # add the data followed by a comma
                f.write('\n')                                    # finish the row off with a newline

open('crime_data.csv','w').close()                               # clear the file before running

for i in range(1,11):
    page = urllib2.urlopen("http://b3.caspio.com/dp.asp?appSession=627360156923294&RecordID=&PageID=2&PrevPageID=2&CPIpage="+str(i)+"&CPISortType=&CPIorderBy=")
    scrape_crimes(page.read(),i==1)                              # moved the scraping into a function; the second argument is only True for the first page
                                                                 # this ensures that you will get headers only at the top of your output file

I removed the use of the re library because, in general, regex and HTML do not play nicely together; the short explanation is that HTML is not a regular language.
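To see why, here is a contrived cell (not from the page you are scraping) where the question's pattern breaks as soon as an attribute contains a >:

import re

html = '<td title="x>y">value</td>'
# the pattern stops at the first '>', which here sits inside an attribute,
# so leftover attribute junk survives instead of just the cell text
print(re.sub(r'<.?td[^>]*>', '', html))   # y">value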

I also switched from using the coding pattern:

file = open('file_name','w')
# do stuff
file.close()

to this preferred pattern:

with open('file_name','w') as f:
    # do stuff

With the first pattern it is common to forget to close the file, which your provided code does. The with statement handles the close for you, so no worries there. Also, it is not good practice to shadow built-in names like file with your own variables.

I changed your script's pattern from combining all the pages' HTML into one string to scraping each page one by one, because concatenating everything first is not a good idea. You could run into memory issues with large pages; it is usually better to handle the data in chunks.

The next thing I did was look at the HTML of the page you were scraping. You were pulling all <tr> elements, but if you inspect the page closely you will see that the table you are after is itself nested inside a <tr> of an outer layout table, so one of your "rows" is a big nasty block of text containing the whole thing. Using bs4's optional class_ argument to look for the specific class on the table element leads straight to the data you are after.

The next thing I noticed was that the table headers would get pulled for every page, sprinkling your results with this redundant information. You would only want to pull this info the first time, so I added some logic for that.

I switched to using the .get_text method instead of the regex/replace_with combo you had, for the reasons above. The get_text method returns unicode, however, so I added the call to .encode('utf-8') to ensure the data can be written to the file. I also passed the strip=True argument to get rid of any pesky whitespace characters in the data. The reasoning behind all this: you already load the whole bs4 library, so why not use it? The good people who wrote it spent a lot of time taking care of parsing the text so you don't have to waste time doing it.
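As a quick invented example of the difference (this row is made up, not from the crime table):

import bs4

row = bs4.BeautifulSoup('<tr><td> Arson </td><td> 12 </td></tr>', 'html.parser').tr
print(row.get_text(','))                               # ' Arson , 12 '  -- unicode, padded with whitespace
print(row.get_text(',', strip=True))                   # 'Arson,12'
print(row.get_text(',', strip=True).encode('utf-8'))   # plain bytes, safe to write to the file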

Hope this was helpful! Happy scraping!

Farmer Joe