
I'm trying to build a simple scrape for the following URL:

http://www.bizjournals.com/milwaukee/datacenter/project-watch-what-is-being-built-in-milwaukee.html

What I'm trying to do is build a spreadsheet of all of the projects listed on the map by:

  • Project Title
  • Project Media (an image link is fine)
  • The project description

I've tried the following code, but it keeps returning no data (i.e., []), even if I specifically look for one value:

from lxml import html
import requests
page = requests.get('http://www.bizjournals.com/milwaukee/datacenter/project-watch-what-is-being-built-in-milwaukee.html')
tree = html.fromstring(page.content)
#This will create a list of project titles:
project = tree.xpath('//*[@id="m4n-0552-popup-1"]/div[2]/b')
print('Projects:', project)

I'm guessing the issue is that every time I load the page the ID changes (i.e. 0552 changes to a different 4-digit value).

Any suggestions?

Josh H.
  • Did you try inspecting page.content? Print it out and see if the projects are there at all. – slider Jun 13 '16 at 20:54
  • I did. All of the projects are there so it does seem like it should be possible. However, per PyNEwbie's answer below it looks like scraping is against their TOS. – Josh H. Jun 13 '16 at 21:58
  • another way around the issue of the 4 digit ID changing, is to download the file and extract the data from the local copy. – Sweet Burlap Jun 15 '16 at 00:35
  • @SweetBurlap Thanks! That makes a ton of sense! – Josh H. Jun 16 '16 at 19:02
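Following Sweet Burlap's suggestion, one way around the changing 4-digit ID is to save the page locally and match the popup divs by an ID prefix rather than the full value. A minimal sketch, assuming the `m4n-…-popup-…` ID pattern and popup markup implied by the XPath in the question (the real page may differ):

```python
from lxml import html

# Sample markup mimicking the popup structure implied by the question's
# XPath (//*[@id="m4n-0552-popup-1"]/div[2]/b); the real page may differ.
sample = """
<div id="m4n-0552-popup-1">
  <div><img src="project.jpg"/></div>
  <div><b>Project Title</b> Project description text.</div>
</div>
"""

# For a saved local copy, use html.parse('local_copy.html') instead.
tree = html.fromstring(sample)

# Match any popup div whose id starts with "m4n-" and contains "popup",
# so the random 4-digit part no longer matters.
titles = tree.xpath(
    '//*[starts-with(@id, "m4n-") and contains(@id, "popup")]/div[2]/b/text()'
)
print(titles)  # → ['Project Title']
```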

2 Answers


They think you are a bot and are not serving you the content.

Key lesson here - when you don't get what you expect, inspect what you get.

To get the text below, I just took the content and printed it.

>>> import requests
>>> page = requests.get('http://www.bizjournals.com/milwaukee/datacenter/project-watch-what-is-being-built-in-milwaukee.html')
>>> page_content = page.content  # note: page.content, not requests.content
>>> len(page_content)  # check the size first so printing won't freeze IDLE
4319
>>> print(page_content[:200])  # the content is 4319 characters; print the first 200 (the rest is below)
<!DOCTYPE html>
<html>






<head>
<title>Pardon Our Interruption</title>
<link rel="stylesheet" type="text/css"    href="//cdn.distilnetworks.com/css/distil.css" media="all">

I have been inspecting the source to try to figure out where the values placed on the map come from. I think the data is JSON, but I still can't identify how it is requested and delivered to the browser. I think you are going to have to define some headers to send with your request.

I have tried a few but have not yet been successful.

See this SO question: How to use Python requests to fake a browser visit?
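A minimal sketch of the header approach (the User-Agent string is only an example, and Distil-style detection may still block it; this just shows how to attach the headers and recognize the interstitial page):

```python
import requests

URL = ('http://www.bizjournals.com/milwaukee/datacenter/'
       'project-watch-what-is-being-built-in-milwaukee.html')

# Browser-like headers; without them, requests identifies itself as
# "python-requests/x.y", which bot-detection services flag immediately.
HEADERS = {
    'User-Agent': ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                   'AppleWebKit/537.36 (KHTML, like Gecko) '
                   'Chrome/51.0 Safari/537.36'),
}


def looks_blocked(page_text):
    """Detect the Distil 'Pardon Our Interruption' interstitial."""
    return 'Pardon Our Interruption' in page_text


# Build the request to show the headers it would send; to actually send
# it, call requests.get(URL, headers=HEADERS) and check looks_blocked()
# on the response's .text.
req = requests.Request('GET', URL, headers=HEADERS).prepare()
print(req.headers['User-Agent'])
```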

However, I did read their rules of usage, and they prohibit scraping their content. See http://acbj.com/privacy#V2:

copy, harvest, crawl, index, scrape, spider, mine, gather, extract, compile, obtain, aggregate, capture, or store any Content, including without limitation photos, images, text, music, audio, videos, podcasts, data, software, source or object code, algorithms, statistics, analysis, formulas, indexes, registries, repositories, or any other information available on or through the Service, including by an automated or manual process or otherwise, if we have taken steps to forbid, prohibit, or prevent you from doing so;

I think I was on track to a way to get the data, but I stopped after reading the link above.

Pardon Our Interruption...

As you were browsing http://www.bizjournals.com something about your browser made us think you were a bot. There are a few reasons this might happen:

  • You're a power user moving through this website with super-human speed.
  • You've disabled JavaScript in your web browser.
  • A third-party browser plugin, such as Ghostery or NoScript, is preventing JavaScript from running. Additional information is available in this support article: http://ds.tl/help-third-party-plugins

To request an unblock, please fill out the form below and we will review it as soon as possible.
PyNEwbie
  • Well there you go! I'm very new when it comes to python (and anything beyond CSS / HTML) so I have NO IDEA what I'm doing. Thanks a lot for looking into this! – Josh H. Jun 13 '16 at 21:56

As mentioned above, they apparently forbid scraping in their TOS, but they might be happy to help if you ask them about your use case.

For academic interest: all the map data comes from https://online.maps4news.com/ia/2.0/?id=351%2FBE9%2FB24870F6A499C237B88CB54F27. You can see it load in Chrome developer tools > Network > XHR; it's a JSON response containing the popup boxes' content and the map points.
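If that data were fair game, pulling the three fields the question asks for (title, image, description) out of each popup's HTML might look something like this. The markup below is a pure guess at the popup structure; inspect the actual JSON response in the Network tab and adjust the XPath expressions accordingly:

```python
from lxml import html


def parse_popup(popup_html):
    """Extract title, image URL, and description from one popup's HTML.

    The structure assumed here (bolded title, <img> for media) is only a
    guess based on the XPath in the question; adjust after inspecting
    the real response.
    """
    tree = html.fromstring(popup_html)
    title = tree.xpath('string(.//b)').strip()
    images = tree.xpath('.//img/@src')
    # Description: all of the popup's text minus the bolded title.
    description = tree.text_content().replace(title, '', 1).strip()
    return {
        'title': title,
        'image': images[0] if images else '',
        'description': description,
    }


# Hypothetical popup fragment for illustration:
row = parse_popup(
    '<div><img src="http://example.com/site.jpg">'
    '<b>New Office Tower</b> A 25-story development downtown.</div>'
)
print(row)
```

Each resulting dict maps directly onto one spreadsheet row.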

Sweet Burlap