I am trying to web scrape from "https://data.lacity.org/A-Safe-City/Crime-Data-from-2010-to-Present/y8tr-7khq". Specifically, under the div with class "socrata-table frozen-columns", I want all of the data-column names and data-column descriptions. However, the code I've written doesn't seem to be working (it's not returning anything):

import requests
from bs4 import BeautifulSoup

url = "https://data.lacity.org/A-Safe-City/Crime-Data-from-2010-to-Present/y8tr-7khq"
page = requests.get(url)
print(page.status_code)
soup = BeautifulSoup(page.content, 'html.parser')

for col in soup.find_all("div", attrs={"class": "socrata-visualization-container loaded"})[0:1]:
    for tr in col.find_all("div", attrs={"class": "socrata-table frozen-columns"}):
        for data in tr.find_all("div", attrs={"class": "column-header-content"}):
            print(data.text)

Is my code wrong?

judebox

3 Answers


The page is loaded dynamically and the data set is paged, which would mean using browser automation to retrieve it, and that is slow. There is an API you can use instead; it has parameters that allow you to return results in batches.

Read the API documentation here. This is going to be a much more efficient and reliable way of retrieving the data.

Use the $limit parameter to determine how many records are retrieved at a time; use the $offset parameter to set the starting point of the next batch of records. Example call here.

As it is a query, you can tailor the other parameters as you would a SQL query to retrieve the desired result set. This also means you can write a very quick initial query to return the record count from the database, which you can use to determine the end point for your batch requests.
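
For example, the count query might look something like this (a sketch, not tested against this endpoint: the $select=count(*) query and the 'count' key in the response are assumptions based on the SoQL documentation):

import requests

# Ask the API for the total row count before paging through batches
base = 'https://data.lacity.org/api/id/y8tr-7khq.json'
response = requests.get(base, params={'$select': 'count(*)'})
total = int(response.json()[0]['count'])  # key name may differ by API version
print(total)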

You could also write a class-based script that uses multiprocessing to grab these batches more efficiently; see the sketch at the end of this answer.

import requests
import pandas as pd

# Pull the first 100 records, ordered by occurrence date descending
response = requests.get('https://data.lacity.org/api/id/y8tr-7khq.json?$select=`dr_no`,`date_rptd`,`date_occ`,`time_occ`,`area_id`,`area_name`,`rpt_dist_no`,`crm_cd`,`crm_cd_desc`,`mocodes`,`vict_age`,`vict_sex`,`vict_descent`,`premis_cd`,`premis_desc`,`weapon_used_cd`,`weapon_desc`,`status`,`status_desc`,`crm_cd_1`,`crm_cd_2`,`crm_cd_3`,`crm_cd_4`,`location`,`cross_street`,`location_1`&$order=`date_occ`+DESC&$limit=100&$offset=0')
data = response.json()
df = pd.json_normalize(data)  # flatten the JSON records into a DataFrame
print(df)

Example record in JSON response: [screenshot omitted]
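
As a rough illustration of the batching idea mentioned above (a minimal sketch, not code from this answer: the thread-pool approach, the batch size, and the ~1.8 million row total from the comments below are all assumptions):

import requests
import pandas as pd
from multiprocessing.dummy import Pool  # thread pool; fine for I/O-bound requests

BASE = 'https://data.lacity.org/api/id/y8tr-7khq.json'
LIMIT = 50000  # records per batch; an assumed value

def fetch_batch(offset):
    # Each worker pulls one window of records using $limit/$offset.
    # A stable $order matters so the windows don't skip or duplicate rows.
    params = {'$order': '`date_occ` DESC', '$limit': LIMIT, '$offset': offset}
    return requests.get(BASE, params=params).json()

if __name__ == '__main__':
    offsets = range(0, 1_800_000, LIMIT)  # ~1.8M rows per the comments
    with Pool(4) as pool:
        batches = pool.map(fetch_batch, offsets)
    df = pd.json_normalize([rec for batch in batches for rec in batch])
    print(len(df))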

QHarr
  • oh whoa, this is so much easier, I think I will change to using the API, thank you so much. May I ask how the offset parameter works? The dataset has over 1.8 million rows; if I limit each batch to 100,000, does it mean offset = 18? – judebox Nov 25 '18 at 18:57
  • thanks again. I've accepted your answer even though it wasn't a direct answer to my original question. I will read up on the API, and I think I understand how the offset works (i.e. the starting point). Unfortunately, I am still incapable of writing a class-based script. – judebox Nov 25 '18 at 19:14
  • Check the documentation. My guess is it specifies the start position for record retrieval. So, if you have retrieved 1000 records in the first run, you specify 1001 to start at the next record. You should verify this, but that is often the logic. – QHarr Nov 25 '18 at 19:14
  • You can always write a loop to process in batches. You can loop over a list (of lists?) that contains the various parameters you insert into your main request URL. I will update the answer with what I should have specified re dynamic loading. – QHarr Nov 25 '18 at 19:16
  • Yep, offset is the starting point for the next batch. Thanks again! – judebox Nov 25 '18 at 19:24

If you look into the page source (Ctrl+U), you'll notice that there is no such element as <div class="socrata-table frozen-columns">. That's because the content you want to scrape is added to the page dynamically. Check out these questions: web scraping dynamic content with python or Web scraping a website with dynamic javascript content.
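
You can confirm this quickly with requests (a small sketch; it assumes the class string doesn't also happen to appear inside the page's JavaScript bundles):

import requests

url = "https://data.lacity.org/A-Safe-City/Crime-Data-from-2010-to-Present/y8tr-7khq"
html = requests.get(url).text
# Expected to print False: the table is rendered client-side, so the class
# is absent from the raw HTML that requests receives
print("socrata-table frozen-columns" in html)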

Adrian

This is because the data is filled in dynamically by ReactJS after the page loads.

If you download the page via requests, you can't see the data.

You need to use a Selenium WebDriver to open the page and let all the JavaScript execute. Then you can get the data you expect.
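
Something along these lines, for example (a minimal sketch using the Selenium 4 API; the 20-second timeout and the selectors borrowed from the question are assumptions):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

url = "https://data.lacity.org/A-Safe-City/Crime-Data-from-2010-to-Present/y8tr-7khq"

driver = webdriver.Chrome()  # assumes a Chrome driver is available
try:
    driver.get(url)
    # Wait until React has rendered the table the question was looking for
    WebDriverWait(driver, 20).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, "div.socrata-table.frozen-columns"))
    )
    for header in driver.find_elements(By.CSS_SELECTOR, "div.column-header-content"):
        print(header.text)
finally:
    driver.quit()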

Vishnudev Krishnadas