Scraping Data from a website which uses Power BI - retrieving data from Power BI on a website

Question

I want to scrape data from this page (and pages similar to it): https://cereals.ahdb.org.uk/market-data-centre/historical-data/feed-ingredients.aspx

This page uses Power BI. Unfortunately, finding a way to scrape Power BI is hard, because everyone wants to scrape using/into Power BI, not from it. The closest answer was this question. Yet unrelated.

Firstly, I used Apache tika, and soon I realized the table data is been loading after loading the page. I need the rendered version of the page.

Therefore, I used Selenium. I wanted to Select All at the begining (sending Ctrl+A key combination), but it doesn't work. Maybe it is restricted by the page events (I also tried to remove all the events using developer tools, yet still Ctrl+A doesn't work.

I also tried to read the HTML contents, but Power BI puts div elements on the screen using position:absolute and distinguishing the location of a div in the table (both row and column) is an effortful activity.

Since Power BI uses JSON, I tried to read data from there. However it is so complicated I couldn't find out the rules. It seems it puts keywords somewhere and uses their indices in the table.

Note: I realized that all of the data is not loaded and even shown at the same time. A div of class scroll-bar-part-bar is responsible to act as a scroll bar, and moving that loads/shows other parts of the data.

The code I used to read data is as follows. As mentioned, the order of the produced data differs from what is rendered on the browser:

from selenium import webdriver
from selenium.webdriver.common.keys import Keys

options = webdriver.ChromeOptions()
options.binary_location = "C:/Program Files (x86)/Google/Chrome/Application/chrome.exe"
driver = webdriver.Chrome(options=options, executable_path="C:/Drivers/chromedriver.exe")

driver.get("https://app.powerbi.com/view?r=eyJrIjoiYjVjM2MyNjItZDE1Mi00OWI1LWE5YWYtODY4M2FhYjU4ZDU1IiwidCI6ImExMmNlNTRiLTNkM2QtNDM0Ni05NWVmLWZmMTNjYTVkZDQ3ZCJ9")
parent = driver.find_element_by_xpath('//*[@id="pvExplorationHost"]/div/div/div/div[2]/div/div[2]/div[2]/visual-container[4]/div/div[3]/visual/div')
children = parent.find_elements_by_xpath('.//*')
values = [child.get_attribute('title') for child in children]

I appreciate solutions for any of the above problems. The most interesting for me though, is the convention of storing Power BI data in JSON format.

@user308827 honestly you probably should have created your own question for the bounty, scraping from map seems like a different process. Alternatively you can just make a web request to obtain the map data manually probably — johnny 5, Feb 05 '20 at 22:45

score 8 · Accepted Answer · answered Mar 13 '19 at 18:33

Putting the scroll part and the JSON aside, I managed to read the data. The key is to read all of the elements inside the parent (which is done in the question):

parent = driver.find_element_by_xpath('//*[@id="pvExplorationHost"]/div/div/div/div[2]/div/div[2]/div[2]/visual-container[4]/div/div[3]/visual/div')
children = parent.find_elements_by_xpath('.//*')

Then sort them using their location:

x = [child.location['x'] for child in children]
y = [child.location['y'] for child in children]
index = np.lexsort((x,y))

To sort what we have read in different lines, this code may help:

rows = []
row = []
last_line = y[index[0]]
for i in index:
    if last_line != y[i]:
        row.append[children[i].get_attribute('title')]
    else:
        rows.append(row)
        row = list([children[i].get_attribute('title')]
rows.append(row)

score 4 · Answer 2 · edited May 03 '21 at 20:02

A few more details about exactly which data you are trying to scrape would have helped to construct a canonical answer. However, to scrape the data within the Commodity and Basis using Selenium, as the the desired element is within an <iframe> so you have to:

Induce WebDriverWait for the desired frame_to_be_available_and_switch_to_it().
Induce WebDriverWait for the desired visibility_of_element_located() for the table.
Induce WebDriverWait for the desired visibility_of_all_elements_located() for the desired data.
You can use the following Locator Strategies:

Code Block:

     from selenium import webdriver
     from selenium.webdriver.common.by import By
     from selenium.webdriver.support.ui import WebDriverWait
     from selenium.webdriver.support import expected_conditions as EC

     options = webdriver.ChromeOptions() 
     options.add_argument("start-maximized")
     options.add_experimental_option("excludeSwitches", ["enable-automation"])
     options.add_experimental_option('useAutomationExtension', False)
     driver = webdriver.Chrome(options=options, executable_path=r'C:\Utility\BrowserDrivers\chromedriver.exe')
     driver.get("https://ahdb.org.uk/cereals-oilseeds/feed-ingredient-prices")
     WebDriverWait(driver, 20).until(EC.frame_to_be_available_and_switch_to_it((By.TAG_NAME,"iframe")))
     WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "div.innerContainer")))
     print("Commodity:")
     print([my_elem.text for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, "//div[@class='pivotTableCellWrap cell-interactive tablixAlignLeft ' and starts-with(@title, 'Ex-')]//parent::div//preceding::div[1]")))])
     print("-=-=-=-=-=-")
     print("Basis:")
     print([my_elem.text for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "div.pivotTableCellWrap.cell-interactive.tablixAlignLeft[title^='Ex-']")))])

Console Output:

     Commodity:
     ['Argentine Sunflowermeal 32/33%', 'Maize Gluten Feed', 'Pelleted Wheat Feed', 'Rapemeal (34%)', 'Soyameal (Hi Pro)', 'Soyameal, Brazilian (48%)']
     -=-=-=-=-=-
     Basis:
     ['Ex-Store Liverpool', 'Ex-Store Liverpool', 'Ex-Mill Midlands and Southern Mills', 'Ex-Mill Erith', 'Ex-Store East Coast', 'Ex-Store Liverpool']

Update (as per bounty explanation)

As per your comment as well as the given link on the bounty explanation, to scrape the data from Page 2 within the table under the heading Scouting Location using Selenium, you can use the following solution. For the sake of demonstration I have created a List of first 20 countries and you can expand as much as you wish:

Code Block:

  from selenium import webdriver
  from selenium.webdriver.support.ui import WebDriverWait
  from selenium.webdriver.common.by import By
  from selenium.webdriver.support import expected_conditions as EC

  options = webdriver.ChromeOptions() 
  options.add_argument("start-maximized")
  options.add_experimental_option("excludeSwitches", ["enable-automation"])
  options.add_experimental_option('useAutomationExtension', False)
  driver = webdriver.Chrome(options=options, executable_path=r'C:\WebDrivers\chromedriver.exe')
  driver.get("https://app.powerbi.com/view?r=eyJrIjoiMzE1ODNmYzQtMWZhYS00NTNjLTg1MDUtOTQ2MGMyNDVkZTY3IiwidCI6IjE2M2FjNDY4LWFiYjgtNDRkMC04MWZkLWQ5ZGIxNWUzYWY5NiIsImMiOjh9")
  WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, "//span[@class='navigation-wrapper navigation-wrapper-big']//i[@title='Next Page']"))).click()
  print("Country:")
  print([my_elem.text for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, "//div[@class='bodyCells']//div[@class='pivotTableCellWrap cell-interactive ']")))[:20]])
  driver.quit()

Console Output:

  DevTools listening on ws://127.0.0.1:49438/devtools/browser/1b5a2590-5a90-47fd-93c7-cfcf58a6c241
  Country:
  ['Myanmar', 'Myanmar', 'Mozambique', 'Malawi', 'Malawi', 'Mozambique', 'Malawi', 'Malawi', 'Malawi', 'Malawi', 'Malawi', 'Malawi', 'Malawi', 'Malawi', 'Malawi', 'Malawi', 'Malawi', 'Myanmar', 'Myanmar', 'Myanmar']

Console Output Snapshot:

thanks @DebanjanB, the data I am trying to scrape is here: https://app.powerbi.com/view?r=eyJrIjoiMzE1ODNmYzQtMWZhYS00NTNjLTg1MDUtOTQ2MGMyNDVkZTY3IiwidCI6IjE2M2FjNDY4LWFiYjgtNDRkMC04MWZkLWQ5ZGIxNWUzYWY5NiIsImMiOjh9 It is in the form of a table on sections 2 and 3 on that link — user308827, Feb 07 '20 at 00:38
@user308827 Checkout the updated answer and let me know the status. — undetected Selenium, Feb 07 '20 at 10:53
when I run that code, it opens a browser with the site but does not print anything — user308827, Feb 08 '20 at 00:02
@user308827 Checkout the updated answer and let me know the status. — undetected Selenium, Feb 08 '20 at 21:56
thanks @DebanjanB, why are only the countries showing and not the rest of the table? — user308827, Feb 09 '20 at 00:14
this is awesome thanks! how would you handle a large table that you need to scroll through? — Umar.H, Apr 06 '20 at 11:03
@Datanovice Can you please raise a new question with the relevant HTML. I will be happy to take it forward. — undetected Selenium, Apr 06 '20 at 20:20
@Datanovice did you ever figure out how to do this? `this is awesome thanks! how would you handle a large table that you need to scroll through? – Datanovice` — runnerpaul, May 12 '20 at 18:56

Scraping Data from a website which uses Power BI - retrieving data from Power BI on a website

2 Answers2

Update (as per bounty explanation)

Linked

Related