
I'm trying to scrape data from a Highcharts chart at the following URL:

https://www.pricecharting.com/game/pal-nes/legend-of-zelda

The following code grabs the data from the default series, labelled 'Loose', but I need to extract the data for the other items listed on the chart (CIB, New, Graded, Boxed, Manual). I'm really stuck and not sure how to get the information from the other series.

Here's my code, which works for 'Loose' (the default option) only:

import time

from selenium import webdriver
from selenium.webdriver.firefox.service import Service  # Firefox Service to match geckodriver

service = Service(executable_path="/driver_selenium/geckodriver.exe")
driver = webdriver.Firefox(service=service)

website = "https://www.pricecharting.com/game/pal-nes/legend-of-zelda#completed-auctions-graded"


driver.get(website)
time.sleep(5)

temp = driver.execute_script('return window.Highcharts.charts[0]'
                             '.series[0].options.data')

data = [item[1] for item in temp]
print(data)

Bonus points if there's a better way to extract the data that does not use Selenium, something similar to the answer here: Scrape highchart into python

Lee Roy
  • Hi @Lee Roy, There are 6 series on the chart. You can get the needed one by changing series index, for example: `return window.Highcharts.charts[0].series[5].options.data` – ppotaczek Jul 29 '22 at 11:03
  • Thanks this is what I needed! I tried this before posting but it doesn't work with my code spread over two lines. Yours works flawlessly! If you create an answer I'll mark it as best answer. Thanks! – Lee Roy Jul 29 '22 at 13:18
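ppotaczek's suggestion above can be sketched as a loop over all six series indices. The helper below only builds the JavaScript snippets (the Selenium calls are shown in comments so the sketch runs without a browser); the order in which the series map to Loose, CIB, New, Graded, Boxed and Manual is not documented here, so check `series[i].name` on the live chart:

```python
# Sketch: build the execute_script snippet for each Highcharts series.
# The JS template mirrors the snippet from the question; only the
# series index changes between the six series on the chart.

SERIES_JS = "return window.Highcharts.charts[0].series[{index}].options.data"

def series_script(index: int) -> str:
    """Build the JS that pulls one series' [timestamp, price] pairs."""
    return SERIES_JS.format(index=index)

# With a live driver you would run, for each of the 6 series:
#   temp = driver.execute_script(series_script(i))
#   prices = [point[1] for point in temp]

for i in range(6):
    print(series_script(i))
```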

1 Answer


EDIT: We can try the following:

import requests
from bs4 import BeautifulSoup
import pandas as pd
import re
import json

url = 'https://www.pricecharting.com/game/pal-nes/legend-of-zelda'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')
# The chart data is embedded in an inline <script> as 'VGPC.chart_data = {...};'
data_script = soup.find('script', string=re.compile("VGPC.chart_data = {"))
raw = data_script.text.split('VGPC.chart_data = ')[1]
raw_json = raw.split(' VGPC.product = {')[0].split(';')[0].strip()
data = json.loads(raw_json)
df = pd.DataFrame(data)
print(df)

Which returns:

boxonly cib graded  manualonly  new used
0   [1498888800000, 0]  [1498888800000, 17004]  [1498888800000, 0]  [1498888800000, 0]  [1498888800000, 0]  [1498888800000, 6510]
1   [1501567200000, 0]  [1501567200000, 7878]   [1501567200000, 0]  [1501567200000, 0]  [1501567200000, 23100]  [1501567200000, 5409]
2   [1504245600000, 0]  [1504245600000, 8879]   [1504245600000, 0]  [1504245600000, 0]  [1504245600000, 23100]  [1504245600000, 5169]
3   [1506837600000, 0]  [1506837600000, 8432]   [1506837600000, 0]  [1506837600000, 0]  [1506837600000, 37665]  [1506837600000, 4499]
4   [1509516000000, 0]  [1509516000000, 9286]   [1509516000000, 0]  [1509516000000, 0]  [1509516000000, 37665]  [1509516000000, 4513]
... ... ... ... ... ... ...

This dataframe contains the 6 series of historical prices used in that Highcharts graph (prices appear to be in cents). This method avoids the overhead of Selenium and a browser driver entirely.
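The chained `split` calls work, but a regex that captures the JSON object directly is a possible alternative. Sketched here against an inline sample shaped like the page's script tag, since the live HTML may change and a nested `};` inside the JSON would break the non-greedy match:

```python
import json
import re

# Inline sample mimicking the structure of the page's inline script.
sample_script = """
VGPC.chart_data = {"used": [[1498888800000, 6510]], "new": [[1498888800000, 0]]};
VGPC.product = {"id": 12345};
"""

# Capture everything between 'VGPC.chart_data = ' and the terminating ';'.
match = re.search(r"VGPC\.chart_data\s*=\s*(\{.*?\});", sample_script, re.DOTALL)
chart_data = json.loads(match.group(1))
print(chart_data["used"])  # [[1498888800000, 6510]]
```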

We can now look at an individual series, for example 'new':

df_new = pd.DataFrame(data['new'], columns = ['Date_time', 'Price'])
df_new['Date_time'] = pd.to_datetime(df_new['Date_time'], unit="ms")
print(df_new)

And the result is:

Date_time   Price
0   2017-07-01 06:00:00 0
1   2017-08-01 06:00:00 23100
2   2017-09-01 06:00:00 23100
3   2017-10-01 06:00:00 37665
4   2017-11-01 06:00:00 37665
... ... ...
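Since the prices appear to be in cents, a final tidy-up might divide by 100. This is sketched on a small inline sample shaped like one series of the chart data (the `Price_usd` column name and the cents assumption are mine, not the site's):

```python
import pandas as pd

# Inline sample shaped like one series of the chart data: [ms timestamp, price].
sample = [[1501567200000, 23100], [1504245600000, 23100]]

df_new = pd.DataFrame(sample, columns=["Date_time", "Price"])
df_new["Date_time"] = pd.to_datetime(df_new["Date_time"], unit="ms")
df_new["Price_usd"] = df_new["Price"] / 100  # cents -> dollars (assumption)
print(df_new)
```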
Barry the Platipus
  • Thank you for the reply. Unfortunately those tables only show the last 30 sales. The data contained in the Highcharts JS contains the last five years of data (which is what I need). Thank you though! – Lee Roy Jul 29 '22 at 13:16
  • You're welcome. I have edited my answer, and I think it now pulls the data you're after. – Barry the Platipus Jul 29 '22 at 13:45