
I am struggling to scrape some historical data with Selenium from https://coincodex.com/crypto/bitcoin/historical-data/. Somehow I fail at the following steps:

  1. Get the data from the subsequent pages (not only for September, which is page 1)
  2. Replace '$ ' with '$' for each value
  3. Convert values with a 'B' suffix (for billion) into full numbers (1B into 1000000000); a sketch follows this list
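
Steps 2 and 3 boil down to plain string transformations. A minimal sketch of what I mean (the helper names and the multiplier map are my own, not from any library):

def tidy_currency(raw):
    """Step 2: '$ 62,225' -> '$62,225'."""
    return raw.replace('$ ', '$')

def expand_suffix(raw):
    """Step 3: turn a value like '1B' or '82.73B' into a full number."""
    multipliers = {'B': 1_000_000_000, 'T': 1_000_000_000_000}
    number = raw.replace('$', '').replace(',', '').strip()
    if number and number[-1] in multipliers:
        return float(number[:-1]) * multipliers[number[-1]]
    return float(number)

print(tidy_currency('$ 62,225'))  # $62,225
print(expand_suffix('1B'))        # 1000000000.0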

The predefined task is: web-scrape all data from the beginning of the year until the end of September with Selenium and BeautifulSoup and transform it into a pandas DataFrame. My code so far is:

from selenium import webdriver
import pandas as pd  # needed for the DataFrame at the end
import time

URL = "https://coincodex.com/crypto/bitcoin/historical-data/"

driver = webdriver.Chrome(executable_path="/usr/local/bin/chromedriver")
driver.get(URL)
time.sleep(2)

webpage = driver.page_source

from bs4 import BeautifulSoup
# The page source fetched from the driver is parsed with BeautifulSoup.
HTMLPage = BeautifulSoup(webpage, 'html.parser')

Table = HTMLPage.find('table', class_='styled-table full-size-table')

Rows = Table.find_all('tr', class_='ng-star-inserted')
print(len(Rows))  # sanity check: number of rows found

# Empty list is created to store the data
extracted_data = []
# Loop through each row of the table
for i in range(len(Rows)):
    try:
        # Empty dictionary to store the data present in each row
        RowDict = {}
        # Extract all the columns of a row
        Values = Rows[i].find_all('td')

        # Values (Open, High, Close etc.) are extracted and stored in the dictionary
        if len(Values) == 7:
            RowDict["Date"] = Values[0].text.replace(',', '')
            RowDict["Open"] = Values[1].text.replace(',', '')
            RowDict["High"] = Values[2].text.replace(',', '')
            RowDict["Low"] = Values[3].text.replace(',', '')
            RowDict["Close"] = Values[4].text.replace(',', '')
            RowDict["Volume"] = Values[5].text.replace(',', '')
            RowDict["Market Cap"] = Values[6].text.replace(',', '')
            extracted_data.append(RowDict)
    except Exception:
        # Report rows that could not be parsed (the for loop advances i by itself)
        print("Row Number: " + str(i))

extracted_data = pd.DataFrame(extracted_data)
print(extracted_data)

Sorry, I'm new to Python and web scraping, and I hope someone can help me. It would be very much appreciated.


2 Answers


To extract the Bitcoin (BTC) historical data from all seven columns of the Coincodex page, you need to induce WebDriverWait for visibility_of_all_elements_located(), build a list for each column using a list comprehension, and then assemble the lists into a DataFrame, using the following locator strategies:

Code Block:

driver.get("https://coincodex.com/crypto/bitcoin/historical-data/")
headers = [my_elem.text for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "table.full-size-table th")))]
dates = [my_elem.text for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "table.full-size-table tr td:nth-child(1)")))]
opens = [my_elem.text.replace('\u202f', ' ') for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "table.full-size-table tr td:nth-child(2)")))]
highs = [my_elem.text.replace('\u202f', ' ') for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "table.full-size-table tr td:nth-child(3)")))]
lows = [my_elem.text.replace('\u202f', ' ') for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "table.full-size-table tr td:nth-child(4)")))]
closes = [my_elem.text.replace('\u202f', ' ') for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "table.full-size-table tr td:nth-child(5)")))]
volumes = [my_elem.text.replace('\u202f', ' ') for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "table.full-size-table tr td:nth-child(6)")))]
marketcaps = [my_elem.text.replace('\u202f', ' ') for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "table.full-size-table tr td:nth-child(7)")))]
df = pd.DataFrame(data=list(zip(dates, opens, highs, lows, closes, volumes, marketcaps)), columns=headers)
print(df)
driver.quit()

Note: You have to add the following imports:

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
import pandas as pd
    

Console Output:

            Date      Open      High       Low     Close     Volume Market Cap
0   Oct 30, 2021  $ 62,225  $ 62,225  $ 60,860  $ 61,661   $ 82.73B    $ 1.16T
1   Oct 31, 2021  $ 61,856  $ 62,379  $ 60,135  $ 61,340   $ 74.91B    $ 1.15T
2   Nov 01, 2021  $ 61,290  $ 62,368  $ 59,675  $ 61,065   $ 76.19B    $ 1.16T
3   Nov 02, 2021  $ 60,939  $ 64,071  $ 60,682  $ 63,176   $ 74.05B    $ 1.18T
4   Nov 03, 2021  $ 63,167  $ 63,446  $ 61,653  $ 62,941   $ 78.08B    $ 1.18T
5   Nov 04, 2021  $ 62,907  $ 63,048  $ 60,740  $ 61,368   $ 91.06B    $ 1.17T
6   Nov 05, 2021  $ 61,419  $ 62,480  $ 60,770  $ 61,026   $ 78.06B    $ 1.16T
7   Nov 06, 2021  $ 60,959  $ 61,525  $ 60,083  $ 61,416   $ 67.75B    $ 1.15T
8   Nov 07, 2021  $ 61,454  $ 63,180  $ 61,333  $ 63,180   $ 51.66B    $ 1.17T
9   Nov 08, 2021  $ 63,278  $ 67,670  $ 63,278  $ 67,500   $ 74.25B    $ 1.24T
10  Nov 09, 2021  $ 67,511  $ 68,476  $ 66,359  $ 66,913   $ 87.83B    $ 1.27T
11  Nov 10, 2021  $ 66,929  $ 68,770  $ 63,348  $ 64,871   $ 82.52B    $ 1.26T
12  Nov 11, 2021  $ 64,934  $ 65,580  $ 64,199  $ 64,800  $ 100.84B    $ 1.22T
13  Nov 12, 2021  $ 64,774  $ 65,380  $ 62,434  $ 64,315   $ 71.88B    $ 1.21T
14  Nov 13, 2021  $ 64,174  $ 64,850  $ 63,413  $ 64,471   $ 65.34B    $ 1.21T
15  Nov 14, 2021  $ 64,385  $ 65,255  $ 63,623  $ 65,255   $ 59.25B    $ 1.22T
16  Nov 15, 2021  $ 65,500  $ 66,263  $ 63,540  $ 63,716   $ 92.91B    $ 1.23T
17  Nov 16, 2021  $ 63,610  $ 63,610  $ 58,904  $ 60,190  $ 103.18B    $ 1.15T
18  Nov 17, 2021  $ 60,111  $ 60,734  $ 58,758  $ 60,339   $ 96.57B    $ 1.13T
19  Nov 18, 2021  $ 60,348  $ 60,863  $ 56,542  $ 56,749   $ 86.65B    $ 1.11T
20  Nov 19, 2021  $ 56,960  $ 58,289  $ 55,653  $ 58,047   $ 98.57B    $ 1.08T
21  Nov 20, 2021  $ 58,069  $ 59,815  $ 57,486  $ 59,815   $ 61.67B    $ 1.11T
22  Nov 21, 2021  $ 59,670  $ 59,845  $ 58,545  $ 58,681   $ 54.40B    $ 1.12T
23  Nov 22, 2021  $ 58,712  $ 59,061  $ 55,689  $ 56,370   $ 64.89B    $ 1.08T
24  Nov 23, 2021  $ 56,258  $ 57,832  $ 55,778  $ 57,673   $ 80.27B    $ 1.07T
25  Nov 24, 2021  $ 57,531  $ 57,694  $ 55,970  $ 57,103   $ 92.08B    $ 1.07T
26  Nov 25, 2021  $ 57,193  $ 59,333  $ 57,011  $ 58,907   $ 85.14B    $ 1.10T
27  Nov 26, 2021  $ 58,914  $ 59,120  $ 53,660  $ 53,664   $ 90.87B    $ 1.05T
28  Nov 27, 2021  $ 53,559  $ 55,204  $ 53,559  $ 54,487   $ 85.68B    $ 1.03T
29  Nov 28, 2021  $ 54,819  $ 57,315  $ 53,630  $ 57,159   $ 72.40B    $ 1.03T
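
Note that the values in this DataFrame are still strings like '$ 62,225' and '$ 82.73B'. A hedged cleanup sketch: the column names are taken from the output above, while the B/T multiplier values are an assumption of mine, not something the site defines:

# Cleanup sketch: column names come from the printed output above;
# the B/T multipliers are my own assumption.
for column in ["Open", "High", "Low", "Close", "Volume", "Market Cap"]:
    cleaned = df[column].str.replace('$', '', regex=False).str.replace(',', '', regex=False).str.strip()
    multiplier = cleaned.str[-1].map({'B': 1e9, 'T': 1e12}).fillna(1)
    df[column] = cleaned.str.rstrip('BT').astype(float) * multiplier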


Coincodex provides a query UI in which you can adjust the time range. After setting the start to the first of January and the end to the 30th of September and clicking the "Select" button, the site sends a GET request to the backend, using the endpoint https://coincodex.com/api/coincodexcoins/get_historical_data_by_slug/bitcoin/2021-1-1/2021-9-30/1?t=5459791. If you send a request to this URL, you get back all the data you need for this interval:

import requests
import pandas as pd

url = 'https://coincodex.com/api/coincodexcoins/get_historical_data_by_slug/bitcoin/2021-1-1/2021-9-30/1?t=5459791'
data = requests.get(url).json()
df = pd.DataFrame(data['data'])

Output:

              time_start             time_end  price_open_usd  ...  price_avg_ETH    volume_ETH  market_cap_ETH
0    2021-01-01 00:00:00  2021-01-02 00:00:00    28938.896888  ...      39.496780  8.728544e+07    7.341417e+08
1    2021-01-02 00:00:00  2021-01-03 00:00:00    29329.695772  ...      40.934106  9.351177e+07    7.608959e+08
2    2021-01-03 00:00:00  2021-01-04 00:00:00    32148.048500  ...      38.970510  1.448755e+08    7.244327e+08
3    2021-01-04 00:00:00  2021-01-05 00:00:00    32949.399464  ...      31.433580  1.292715e+08    5.843597e+08
4    2021-01-05 00:00:00  2021-01-06 00:00:00    32023.293433  ...      30.478852  1.186652e+08    5.666423e+08
..                   ...                  ...             ...  ...            ...           ...             ...
268  2021-09-26 00:00:00  2021-09-27 00:00:00    42670.363351  ...      14.438247  1.573066e+07    2.718238e+08
269  2021-09-27 00:00:00  2021-09-28 00:00:00    43204.962300  ...      14.157527  1.660821e+07    2.665518e+08
270  2021-09-28 00:00:00  2021-09-29 00:00:00    42111.843283  ...      14.439326  1.782125e+07    2.718712e+08
271  2021-09-29 00:00:00  2021-09-30 00:00:00    41004.598500  ...      14.510256  1.748895e+07    2.732201e+08
272  2021-09-30 00:00:00  2021-10-01 00:00:00    41536.594100  ...      14.454206  1.810257e+07    2.721773e+08

[273 rows x 23 columns]
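
If you only want the USD columns that mirror the table on the site, you can subset the response. Apart from time_start and price_open_usd, which appear in the output above, the column names below are guesses about the API response, hence the existence check:

# 'time_start' and 'price_open_usd' are visible in the output above; the
# other names are assumptions, so keep only the columns that really exist.
wanted = ['time_start', 'price_open_usd', 'price_high_usd',
          'price_low_usd', 'price_close_usd', 'volume_usd', 'market_cap_usd']
print(df[[c for c in wanted if c in df.columns]].head())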
  • Thank you for your answer - the problem is that we have to do it with Selenium and web scraping... – FCH1922 Nov 26 '21 at 16:16