I am working on an independent project where I want to scrape all historical data for a cryptocurrency and store it in a Python pandas DataFrame. I have identified the structure of the HTML page, and have the following code:

from bs4 import BeautifulSoup
import urllib3
import requests
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline


bitcoin_df = pd.DataFrame(columns = ['Date', 'Open', 'High', 'Low', 'Close', 'Volume', 'Market Cap'])

bitcoin_url = "https://coinmarketcap.com/currencies/bitcoin/historical-data/"
bitcoin_content = requests.get(bitcoin_url).text
bitcoin_soup = BeautifulSoup(bitcoin_content, "lxml")
#print(bitcoin_soup.prettify())

bitcoin_table = bitcoin_soup.find("table", attrs={"class": "h7vnx2-2 hLKazY cmc-table  "})
bitcoin_table_data = bitcoin_table.find_all("tr")

for tr in bitcoin_table_data:
    tds = tr.find_all("td")
    for td in tds:
        bitcoin_df.append({'Date': td[0].text, 'Open': td[1].text, 'High': td[2].text, 'Low': td[3].text, 'Close': td[4].text, 'Volume': td[5].text, 'Market Cap': td[6].text})

However, I encounter this error:

AttributeError                            Traceback (most recent call last)
<ipython-input-46-316341b6771b> in <module>
      7 
      8 bitcoin_table = bitcoin_soup.find("table", attrs={"class": "h7vnx2-2 hLKazY cmc-table  "})
----> 9 bitcoin_table_data = bitcoin_table.find_all("tr")
     10 
     11 #for tr in bitcoin_soup.find_all('tr'):

AttributeError: 'NoneType' object has no attribute 'find_all'

1 Answer

You are getting that error because the .find() call returned None, meaning it could not locate the table. The table is generated by JavaScript running in a browser, so it will not be present in the HTML that requests downloads.
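If you want to confirm this, a minimal check (a sketch reusing the variables from your question) is to guard the result of .find() before calling .find_all() on it:

bitcoin_table = bitcoin_soup.find("table", attrs={"class": "h7vnx2-2 hLKazY cmc-table  "})
if bitcoin_table is None:
    # The table is injected by JavaScript in the browser, so it is absent
    # from the HTML that requests.get() downloaded
    print("Table not found in the downloaded HTML")
else:
    bitcoin_table_data = bitcoin_table.find_all("tr")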

Rather than trying to parse the HTML, you could just request the data directly from their API (as the browser does). For example:

import pandas as pd
import requests
import time

# Current time as a Unix timestamp; the API takes timeStart/timeEnd in seconds.
# 5270400 seconds is 61 days, the same window the site itself requests.
ts = int(time.time())
json_url = f"https://api.coinmarketcap.com/data-api/v3/cryptocurrency/historical?id=1&convertId=2781&timeStart={ts - 5270400}&timeEnd={ts}"
json_req = requests.get(json_url)
json_data = json_req.json()
                                                            
data = []

# Each entry in 'quotes' holds a nested 'quote' dict with the OHLCV fields
for quote in json_data['data']['quotes']:
    data.append([
        quote['quote']['timestamp'],
        quote['quote']['open'],
        quote['quote']['high'],
        quote['quote']['low'],
        quote['quote']['close'],
        quote['quote']['volume'],
        quote['quote']['marketCap'],
    ])
    
df = pd.DataFrame(data, columns = ['Date', 'Open', 'High', 'Low', 'Close', 'Volume', 'Market Cap'])
print(df)

Which would give you a dataframe starting:

                        Date          Open          High           Low         Close        Volume    Market Cap
0   2021-09-13T23:59:59.999Z  46057.215327  46598.678985  43591.320785  44963.072633  4.096994e+10  8.459805e+11
1   2021-09-14T23:59:59.999Z  44960.049359  47218.125355  44752.331349  47092.493833  3.865215e+10  8.860953e+11
2   2021-09-15T23:59:59.999Z  47097.998123  48450.468466  46773.326543  48176.346393  3.048450e+10  9.065325e+11

This URL was found by watching the browser request the data in its own developer tools (the Network tab). I suggest you print(json_data) to see exactly what was returned.
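
If you then want the Date column as real timestamps rather than ISO strings, one option (a small sketch built on the df above, using the same column names) is pandas' to_datetime:

# Parse the ISO 8601 strings returned by the API into pandas datetimes
df['Date'] = pd.to_datetime(df['Date'])
# Optionally use the dates as a sorted index for time-series work
df = df.set_index('Date').sort_index()
print(df.head())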

Martin Evans
  • Thank you! I would upvote you, but I don't currently have enough points to do so apparently. Two follow-up questions, if I may. – Joseph_Koziol Nov 12 '21 at 15:36
  • 1. To confirm: I go to the inspection page, look at the Network tab, and see what requests are being sent out. I see there is a JSON request, copy the link, and that should direct me to the raw data? Now what if there is more than one JSON request, such as on this site https://www.asicminervalue.com/ – Joseph_Koziol Nov 12 '21 at 15:39
  • You're welcome! You can though click on the grey tick under the up/down buttons to select the answer as the accepted solution – Martin Evans Nov 12 '21 at 15:39
  • 2. More specific to the original question, what if I want to grab more historical data than just the 60 rows initially listed? I noticed there's a "load more" tab you can click to show more, however the only data loaded is what is currently showing on the screen – Joseph_Koziol Nov 12 '21 at 15:40
  • The site in your question was quite straightforward. It normally requires a lot more work to try and recreate the correct URLs, cookies etc to let it work – Martin Evans Nov 12 '21 at 15:40
  • a2. Their API takes a start and end time; I just went back the same amount of time that their own system did. In theory, just subtract a bigger time from `ts` (look at `json_url`); see the sketch after these comments. – Martin Evans Nov 12 '21 at 15:42
  • For sure. Thank you for your help, very much appreciated – Joseph_Koziol Nov 12 '21 at 15:49
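
Following up on that last point about fetching more history, here is a minimal sketch. It assumes the same endpoint and parameters used in the answer above, and that the API accepts an arbitrarily earlier timeStart; only the window size changes:

import requests
import time

ts = int(time.time())
one_year = 365 * 86400  # the answer above used 5270400 s (61 days)

# Same endpoint as in the answer, with timeStart pushed further back
json_url = (
    "https://api.coinmarketcap.com/data-api/v3/cryptocurrency/historical"
    f"?id=1&convertId=2781&timeStart={ts - one_year}&timeEnd={ts}"
)
json_data = requests.get(json_url).json()
print(len(json_data['data']['quotes']))  # should now be far more than ~60 entries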