
I am trying to create a CSV, on a daily basis, from a specific website's table: https://lunarcrush.com/exchanges

I've tried to follow every piece of advice from the related topics here (e.g. How to extract tables from websites in Python, Python Extract Table from URL to csv, extract a html table data to csv file, and many more).

I thought my initial problem was that I didn't have the table id (as in other examples); I only found the table class name MuiTable-root. But after a little more digging I found out that whenever I read the URL, the HTML I was getting was completely different from the one I see when I use Inspect in my browser.

I've tried almost everything I found here, so I am not sure it helps to quote every single piece of code. As an example, here is one attempt I was trying to make work. The idea is simple: find the tr elements of the table, get the th (header) and td (data) cells, and then export them to a CSV (a sketch of that last step follows the code below).

from lxml import etree
import urllib.request

web = urllib.request.urlopen("https://lunarcrush.com/exchanges")
s = web.read()

html = etree.HTML(s)

## Get all 'tr'
tr_nodes = html.xpath('//table[@class="MuiTableHead-root"]/tr')

## 'th' is inside first 'tr'
header = [i[0].text for i in tr_nodes[0].xpath("th")]

## Get text from rest all 'tr'
td_content = [[td.text for td in tr.xpath('td')] for tr in tr_nodes[1:]]

print(td_content)
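
For reference, the CSV step I had in mind would be something like this minimal sketch, continuing from the code above and assuming header and td_content actually come back non-empty (the exchanges.csv filename is just a placeholder):

import csv

## Write the extracted header and data rows to a CSV file
with open("exchanges.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(header)
    writer.writerows(td_content)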

Any ideas? I am sorry for my long (and maybe silly) question; I am just starting to use Python, and there is still a lot to learn!

Maria P.
  • It looks like the site is using JavaScript; you need a tool like Selenium. – buran Sep 21 '21 at 13:39
  • After further inspection of all the requests the browser makes, there is a URL https://api2.lunarcrush.com/v2/assets?data=exchanges&key=eadlcmjl2yeabmluwcogxd that returns the table data as JSON (see the sketch after this comment thread). However, you need to look further and see whether you can extract the `key` parameter; I expect it will change – buran Sep 21 '21 at 13:45
  • I'll have a look at it! Thank you! – Maria P. Sep 21 '21 at 13:51
  • how did you extract this, so I can learn how to do it? – Maria P. Sep 21 '21 at 13:52
  • Look at the `Network` tab in the Developer tools in your browser. Switch to the Network tab and refresh the page, so that you can see all the info – buran Sep 21 '21 at 13:55
  • Am I looking for a json file, there? – Maria P. Sep 21 '21 at 14:16
  • in this case it says `plain`, which is a bit unusual. However, note that the URL has a `key` parameter and I expect it will change over time/session, so Selenium is probably the better approach to explore – buran Sep 21 '21 at 14:17
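
A minimal sketch of the API approach described in the comments above, assuming a currently valid key has been copied by hand from the browser's Network tab (the key parameter is expected to change, so this will likely stop working without a fresh one, and the exact JSON layout is an assumption to verify):

import csv
import requests

## key copied by hand from the browser's Network tab; it is expected to change over time/session
url = "https://api2.lunarcrush.com/v2/assets"
params = {"data": "exchanges", "key": "PASTE_A_CURRENT_KEY_HERE"}

response = requests.get(url, params=params)
response.raise_for_status()
payload = response.json()

## the field name "data" is a guess; inspect the response and adjust as needed
rows = payload.get("data", [])
if rows:
    with open("exchanges.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=rows[0].keys())
        writer.writeheader()
        writer.writerows(rows)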

1 Answer


Use selenium to load the page and pandas to collect the data into a dataframe and export it to a CSV.

You can install them from the terminal by typing:

pip install pandas
pip install selenium
pip install webdriver-manager

More information about selenium can be found at https://selenium-python.readthedocs.io/ and about pandas at https://pandas.pydata.org/docs/

More info about the installation and the drivers can be found at https://selenium-python.readthedocs.io/installation.html

Usually you would download the driver yourself and set it up as described in the documentation, but webdriver_manager does this automatically:

driver = webdriver.Chrome(ChromeDriverManager().install())

In the code you need to import the packages you installed, i.e. selenium, webdriver_manager and pandas:

from selenium import webdriver 
from webdriver_manager.chrome import ChromeDriverManager
import pandas as pd

You instantiate your webdriver (Chrome, controlled by Python) in the variable "driver". Then you locate the table rows with driver.find_elements_by_xpath and extract the text (or another attribute, when necessary) from each element:

xpath = '//tbody//tr'
row = [i.text for i in driver.find_elements_by_xpath(xpath)]

Finally, you make a list out of the content you find, put it in a dictionary, build a pandas dataframe from it, and export it to your CSV file:

dictionary = {'row ': row}
df = pd.DataFrame(dictionary)
df.to_csv("filename")

The whole thing should look like this:

from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
import pandas as pd

driver = webdriver.Chrome(ChromeDriverManager().install())
driver.get("https://lunarcrush.com/exchanges")

xpath = '//tbody//tr'
row = [i.text for i in driver.find_elements_by_xpath(xpath)]

# notice that .text is not always what you need; it depends on the html element/attribute.
# In your case you may also want the header and the index, and to split each row into
# columns (see .str.split in pandas); a fuller sketch of that follows this block.
dictionary = {'row ': row}
df = pd.DataFrame(dictionary)
df.to_csv("filename")
Ziur Olpa
  • Thanks for the reply!! But I agree! I am trying to find out how to resolve those problems; I am just reading about selenium now – Maria P. Sep 21 '21 at 15:33
  • I hope it is better now – Ziur Olpa Sep 21 '21 at 15:47
  • Indeed it is much better, but I am still struggling to run the code. `python script.py` returns `ModuleNotFoundError: No module named 'selenium'` even though I tried to install it. `python3 script.py` gives me `NameError: name 'ChromeDriverManager' is not defined` – Maria P. Sep 21 '21 at 15:54
  • First do, at the terminal: pip install selenium. In ipython/jupyter you should write an exclamation mark first: "!pip install selenium" – Ziur Olpa Sep 21 '21 at 15:55
  • I forgot one, also add: pip install webdriver-manager – Ziur Olpa Sep 21 '21 at 16:02
  • I've done that, I know how to install in Python, but I still get the same error. `Requirement already satisfied: selenium in /usr/local/lib/python3.7/site-packages (3.141.0) Requirement already satisfied: urllib3 in /home/taleporos/.local/lib/python3.7/site-packages (from selenium) (1.26.6)` – Maria P. Sep 21 '21 at 16:02
  • `pip install webdriver-manager`: I've done it too... still the same! I've looked a lot on Stack Overflow about this error; I think I've tried most of the suggestions – Maria P. Sep 21 '21 at 16:03
  • Add to the code: from webdriver_manager.chrome import ChromeDriverManager. I just edited my post, so you can copy and paste. Basically you need to import every function that you use in your code, if it is not bare Python. – Ziur Olpa Sep 21 '21 at 16:04
  • ffs... I am going to go crazy... after `pip install pandas`, `pip install selenium`, `pip install webdriver-manager` and the changes you've suggested, I am still getting `File "script.py", line 1, in from selenium import webdriver ModuleNotFoundError: No module named 'selenium'`. I feel stupid... I am sorry! – Maria P. Sep 21 '21 at 16:27
  • pip3 install selenium; it looks like you have a mess with Python. Use python with pip, or python3 with pip3. Alternatively, you can use conda environments to avoid such a mess. – Ziur Olpa Sep 21 '21 at 16:32
  • If python3 script.py complains that 'ChromeDriverManager' is not defined, it means that selenium is already installed there, so pip3 install webdriver-manager and then python3 script.py – Ziur Olpa Sep 21 '21 at 16:33
  • Still nothing... so yeah, it seems I have a huge Python mess... I'll try to clean everything and install it again from scratch. I hope I'll manage to run your suggestion – Maria P. Sep 21 '21 at 16:47
  • One last question: is it possible to use a Firefox driver? I am guessing `driver = webdriver.Firefox(executable_path=GeckoDriverManager().install())` after installing GeckoDriver, right? – Maria P. Sep 21 '21 at 16:55
  • Sure, it is possible, but if you already have geckodriver you can forget about the driver manager. This works, for example (written out as a sketch at the end of this thread): options = webdriver.FirefoxOptions(); options.add_argument('--ignore-certificate-errors'); options.add_argument('--incognito'); driver = webdriver.Firefox(options=options) – Ziur Olpa Sep 21 '21 at 18:26
  • It doesn't seem to work! I've tried your code @PABLO. I am not sure how your dictionary works: `dictionary = {'row ': row}`. It just prints `,row`, so I am guessing it doesn't work properly... – taleporos Sep 22 '21 at 14:50
  • My answer was not extensive; I assumed that the xpath in the original question was OK (it wasn't). Now I've corrected it and it will return the table. You may still want to get the header and index... that won't be hard. If you put those lines here I will update it again. – Ziur Olpa Sep 22 '21 at 16:17
  • Oh I see, OK! But I am still not getting the table with the updated code! Does it work for you? – taleporos Sep 22 '21 at 16:56
  • @PABLORUIZCUEVAS I confirm it! I've managed to make it run, but it prints only the word `, row` with no table (with both the previous and the updated code). I am still trying to make it work – Maria P. Sep 22 '21 at 17:07
  • It works flawlessly on my computer, Python 3.8, google-chrome version 93.0.4577. I'm afraid I can't help you any more; please do your own debugging. If it returns ", row" it means that no data is being loaded... try import time; time.sleep(1) before the "find", or look for an xpath that works in your browser, or just use my version of Chrome (not Firefox). – Ziur Olpa Sep 22 '21 at 17:22
  • I'll try all of them! Thank you very much! One last question, which may solve the case: how do you get the right xpath? – Maria P. Sep 22 '21 at 17:28
  • F12 in Chrome, the inspector, use the selection tool, and a bit of trial and error. Read https://selenium-python.readthedocs.io/locating-elements.html; in general, basic knowledge of web software will be useful. – Ziur Olpa Sep 22 '21 at 17:39
  • edit: yeah, you are right! It works like a charm after time.sleep(5), but it might help if you explain how to get the xpath correctly, for future work and a more complete explanation! – Maria P. Sep 22 '21 at 17:43
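
For reference, the Firefox variant from the comments above, written out as a readable sketch. It assumes geckodriver is fetched by webdriver_manager's GeckoDriverManager; if geckodriver is already on your PATH, webdriver.Firefox(options=options) alone is enough:

from selenium import webdriver
from webdriver_manager.firefox import GeckoDriverManager

# options mentioned in the comments above
options = webdriver.FirefoxOptions()
options.add_argument('--ignore-certificate-errors')
options.add_argument('--incognito')

# GeckoDriverManager downloads geckodriver automatically; drop executable_path
# and use webdriver.Firefox(options=options) if geckodriver is already on PATH
driver = webdriver.Firefox(executable_path=GeckoDriverManager().install(), options=options)
driver.get("https://lunarcrush.com/exchanges")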