I want to scrape only the codes from the table below using Python.

From the table, I only want to scrape CPT, CTC, PTC, STC, SPT, HTC, P5TC, P1A, P2A, P3A, P1E, P2E, P3E. These codes may change from time to time, for example through the addition of P4E or the removal of P1E.

The HTML for the table above is:

<table class="list">
   <tbody>
      <tr>
         <td>
            <p>PRODUCT<br>DESCRIPTION</p>
         </td>
         <td>
            <p><strong>Time Charter:</strong> CPT, CTC, PTC, STC, SPT, HTC, P5TC<br><strong>Time Charter Trip:</strong> P1A, P2A, P3A,<br>P1E, P2E, P3E</p>
         </td>
         <td><strong>Voyage: </strong>C3E, C4E, C5E, C7E</td>
      </tr>
      <tr>
         <td>
            <p>CONTRACT SIZE</p>
            <p></p>
         </td>
         <td>
            <p>1 day</p>
         </td>
         <td>
            <p>1,000 metric tons</p>
         </td>
      </tr>
      <tr>
         <td>
            <p>MINIMUM TICK</p>
            <p></p>
         </td>
         <td>
            <p>US$ 25</p>
         </td>
         <td>
            <p>US$ 0.01</p>
         </td>
      </tr>
      <tr>
         <td>
            <p>FINAL SETTLEMENT PRICE</p>
            <p></p>
         </td>
         <td colspan="2" rowspan="1">
            <p>The floating price will be the end-of-day price as supplied by the Baltic Exchange.</p>
            <p><br><strong>All products:</strong> Final settlement price will be the mean of the daily Baltic Exchange spot price assessments for every trading day in the expiry month.</p>
            <p><br><strong>Exception for P1A, P2A, P3A:</strong> Final settlement price will be the mean of the last 7 Baltic Exchange spot price assessments in the expiry month.</p>
         </td>
      </tr>
      <tr>
         <td>
            <p>CONTRACT SERIES</p>
         </td>
         <td colspan="2" rowspan="1">
            <p><strong><strong>CTC, CPT, PTC, STC, SPT, HTC, P5TC</strong>:</strong> Months, quarters and calendar years out to a maximum of 72 months</p>
            <p><strong>C3E, C4E, C5E, C7E, P1A, P2A, P3A, P1E, P2E, P3E:</strong> Months, quarters and calendar years out to a maximum of 36 months</p>
         </td>
      </tr>
      <tr>
         <td>
            <p>SETTLEMENT</p>
         </td>
         <td colspan="2" rowspan="1">
            <p>At 13:00 hours (UK time) on the last business day of each month within the contract series</p>
         </td>
      </tr>
   </tbody>
</table>

You can see the full page source at the following link:

https://www.eex.com/en/products/global-commodities/freight

2 Answers

If your use case is to scrape all of the text in the cell that lists the Time Charter codes, you have to induce WebDriverWait for visibility_of_element_located(), and you can use either of the following Locator Strategies:

  • Using CSS_SELECTOR:

    driver.get('https://www.eex.com/en/products/global-commodities/freight')
    print(WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "article div:last-child table>tbody>tr td:nth-child(2)>p"))).text)
    
  • Using XPATH:

    driver.get('https://www.eex.com/en/products/global-commodities/freight')
    print(WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//h3[text()='Contract Specifications']//following::table[1]/tbody/tr//following::td[1]/p"))).text)
    
  • Console Output:

    Time Charter: CPT, CTC, PTC, STC, SPT, HTC, P5TC
    Time Charter Trip: P1A, P2A, P3A,
    P1E, P2E, P3E
    
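  • Note: You have to add the following imports (the snippets also assume that driver has already been created, e.g. driver = webdriver.Chrome()):

    from selenium import webdriver
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC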
    

Update 1

If you want to extract CPT, CTC, PTC, STC, SPT, HTC, P5TC and P1A, P2A, P3A and P1E, P2E, P3E individually, you can use the following solutions:

  • Printing CPT, CTC, PTC, STC, SPT, HTC, P5TC

    #element = WebDriverWait(driver, 10).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "article div:last-child table>tbody>tr td:nth-child(2)>p")))
    element = WebDriverWait(driver, 10).until(EC.visibility_of_element_located((By.XPATH, "//h3[text()='Contract Specifications']//following::table[1]/tbody/tr//following::td[1]/p")))
    print(driver.execute_script('return arguments[0].childNodes[1].textContent;', element).strip())
    
  • Printing P1A, P2A, P3A

    #element = WebDriverWait(driver, 10).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "article div:last-child table>tbody>tr td:nth-child(2)>p")))
    element = WebDriverWait(driver, 10).until(EC.visibility_of_element_located((By.XPATH, "//h3[text()='Contract Specifications']//following::table[1]/tbody/tr//following::td[1]/p")))
    print(driver.execute_script('return arguments[0].childNodes[4].textContent;', element).strip())
    
  • Printing P1E, P2E, P3E

    #element = WebDriverWait(driver, 10).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "article div:last-child table>tbody>tr td:nth-child(2)>p")))
    element = WebDriverWait(driver, 10).until(EC.visibility_of_element_located((By.XPATH, "//h3[text()='Contract Specifications']//following::table[1]/tbody/tr//following::td[1]/p")))
    print(driver.execute_script('return arguments[0].lastChild.textContent;', element).strip())
    

Update 2

To print all the items together:

  • Code Block:

    driver.get('https://www.eex.com/en/products/global-commodities/freight')
    element = WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//h3[text()='Contract Specifications']//following::table[1]/tbody/tr//following::td[1]/p")))
    first = driver.execute_script('return arguments[0].childNodes[1].textContent;', element).strip()
    second = driver.execute_script('return arguments[0].childNodes[4].textContent;', element).strip()
    third = driver.execute_script('return arguments[0].lastChild.textContent;', element).strip()
    # Use a name that does not shadow the built-in list type
    for item in (first, second, third):
        print(item)
    
  • Console Output:

    CPT, CTC, PTC, STC, SPT, HTC, P5TC
    P1A, P2A, P3A,
    P1E, P2E, P3E
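
The child-node indexes used above (childNodes[1], childNodes[4], lastChild) depend on the exact markup of the cell, so they may stop matching if a code such as P4E is added or a line break moves. As a possible alternative, the text returned by the first snippet (element.text) could be parsed in plain Python. The following is only a sketch under two assumptions that are not stated in the question: every label looks like "Some Words:" and every code is 3 or 4 uppercase letters or digits. The names parse_code_groups, LABEL_RE and CODE_RE are made up for illustration.

```python
import re

# Assumption: a label is a run of letters/spaces followed by a colon.
LABEL_RE = re.compile(r'([A-Z][A-Za-z ]+):')
# Assumption: a code is 3 or 4 uppercase letters or digits.
CODE_RE = re.compile(r'\b[A-Z0-9]{3,4}\b')


def parse_code_groups(cell_text):
    """Return {label: [codes]} for text such as element.text."""
    # Splitting on a pattern with a capture group keeps the labels:
    # ['', label1, body1, label2, body2, ...]
    parts = LABEL_RE.split(cell_text)
    groups = {}
    for label, body in zip(parts[1::2], parts[2::2]):
        groups[label.strip()] = CODE_RE.findall(body)
    return groups


# Text copied from the console output shown earlier in this answer.
cell_text = (
    "Time Charter: CPT, CTC, PTC, STC, SPT, HTC, P5TC\n"
    "Time Charter Trip: P1A, P2A, P3A,\n"
    "P1E, P2E, P3E"
)

groups = parse_code_groups(cell_text)
# Flatten the per-label lists into one list of codes.
all_codes = [code for codes in groups.values() for code in codes]
print(groups)
print(all_codes)
```

In a Selenium session, cell_text would be replaced by the .text of the element located with either locator above.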
    
undetected Selenium

If variable txt contains HTML from your question, then this script extracts all required codes:

import re
from bs4 import BeautifulSoup

soup = BeautifulSoup(txt, 'html.parser')
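# get_text(' ') puts a space where tags such as <br> were, so codes stay separated from the following label
text = soup.select_one('td:contains("Time Charter:")').get_text(' ')
# match whole 3- or 4-character codes so that P5TC is not cut to P5T
codes = re.findall(r'\b[A-Z\d]{3,4}\b', text)

print(codes)

Prints:

['CPT', 'CTC', 'PTC', 'STC', 'SPT', 'HTC', 'P5TC', 'P1A', 'P2A', 'P3A', 'P1E', 'P2E', 'P3E']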

EDIT: To get codes from all tables, you can use this script:

import re
from bs4 import BeautifulSoup

soup = BeautifulSoup(txt, 'html.parser')
all_codes = []
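for td in soup.select('td:contains("Time Charter:")'):
    # extend() keeps everything in one flat list; {3,4} also covers 4-character codes such as P5TC
    all_codes.extend(re.findall(r'\b[A-Z\d]{3,4}\b', td.get_text(' ')))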
print(all_codes)
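
For comparison, the same idea can be sketched with only the standard library's html.parser instead of BeautifulSoup. This is not part of the original script. It also sidesteps the :contains() selector, which, as far as I know, newer Soup Sieve releases flag as deprecated in favour of :-soup-contains(). CellTextParser, extract_codes and sample_html are hypothetical names, the sample markup is a shortened stand-in for the real page, and the 3-to-4-character code pattern is an assumption.

```python
import re
from html.parser import HTMLParser


class CellTextParser(HTMLParser):
    """Collect the text of every <td>, joining fragments with spaces."""

    def __init__(self):
        super().__init__()
        self.cells = []
        self._buf = None  # None while the parser is outside a <td>

    def handle_starttag(self, tag, attrs):
        if tag == 'td':
            self._buf = []

    def handle_endtag(self, tag):
        if tag == 'td' and self._buf is not None:
            self.cells.append(' '.join(self._buf))
            self._buf = None

    def handle_data(self, data):
        if self._buf is not None and data.strip():
            self._buf.append(data.strip())


def extract_codes(html, marker='Time Charter:'):
    """Return codes from every cell containing marker, without duplicates."""
    parser = CellTextParser()
    parser.feed(html)
    codes = []
    for cell in parser.cells:
        if marker not in cell:
            continue
        # Assumption: a code is 3 or 4 uppercase letters or digits.
        for code in re.findall(r'\b[A-Z\d]{3,4}\b', cell):
            if code not in codes:
                codes.append(code)
    return codes


# Shortened stand-in for a page with two tables.
sample_html = """
<table><tr>
  <td><p><strong>Time Charter:</strong> CPT, P5TC<br>
      <strong>Time Charter Trip:</strong> P1A,<br>P1E</p></td>
  <td><strong>Voyage: </strong>C3E</td>
</tr></table>
<table><tr>
  <td><strong>Time Charter:</strong> OCPM, O5PM</td>
</tr></table>
"""

result = extract_codes(sample_html)
print(result)
```

The parser does not handle nested tables, so treat it as a starting point rather than a drop-in replacement.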
Andrej Kesely
  • Thanks Andrej. This helped. But can you tell me what modification is needed if the page contains 2 tables? This code (text = soup.select_one('td:contains("Time Charter:")').text) only extracts codes from the 1st table, and the page also contains a 2nd table with codes under Time Charter. – chintan patel Jun 12 '20 at 08:59
  • Code output comes in 2 sets. Would be great if you can update code to get this in 1 set. Your Output: [ ['CPT', 'CTC', 'PTC', 'STC', 'SPT', 'HTC', 'P5TC', 'P1A', 'P2A', 'P3A', 'P1E', 'P2E', 'P3E'], ['OCPM', 'OCTM', 'OPTM', 'OTSM', 'OPSM', 'OHTM', 'O5PM'] ] Required Output: ['CPT', 'CTC', 'PTC', 'STC', 'SPT', 'HTC', 'P5TC', 'P1A', 'P2A', 'P3A', 'P1E', 'P2E', 'P3E','OCPM', 'OCTM', 'OPTM', 'OTSM', 'OPSM', 'OHTM', 'O5PM'] – chintan patel Jun 12 '20 at 09:43
  • Thank you very much. Excellent answer @Andrej Kesely. This is a complete solution. Really appreciate. – chintan patel Jun 12 '20 at 10:31