from scrapy import Selector
import requests
url = 'http://lines.coscoshipping.com/home/Services/ship/0'
html = requests.get(url).text
sel = Selector(text=html)
sel.xpath('//tr/td[1]/div/div/div[2]/div[1]/text()').extract()

Can anyone help with this scraping? I just want to extract the names of each vessel. Many thanks in advance.

BurgerQueen
    What's wrong with it? You need to add more detail to your question – Pentium1080Ti Feb 01 '21 at 13:15
  • 2
    Looks like the data on that page is loaded using JavaScript. Try using [Selenium](https://selenium-python.readthedocs.io/) instead of `requests` (which won't give you the dynamic content). – costaparas Feb 01 '21 at 13:26

2 Answers


The following script queries the site's internal JSON API directly and should produce the results:

import requests

# The page loads its vessel data from this internal JSON endpoint
link = 'http://lines.coscoshipping.com/homeapi/ship/findShips.do?slots=0&language=1'

with requests.Session() as s:
    # Mimic a real browser so the endpoint accepts the request
    s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.104 Safari/537.36'
    s.headers['Referer'] = 'http://lines.coscoshipping.com/home/Services/ship/0'
    r = s.get(link)
    for item in r.json()['data']['content']:
        print(item['shipNameCn'], item['shipNameEn'])

The output looks like:

中海太平洋 CSCL PACIFIC OCEAN
中海印度洋 CSCL INDIAN OCEAN
中海大西洋 CSCL ATLANTIC OCEAN
中海之星 CSCL STAR
中海土星 CSCL SATURN
中海天王星 CSCL URANUS
中海水星 CSCL MERCURY
中海木星 CSCL JUPITER
中海金星 CSCL VENUS
中海火星 CSCL MARS
中海海王星 CSCL NEPTUNE
中海阿里山 JEBEL ALI
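
For reference, the loop assumes the JSON payload has roughly the shape mocked below. The mock data is illustrative (two entries copied from the output above), just to show the `['data']['content']` traversal without a network call:

```python
# Hypothetical, minimal mock of the findShips.do JSON payload,
# based only on the fields the loop above accesses.
payload = {
    'data': {
        'content': [
            {'shipNameCn': '中海太平洋', 'shipNameEn': 'CSCL PACIFIC OCEAN'},
            {'shipNameCn': '中海印度洋', 'shipNameEn': 'CSCL INDIAN OCEAN'},
        ]
    }
}

# Same traversal as r.json()['data']['content'] in the script
names = [(item['shipNameCn'], item['shipNameEn'])
         for item in payload['data']['content']]
print(names)
```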
SIM

The problem:

Inspecting the HTML of your shown URL, it appears that the page content is mostly loaded dynamically. You can use a tool that can run the JavaScript to generate the page content so you can then extract the relevant information.

The requests library won't do this for you. You can instead use the selenium library.
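
A quick way to confirm this for yourself is to fetch the raw HTML with requests and check whether the markup the vessel list uses is present. The helper and sample strings below are hypothetical, just to illustrate the idea without a network call:

```python
def is_rendered_client_side(raw_html: str, marker: str = 'class="names"') -> bool:
    """Return True if the marker the page's JavaScript would insert
    is absent from the raw (pre-JavaScript) HTML."""
    return marker not in raw_html

# Illustrative: what requests typically sees (no JS executed)
raw = '<div id="app"></div>'
# Illustrative: what the browser builds after running the JS
rendered = '<div class="names"><div>中海之星</div><div>CSCL STAR</div></div>'

print(is_rendered_client_side(raw))       # markup missing, so JS is needed
print(is_rendered_client_side(rendered))
```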

Using Selenium:

Firstly, observe that the HTML for the vessel names looks like this:

<div data-v-5c859f2b="" class="names">
    <div data-v-5c859f2b="">中海太平洋</div>
    <div data-v-5c859f2b="">CSCL PACIFIC OCEAN</div>
</div>

The code below uses find_elements_by_class_name() to extract the HTML tags with the class names (which is used for the vessel names).

Then, find_elements_by_tag_name() is used to find the child div tags, which contain the Chinese and English names.

from selenium import webdriver

import textwrap


url = 'http://lines.coscoshipping.com/home/Services/ship/0'
driver = webdriver.Firefox(executable_path='YOUR PATH')  # or Chrome
driver.get(url)
for vessel in driver.find_elements_by_class_name('names'):
    chinese, english = vessel.find_elements_by_tag_name('div')
    print(textwrap.dedent(f'''
        Chinese: {chinese.text}
        English: {english.text}
    '''))

I've also used textwrap.dedent() to prettify the output.
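
As a standalone illustration of what dedent() does here (using one vessel's output as sample data):

```python
import textwrap

# An f-string like the one in the loop produces indented lines;
# dedent() strips the common leading whitespace from all of them.
s = '''
    Chinese: 中海太平洋
    English: CSCL PACIFIC OCEAN
'''
print(textwrap.dedent(s))
```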

Example output:

Chinese: 中海太平洋
English: CSCL PACIFIC OCEAN


Chinese: 中海印度洋
English: CSCL INDIAN OCEAN


Chinese: 中海大西洋
English: CSCL ATLANTIC OCEAN


Chinese: 中海之星
English: CSCL STAR

...

See also this post about how to download a driver (either Chrome or Firefox) and add it to the $PATH.
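
On Linux/macOS, adding a downloaded driver to the $PATH typically looks something like this (the download location here is illustrative; on Windows you can instead pass the full path via executable_path as in the code above):

```shell
# Assuming geckodriver was downloaded to ~/Downloads (illustrative path)
chmod +x ~/Downloads/geckodriver
# Move it to a directory already on $PATH
sudo mv ~/Downloads/geckodriver /usr/local/bin/
# Verify it now resolves from $PATH
geckodriver --version
```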

An alternative way:

Using splitlines(), we can extract the Chinese and English names of the vessels more succinctly from each of the divs:

from selenium import webdriver

import textwrap


url = 'http://lines.coscoshipping.com/home/Services/ship/0'
driver = webdriver.Firefox(executable_path='YOUR PATH')  # or Chrome
driver.get(url)
for vessel in driver.find_elements_by_class_name('names'):
    chinese, english = vessel.text.splitlines()
    print(textwrap.dedent(f'''
        Chinese: {chinese}
        English: {english}
    '''))

This approach is a bit more presumptive, but it often works (as it does in this case).
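
The tuple unpacking relies on each element's .text containing exactly two lines: the Chinese name, then the English name. A minimal sketch with a sample string standing in for what Selenium would return:

```python
# Sample of vessel.text for one 'names' div, as Selenium would
# return it: two lines, Chinese name then English name.
text = '中海太平洋\nCSCL PACIFIC OCEAN'

# splitlines() yields one item per line, so unpacking into two
# variables works whenever the div holds exactly two names.
chinese, english = text.splitlines()
print(chinese)
print(english)
```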

costaparas
  • Many thanks, I tried to use the chrome driver but there is still a minor problem in the decoding part below. – BurgerQueen Feb 03 '21 at 12:35
  • from selenium import webdriver import textwrap url = 'http://lines.coscoshipping.com/home/Services/ship/0' driver = webdriver.chrome(executable_path = 'C:\Users\my_name\Desktop\chromedriver.exe') driver.get(url) for vessel in driver.find_elements_by_class_name('names'): chinese, english = vessel.find_elements_by_tag_name('div') print(textwrap.dedent(f''' Chinese: {chinese.text} English: {english.text} ''')) – BurgerQueen Feb 03 '21 at 12:38
  • The error is "SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 2-3: truncated \UXXXXXXXX escape" – BurgerQueen Feb 03 '21 at 12:40
  • This is due to the way you declared your `executable_path`. It's a simple fix (see [this post](https://stackoverflow.com/questions/37400974/unicode-error-unicodeescape-codec-cant-decode-bytes-in-position-2-3-trunca)). – costaparas Feb 03 '21 at 12:42
  • And I also tried the codes in firefox, and it showed the error again. – BurgerQueen Feb 03 '21 at 14:13
  • from selenium import webdriver import textwrap url = 'http://lines.coscoshipping.com/home/Services/ship/0' driver = webdriver.Firefox(executable_path = r'C:\Users\my_name\Desktop\geckodriver.exe') driver.get(url) for vessel in driver.find_elements_by_class_name('names'): chinese, english = vessel.find_elements_by_tag_name('div') print(textwrap.dedent(f''' Chinese: {chinese.text} English: {english.text} ''')) – BurgerQueen Feb 03 '21 at 14:14
  • It looks ok to me, except the URL is missing `http://` at the start. Which line does the error occur on? – costaparas Feb 03 '21 at 14:17
  • Sorry for my late reply, I have "http://" in my code. The error happened on line 5 "driver = webdriver.Firefox(executable_path = r'C:\Users\my_name\Desktop\geckodriver.exe')". And the error is "SessionNotCreatedException: Message: Expected browser binary location, but unable to find binary in default location, no 'moz:firefoxOptions.binary' capability provided, and no binary flag set on the command line". I put the geckodriver.exe on desktop and opened jupyter notebook through firefox – BurgerQueen Feb 05 '21 at 16:05
  • Either Firefox is not installed, or it's not installed at the default location; see [this post](https://stackoverflow.com/questions/64908154/sessionnotcreatedexception-message-expected-browser-binary-location-but-unabl) for a resolution. – costaparas Feb 05 '21 at 23:23
  • Many Thanks I'll try this way – BurgerQueen Feb 06 '21 at 11:27
  • And may I consult you about another question, since I really want to learn this method from you. You're an expert. Can I also extract the "year" column with the same method? The "year" column is just beside the "vessel" column. – BurgerQueen Feb 06 '21 at 12:31
  • Yes, you can extract anything on the page you like. But it seems there is also an [undocumented API](http://lines.coscoshipping.com/homeapi/ship/findShips.do?slots=0&language=1) the site uses internally as shown in the [other answer](https://stackoverflow.com/a/65999212/14722562), so you could also do it that way as well. – costaparas Feb 06 '21 at 12:54
  • Thanks for your answer, that's really helpful – BurgerQueen Feb 06 '21 at 15:39