Dividing scraped text with Python and Beautiful Soup

Question

I've scraped the timetable from this website. The output I get is:

"ROUTE": "NAPOLI PORTA DI MASSA \u00bb ISCHIA"

but I would like:

"DEPARTURE PORT": "NAPOLI PORTA DI MASSA"
"ARRIVAL PORT": "ISCHIA"

How do I divide the string? Here is the code:

medmar_live_departures_table = list(soup.select('li.tratta'))
departure_time = []
for li in medmar_live_departures_table:
    next_li = li.find_next_sibling("li")
    while next_li and next_li.get("data-toggle"):
        if next_li.get("class") == ["corsa-yes"]:
            #  departure_time.append(next_li.strong.text)
            medmar_live_departures_data.append({
                'ROUTE': li.text
            })

Yes, but how do you do it? Sorry, I'm just starting to learn Python... — Daniela, Jan 29 '19 at 16:25

Jamil M. · Accepted Answer · 2019-01-30T06:07:27.650

Two things:

1. "»" is a non-ASCII character, and "\u00bb" is its escaped form, which is what shows up in your output. In Python both spellings denote the same character, so splitting the text on it works:

parse=li.get_text().split('\u00bb')
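
As a minimal sketch of that split, here it is applied to the sample string from the question rather than to the live page. The variable names are illustrative only, and strip() is added to drop the spaces left around each part.

```python
route = "NAPOLI PORTA DI MASSA \u00bb ISCHIA"  # sample text from the question

# "\u00bb" is just an escaped spelling of the "»" character.
assert "\u00bb" == "»"

# Split on the separator and trim the spaces left on either side.
parts = [part.strip() for part in route.split("\u00bb")]
departure_port, arrival_port = parts

print(departure_port)  # NAPOLI PORTA DI MASSA
print(arrival_port)    # ISCHIA
```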

Alternatively, you can split on non-ASCII characters with the re module (you will need to import re if you choose this path):

import re

non_ascii = li.get_text()
parse = re.split(r'[^\x00-\x7f]', non_ascii)
# [^\x00-\x7f] matches any non-ASCII character, as pointed out by Moinuddin Quadri in https://stackoverflow.com/questions/40872126/python-replace-non-ascii-character-in-string
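
One caveat worth illustrating: the character class above splits on every non-ASCII character, not only on "»". The sketch below uses a made-up route name containing an accented letter (it is not taken from the website) to compare that pattern with one that targets "»" specifically.

```python
import re

# Hypothetical route text with an accented letter, for illustration only.
text = "CITTÀ » ISCHIA"

# Splitting on any non-ASCII character also cuts at "À".
broad = re.split(r"[^\x00-\x7f]", text)

# Splitting on "»" only, and swallowing the surrounding whitespace.
narrow = re.split(r"\s*»\s*", text)

print(broad)   # ['CITT', ' ', ' ISCHIA']
print(narrow)  # ['CITTÀ', 'ISCHIA']
```

If the port names on the page are plain ASCII, both patterns should give the same parts apart from the whitespace.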

However, splitting returns a list of parts, and not every "li" element's text contains the "»" character (e.g. the text "POZZUOLI-PROCIDA" at the end of the table on the website). For those entries the list has only one element, so we must account for that case or indexing the second part will raise an IndexError.
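
One possible way to handle both cases in a single place is str.partition, which always returns three parts and reports whether the separator was found. The helper name split_route is hypothetical, and returning None for a missing arrival port is an assumption about what is convenient downstream.

```python
def split_route(text, separator="\u00bb"):
    """Return (departure, arrival); arrival is None when there is no separator."""
    departure, found, arrival = text.partition(separator)
    if not found:
        # No "»" in the text: keep the whole string as a single entry.
        return text.strip(), None
    return departure.strip(), arrival.strip()


print(split_route("NAPOLI PORTA DI MASSA \u00bb ISCHIA"))
# ('NAPOLI PORTA DI MASSA', 'ISCHIA')
print(split_route("POZZUOLI-PROCIDA"))
# ('POZZUOLI-PROCIDA', None)
```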

2. A dictionary keyed by departure port may be a poor choice of data structure, since several routes in the data share the same key.

For example, take POZZUOLI » CASAMICCIOLA and POZZUOLI » PROCIDA. CASAMICCIOLA and PROCIDA would have the same key, POZZUOLI. Python will simply overwrite the value stored under that key, so POZZUOLI: CASAMICCIOLA becomes POZZUOLI: PROCIDA instead of the two routes being kept as separate dictionary entries.
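
The overwrite is easy to reproduce in isolation. This small sketch uses the two routes named above as hard-coded tuples; the variable names are illustrative.

```python
routes = [("POZZUOLI", "CASAMICCIOLA"), ("POZZUOLI", "PROCIDA")]

# Keyed by departure port: the second assignment replaces the first.
by_departure = {}
for departure, arrival in routes:
    by_departure[departure] = arrival

print(by_departure)  # {'POZZUOLI': 'PROCIDA'}

# A list of tuples keeps both routes.
print(len(routes))   # 2
```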

I suggest adding each part of the parse into lists as tuples like so:

single_port = []
ports = []

# bs is the BeautifulSoup object (called soup in the question)
medmar_live_departures_table = list(bs.select('li.tratta'))
departure_time = []
for li in medmar_live_departures_table:
    next_li = li.find_next_sibling("li")
    while next_li and next_li.get("data-toggle"):
        if next_li.get("class") == ["corsa-yes"]:
            #  departure_time.append(next_li.strong.text)
            non_ascii = li.get_text()
            parse = re.split(r'[^\x00-\x7f]', non_ascii)

            # The if statement takes care of table data strings that don't have the non-ASCII character "»"
            if len(parse) > 1:
                ports.append((parse[0].strip(), parse[1].strip()))
            else:
                single_port.append(parse[0].strip())

        # Advance to the next sibling; without this the while loop never ends
        next_li = next_li.find_next_sibling("li")


# This will print out your data in your desired manner
for i in ports:
    print("DEPARTURE: "+i[0])
    print("ARRIVAL: "+i[1])

for i in single_port:
    print(i)

I also tried the split method in a test script that I ran:

import requests
from bs4 import BeautifulSoup
import re

url="https://www.medmargroup.it/"
response=requests.get(url)
bs=BeautifulSoup(response.text, 'html.parser')


timeTable=bs.find('section', class_="primarystyle-timetable")

medmar_live_departures_table=timeTable.find('ul')
single_port= []
ports=[]


for li in medmar_live_departures_table.find_all('li', class_="tratta"):
    parse=li.get_text().split('\u00bb')

    if len(parse)>1:
        ports.append((parse[0],parse[1]))

    else:
        single_port.append(parse[0])


for i in ports:
    print("DEPARTURE: "+i[0])
    print("ARRIVAL: "+i[1])

for i in single_port:
    print(i)
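
If the goal is the exact key layout from the question, each route can also be stored as its own dictionary in a list, which avoids the duplicate-key problem. This is only a sketch: route_texts is a hypothetical stand-in for the li.get_text() values, and ensure_ascii=False is included on the assumption that the data ends up in json.dumps, which escapes non-ASCII characters by default.

```python
import json

# Hypothetical stand-in for the li.get_text() values scraped from the page.
route_texts = ["NAPOLI PORTA DI MASSA \u00bb ISCHIA", "POZZUOLI-PROCIDA"]

departures = []
for text in route_texts:
    departure, found, arrival = text.partition("\u00bb")
    entry = {"DEPARTURE PORT": departure.strip()}
    if found:
        # Only add the arrival port when the text actually contained "»".
        entry["ARRIVAL PORT"] = arrival.strip()
    departures.append(entry)

# ensure_ascii=False keeps any non-ASCII characters readable in the output.
print(json.dumps(departures, ensure_ascii=False, indent=2))
```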

I hope this helps!

Thank you so much! Excellent explanation and reply. People like you make Stackoverlow the great community it is. — Daniela, Feb 03 '19 at 15:24

score 0 · Answer 2 · answered Jan 30 '19 at 03:11

Try this:

medmar_live_departures_table = list(soup.select('li.tratta'))
departure_time = []
for li in medmar_live_departures_table:
    next_li = li.find_next_sibling("li")
    while next_li and next_li.get("data-toggle"):
        if next_li.get("class") == ["corsa-yes"]:
            #  departure_time.append(next_li.strong.text)
            # Assumes every route text contains "»"; otherwise [1] raises IndexError
            medmar_live_departures_data.append({
                'DEPARTURE PORT': li.text.split("\u00bb")[0].strip(),
                'ARRIVAL PORT': li.text.split("\u00bb")[1].strip()
            })
        # Advance to the next sibling so the while loop terminates
        next_li = next_li.find_next_sibling("li")
