This is a result of a find_all('a') (it's very long):

</a>, <a class="btn text-default text-dark clear_filters pull-right group-ib" href="#" id="export_dialog_close" title="Cancel"><span class="glyphicon glyphicon-remove"></span><span>Cancel</span></a>, <a href="/ais/index/port_moves/all/include_anchs:no/ship_type:7/_:3525d580eade08cfdb72083b248185a9/in_transit:no/time_interval:1474948800.0_1475035200.00/per_page:50/portname:NOVOROSSIYSK/cb:6651/move_type:1/sort:SHIPNAME/direction:asc">Vessel Name</a>, <a href="/ais/index/port_moves/all/include_anchs:no/ship_type:7/_:3525d580eade08cfdb72083b248185a9/in_transit:no/time_interval:1474948800.0_1475035200.00/per_page:50/portname:NOVOROSSIYSK/cb:6651/move_type:1/sort:TIMESTAMP_UTC/direction:asc">Timestamp</a>, <a href="/ais/index/port_moves/all/include_anchs:no/ship_type:7/_:3525d580eade08cfdb72083b248185a9/in_transit:no/time_interval:1474948800.0_1475035200.00/per_page:50/portname:NOVOROSSIYSK/cb:6651/move_type:1/sort:PORT_NAME/direction:asc">Port</a>, <a href="/ais/index/port_moves/all/include_anchs:no/ship_type:7/_:3525d580eade08cfdb72083b248185a9/in_transit:no/time_interval:1474948800.0_1475035200.00/per_page:50/portname:NOVOROSSIYSK/cb:6651/move_type:1/sort:MOVE_TYPE_NAME/direction:asc">Port Call type</a>, <a href="/ais/index/port_moves/all/include_anchs:no/ship_type:7/_:3525d580eade08cfdb72083b248185a9/in_transit:no/time_interval:1474948800.0_1475035200.00/per_page:50/portname:NOVOROSSIYSK/cb:6651/move_type:1/sort:ELAPSED/direction:asc">Time Elapsed</a>, <a href="/en/ais/details/ships/shipid:465271/imo:9495595/mmsi:373571000/vessel:SIDER%20LUCK/_:3525d580eade08cfdb72083b248185a9" title="View details for: SIDER LUCK">SIDER LUCK</a>, <a href="/en/ais/details/ports/767/port_name:NOVOROSSIYSK/_:3525d580eade08cfdb72083b248185a9" title="View details for: NOVOROSSIYSK">NOVOROSSIYSK</a>, <a href="/en/ais/details/ports/163/port_name:MILAZZO/_:3525d580eade08cfdb72083b248185a9" title="View details for: MILAZZO">MILAZZO</a>, <a 
href="/en/ais/details/ships/shipid:288753/imo:9389693/mmsi:249474000/vessel:OOCL%20ISTANBUL/_:3525d580eade08cfdb72083b248185a9" title="View details for: OOCL ISTANBUL">OOCL ISTANBUL</a>, <a href="/en/ais/details/ports/767/port_name:NOVOROSSIYSK/_:3525d580eade08cfdb72083b248185a9" title="View details for: NOVOROSSIYSK">NOVOROSSIYSK</a>, <a href="/en/ais/details/ports/17436/port_name:AMBARLI/_:3525d580eade08cfdb72083b248185a9" title="View details for: AMBARLI">AMBARLI</a>, <a href="/en/ais/details/ships/shipid:754480/imo:9045613/mmsi:636014098/vessel:TK%20ROTTERDAM/_:3525d580eade08cfdb72083b248185a9" title="View details for: TK ROTTERDAM">TK ROTTERDAM</a>, <a href="/en/ais/details/ports/767/port_name:NOVOROSSIYSK/_:3525d580eade08cfdb72083b248185a9" title="View details for: NOVOROSSIYSK">NOVOROSSIYSK</a>, <a href="/en/ais/details/ports/3504/port_name:DILISKELESI/_:3525d580eade08cfdb72083b248185a9" title="View details for: DILISKELESI">DILISKELESI</a>, <a href="/en/ais/details/ships/shipid:412277/imo:9039585/mmsi:353430000/vessel:SEA%20AEOLIS/_:3525d580eade08cfdb72083b248185a9" title="View details for: SEA AEOLIS">SEA AEOLIS</a>, <a href="/en/ais/details/ports/767/port_name:NOVOROSSIYSK/_:3525d580eade08cfdb72083b248185a9" title="View details for: NOVOROSSIYSK">NOVOROSSIYSK</a>, <a href="/en/ais/details/ports/1/port_name:PIRAEUS/_:3525d580eade08cfdb72083b248185a9" title="View details for: PIRAEUS">PIRAEUS</a>, <a href="/en/ais/details/ships/shipid:346713/imo:7614599/mmsi:273327300/vessel:SOLIDAT/_:3525d580eade08cfdb72083b248185a9" title="View details for: SOLIDAT">SOLIDAT</a>, <a href="/en/ais/details/ports/767/port_name:NOVOROSSIYSK/_:3525d580eade08cfdb72083b248185a9" title="View details for: NOVOROSSIYSK">NOVOROSSIYSK</a>, <a href="/en/ais/details/ports/883/port_name:SEVASTOPOL/_:3525d580eade08cfdb72083b248185a9" title="View details for: SEVASTOPOL">SEVASTOPOL</a>, <a 
href="/en/ais/details/ships/shipid:752974/imo:9195298/mmsi:636011072/vessel:OCEANPRINCESS/_:3525d580eade08cfdb72083b248185a9" title="View details for: OCEANPRINCESS">OCEANPRINCESS</a>, <a href="/en/ais/details/ports/767/port_name:NOVOROSSIYSK/_:3525d580eade08cfdb72083b248185a9" title="View details for: NOVOROSSIYSK">NOVOROSSIYSK</a>, <a href="/en/ais/details/ports/21780/port_name:EREGLI/_:3525d580eade08cfdb72083b248185a9" title="View details for: EREGLI">EREGLI</a>, <a href="/en/ais/details/ships/shipid:201260/imo:9385075/mmsi:235102768/vessel:EMERALD%20BAY/_:3525d580eade08cfdb72083b248185a9" title="View details for: EMERALD BAY">EMERALD BAY</a>, <a href="/en/ais/details/ports/767/port_name:NOVOROSSIYSK/_:3525d580eade08cfdb72083b248185a9" title="View details for: NOVOROSSIYSK">NOVOROSSIYSK</a>, <a href="/en/ais/details/ports/767/port_name:NOVOROSSIYSK/_:3525d580eade08cfdb72083b248185a9" title="View details for: NOVOROSSIYSK">NOVOROSSIYSK</a>, <a href="/en/ais/details/ships/shipid:418956/imo:9102746/mmsi:356579000/vessel:MSC%20DON%20GIOVANNI/_:3525d580eade08cfdb72083b248185a9" title="View details for: MSC DON GIOVANNI">MSC DON GIOVANNI</a>, <a href="/en/ais/details/ports/767/port_name:NOVOROSSIYSK/_:3525d580eade08cfdb72083b248185a9" title="View details for: NOVOROSSIYSK">NOVOROSSIYSK</a>, <a href="/en/ais/details/ports/67/port_name:CONSTANTA/_:3525d580eade08cfdb72083b248185a9" title="View details for: CONSTANTA">CONSTANTA</a>, <a href="/en/ais/details/ships/shipid:748395/imo:9460734/mmsi:622121422/vessel:WADI%20SAFAGA/_:3525d580eade08cfdb72083b248185a9" title="View details for: WADI SAFAGA">WADI SAFAGA</a>, <a href="/en/ais/details/ports/767/port_name:NOVOROSSIYSK/_:3525d580eade08cfdb72083b248185a9" title="View details for: NOVOROSSIYSK">NOVOROSSIYSK</a>, <a href="/en/ais/details/ports/997/port_name:DAMIETTA/_:3525d580eade08cfdb72083b248185a9" title="View details for: DAMIETTA">DAMIETTA</a>

I want to pull out the strings that start with /en/ais/details/ships/shipid: such as:

<a href="/en/ais/details/ships/shipid:465271/imo:9495595/mmsi:373571000/vessel:SIDER%20LUCK/_:3525d580eade08cfdb72083b248185a9"

I was able to copy these examples (Find specific link w/ beautifulsoup or How to get Beautiful Soup to get link from href and class?) but I would rather not use regex.

So far I have:

for i in ase:  # ase is where the HTML is stored
    print(i.get('href'))  # prints every single href

In short, my question is: how do I keep only the hrefs that contain the string I'm interested in, without using regex?

Rafael

3 Answers

Try the following list comprehension:

[h.get('href') for h in ase if 'string' in h.get('href', '')] 

This will give you a list containing only the links that contain the substring 'string'.

Update:

As @PadraicCunningham pointed out in the comments, 'string' in h.get('href') (which was part of my original answer) will raise a TypeError if h has no href attribute, because .get() then returns None. That is uncommon for <a> tags, but entirely possible. To handle it, pass .get() a default argument of '' to be returned instead of None when the attribute does not exist.

Also, I make no claim that my solution is the best; it is likely not particularly efficient or elegant. However, from my understanding of the OP's question, this solution will work, is minimal, and is easy to understand.
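To make the effect of the '' default concrete, here is a minimal sketch. It assumes plain dicts as stand-ins for the tags' attribute mappings (so it runs without Beautiful Soup), uses shortened hrefs, and swaps the substring test for startswith() since the question asks for links that start with the path. The href-less entry is hypothetical.

```python
# Plain dicts stand in for the attribute mappings of <a> tags
# (an assumption so the snippet runs without Beautiful Soup installed).
links = [
    {"href": "/en/ais/details/ships/shipid:465271/imo:9495595"},  # shortened
    {"href": "/en/ais/details/ports/767/port_name:NOVOROSSIYSK"},
    {"id": "export_dialog_close"},  # hypothetical anchor with no href at all
]

prefix = "/en/ais/details/ships/shipid:"

# Without a default, .get() returns None for the last entry and the
# membership test raises TypeError.
try:
    [a.get("href") for a in links if prefix in a.get("href")]
    raised = False
except TypeError:
    raised = True

# With a default of "", entries lacking href are simply skipped.
ship_links = [
    a.get("href", "") for a in links if a.get("href", "").startswith(prefix)
]
print(raised, ship_links)
```

The same .get(key, default) call works on a Beautiful Soup tag, so the last comprehension should carry over unchanged when links is a list of tags.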

Community
elethan
  • wait, can you really do `element.get('attr')` instead of `element.attrs.get('attr')`?! that looks so much nicer! e: checked docs, and it's mentioned right there. don't know how i missed that for so long. – n1c9 Oct 06 '16 at 17:02
  • 1
    @n1c9 Nice, isn't it? There is a reason it is called "Beautiful"Soup, haha. – elethan Oct 06 '16 at 17:04
  • 1
    `'string' in None -> error`, you only need to use .get if you expect a href to not exist on the node you are calling it on, considering you can set `href=True` and also use a css selector, a regex etc.. there is no reason to ever need to resort to using .get, especially calling it twice. Also a substring being in a string is not the same as a string starting with a substring. – Padraic Cunningham Oct 07 '16 at 10:15
  • 1
    @PadraicCunningham thanks for pointing out the possibility of `.get()` returning `None` and raising an error here. This was a big oversight on my part, and I have updated my answer. – elethan Oct 07 '16 at 14:07

@elethan's answer is not the best approach: it collects every link first and only then filters them. We can instead ask for just the links we need, with no extra filtering step; BeautifulSoup is very capable of that:

prefix = "/en/ais/details/ships/shipid"
[a["href"] for a in soup("a", href=lambda x: x and x.startswith(prefix))]

Or, instead of a function, you can pass a regular expression pattern to check if a string "starts with" a desired sub-string:

import re
pattern = re.compile(r"^/en/ais/details/ships/shipid")
[a["href"] for a in soup("a", href=pattern)]

^ here denotes the beginning of a string.

Or, we can even use a CSS selector:

[a["href"] for a in soup.select('a[href^="/en/ais/details/ships/shipid"]')]

^= is a "starts-with" selector.
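As a quick check that the three variants agree, here is a self-contained sketch. The markup is a shortened, partly made-up sample modelled on the question's output (trimmed hrefs, plus a hypothetical anchor without href), and it assumes the bs4 package is installed.

```python
import re

from bs4 import BeautifulSoup

# Shortened sample markup; hrefs are trimmed and the last anchor is hypothetical.
html = """
<a href="#" id="export_dialog_close" title="Cancel">Cancel</a>
<a href="/en/ais/details/ships/shipid:465271/imo:9495595">SIDER LUCK</a>
<a href="/en/ais/details/ports/767/port_name:NOVOROSSIYSK">NOVOROSSIYSK</a>
<a href="/en/ais/details/ships/shipid:288753/imo:9389693">OOCL ISTANBUL</a>
<a name="no-href">anchor without href</a>
"""
soup = BeautifulSoup(html, "html.parser")
prefix = "/en/ais/details/ships/shipid:"

# 1. Function filter: x is None when the tag has no href, hence "x and ...".
by_function = [
    a["href"]
    for a in soup.find_all("a", href=lambda x: x and x.startswith(prefix))
]

# 2. Regular expression anchored at the start of the attribute value.
by_regex = [
    a["href"] for a in soup.find_all("a", href=re.compile("^" + re.escape(prefix)))
]

# 3. CSS "starts-with" attribute selector.
by_css = [a["href"] for a in soup.select('a[href^="' + prefix + '"]')]

print(by_function == by_regex == by_css)
```

All three lists hold the two ship links in document order; which variant to use is mostly a matter of taste, with the function filter being the one that avoids regex entirely.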

Graham
alecxe

I would still advise you to use regex, as it is more concise and saves you another loop over the list.

import re
# soup is the parsed BeautifulSoup document; ^ anchors the match at the start of the href
soup.find_all('a', href=re.compile(r"^/en/ais/details/ships/shipid:"))

The Beautiful Soup documentation shows a similar solution.
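One detail worth keeping in mind with the regex approach: Beautiful Soup filters attribute values with the compiled pattern's search() method, so a pattern without ^ matches the path anywhere in the href, not only at the start. A small sketch using only the re module; the hrefs are shortened and the second one is a made-up example:

```python
import re

unanchored = re.compile(r"/en/ais/details/ships/shipid:")
anchored = re.compile(r"^/en/ais/details/ships/shipid:")

hrefs = [
    "/en/ais/details/ships/shipid:465271/imo:9495595",
    "/redirect?to=/en/ais/details/ships/shipid:465271",  # hypothetical href
    "/en/ais/details/ports/767/port_name:NOVOROSSIYSK",
]

# search() scans the whole string, so the unanchored pattern also accepts
# the redirect link; the anchored one keeps only hrefs that start with the path.
unanchored_hits = [h for h in hrefs if unanchored.search(h)]
anchored_hits = [h for h in hrefs if anchored.search(h)]
print(len(unanchored_hits), len(anchored_hits))
```

For the links in the question both patterns should give the same result; the anchor only matters if the path can also appear later in an href.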

imant