How can I only select links from "#0-9 and A-Z" in BeautifulSoup?

Question

My URL is:

https://en.wikipedia.org/wiki/List_of_South_Korean_dramas

This works well for selecting all the links under the A to Z headings.

    link = s.get(url)
    link_soup = BeautifulSoup(link.text, 'lxml')
    links = (
        link_soup
        .select_one('#A')
        .parent
        .find_next_sibling("ul")
        .find_all("a", href=True)
    )

But when I try select_one('#0-9')

....

    links = (
        link_soup
        .select_one('#0-9')
        .parent
        .find_next_sibling("ul")
        .find_all("a", href=True)
    )

I get this error

SelectorSyntaxError: Malformed id selector at position 0
  line 1:
#0-9
^

How can I select only the links from "#0-9 and A-Z"? I know I could use a for loop with re to change the ending of the URL and scrape the links from there manually, but is there a way to get the same result using select or bs4?

Thanks again for the help.

html ids must start with a letter - perhaps the lxml parser is too strict. You could try html.parser instead and see if you have better luck. On valid ids refer to: https://stackoverflow.com/questions/70579/what-are-valid-values-for-the-id-attribute-in-html — topsail, Jun 19 '22 at 02:43

QHarr · Accepted Answer · 2022-06-19T05:43:39.233

To answer the direct question: use an attribute = value CSS selector to match on the id attribute. The value sits inside quotes, so the leading digit does not trip up the selector parser.

link_soup.select('[id="0-9"]')

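Or escape the leading digit using its Unicode code point. The full escape is \000030, which can be abbreviated to \30; no trailing space is needed here because the next character, -, is not a hex digit. The backslash is doubled because it sits inside a regular Python string literal.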

link_soup.select('#\\30-9')
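
As a rough check of the two selectors above, here is a minimal sketch that runs them against a small inline document. The HTML is a made-up, simplified stand-in for the Wikipedia page (the /wiki/Drama_N hrefs are placeholders, not real articles), and html.parser is used only so the snippet does not depend on lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup: an h2 heading whose child carries the id,
# followed by the ul that holds the links. Not the real Wikipedia page.
html = """
<h2><span id="0-9">0-9</span></h2>
<ul>
  <li><a href="/wiki/Drama_1">Drama 1</a></li>
  <li><a href="/wiki/Drama_2">Drama 2</a></li>
</ul>
<h2><span id="A">A</span></h2>
<ul>
  <li><a href="/wiki/Drama_3">Drama 3</a></li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")

# Attribute selector: the id value is quoted, so the leading digit is fine.
by_attribute = soup.select_one('[id="0-9"]')

# Escaped id selector: \30 is the code point of "0".
by_escape = soup.select_one("#\\30-9")

# Both selectors should land on the same element.
same_element = by_attribute is by_escape

# Same DOM walk as in the question, now starting from the 0-9 heading.
links = by_attribute.parent.find_next_sibling("ul").find_all("a", href=True)
hrefs = [a["href"] for a in links]
print(same_element, hrefs)
```

Either selector can replace '#A' in the chain from the question; the rest of the walk stays the same.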

However, you can use a single selector to extract all the links in one go, without walking up and down the DOM. It takes every link inside a ul that immediately follows an h2, skipping the h2 that contains the See_also id.

links = ['https://en.wikipedia.org' + i['href'] for i in link_soup.select('h2:not(:has(#See_also)) + ul a')]
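
To illustrate how that single selector behaves, here is a small sketch on made-up markup. The section layout and the /wiki/... hrefs are assumptions for illustration only, and the live page's structure may differ. It swaps the string concatenation for urllib.parse.urljoin, which gives the same result for these root-relative hrefs.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE = "https://en.wikipedia.org"

# Hypothetical, simplified markup with two list sections and a "See also"
# section that should be excluded. Not the real Wikipedia page.
html = """
<h2><span id="0-9">0-9</span></h2>
<ul><li><a href="/wiki/Drama_1">Drama 1</a></li></ul>
<h2><span id="A">A</span></h2>
<ul><li><a href="/wiki/Drama_2">Drama 2</a></li></ul>
<h2><span id="See_also">See also</span></h2>
<ul><li><a href="/wiki/Other_list">Other list</a></li></ul>
"""

soup = BeautifulSoup(html, "html.parser")

# h2:not(:has(#See_also)) -> every h2 that does not contain the See_also id
# + ul                    -> the ul that is the very next sibling element
# a                       -> every link inside that ul
selector = "h2:not(:has(#See_also)) + ul a"

all_links = [urljoin(BASE, a["href"]) for a in soup.select(selector)]
print(all_links)
```

The 0-9 section is picked up here without any escaping, because the selector never names that id directly.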
