I am web-scraping and, for each URL, trying to append the first link from a list of matching links (built with a list comprehension), but I am having trouble. I have looked through a lot of posts which get me close, but not quite. I either get an error (shown below) or all links (not just the first from each URL). I have tried solutions shown here, but get a different error around NavigableString. Please see below for my previous code, the error, and my ideal output. Thank you for any help!

Code

dfkf['URL'][0:5].values = 
      ['https://www.sec.gov/Archives/edgar/data/867028/0001493152-19-010877-index.htm',
       'https://www.sec.gov/Archives/edgar/data/1438901/0001161697-19-000350-index.htm',
       'https://www.sec.gov/Archives/edgar/data/1750/0001047469-19-004266-index.htm',
       'https://www.sec.gov/Archives/edgar/data/1138723/0001564590-19-032909-index.htm',
       'https://www.sec.gov/Archives/edgar/data/1650101/0001493152-19-009992-index.htm']


x = []
for URL in dfkf['URL'][0:5].values:
    r = requests.get(str(URL))
    soup = BeautifulSoup(r.text, 'html.parser')
    x.append([line['href'] for line in list(soup.find_all(text = re.compile('xml'), href=True))][0])

Error

IndexError: list index out of range

Ideal Output (first link out of the list of returned links)

  x= ['/Archives/edgar/data/867028/000149315219010877/etfm-20181231.xml',
  [],
  '/Archives/edgar/data/1750/000104746919004266/air-20190531.xml',
  '/Archives/edgar/data/1138723/000156459019032909/aray-20190630.xml',
  '/Archives/edgar/data/1650101/000149315219009992/atxg-20190331.xml']
JJAN

1 Answer

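The IndexError comes from the pages where nothing matches: the comprehension produces an empty list, and indexing it with [0] fails. There is no need for the list comprehension; check whether find_all returned anything before taking the first element: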

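import re
import requests
from bs4 import BeautifulSoup

x = []
for URL in dfkf['URL'][0:5].values:
    r = requests.get(str(URL))
    soup = BeautifulSoup(r.text, 'html.parser')
    # Tags that have an href and whose text contains 'xml'
    links = soup.find_all(text=re.compile('xml'), href=True)
    if links:
        x.append(links[0]['href'])
    else:
        # No matching link on this page
        x.append(list())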

Edit: Probably better to do x.append(None) than x.append(list()) unless you really need an empty list in your results.
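
To see the guard in isolation, here is a minimal sketch that needs no network access. The helper name `first_or_default` and the sample hrefs are made up for illustration; they stand in for the 'href' values you would pull out of the find_all results.

```python
def first_or_default(items, default=None):
    """Return items[0], or default when items is empty."""
    return items[0] if items else default


# Made-up hrefs per page (hypothetical data, not taken from the real pages).
hrefs_per_page = [
    ["/data/a-2018.xml", "/data/a-2018_cal.xml"],
    [],  # a page with no matching links
    ["/data/c-2019.xml"],
]

# Indexing an empty list directly is what raises the IndexError.
empty = []
try:
    empty[0]
except IndexError as exc:
    error_message = str(exc)

# With the guard, a page without matches contributes None instead of failing.
x = [first_or_default(hrefs) for hrefs in hrefs_per_page]
print(error_message)
print(x)
```

Passing `default=[]` instead would give the empty-list placeholder from the question's ideal output.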

FiddleStix
  • Thank you very much! This works great. I was not familiar with using "if len(foo):" to check whether something exists, or with the seemingly more recent "if foo:" recommended by the style guide - https://www.python.org/dev/peps/pep-0008/ – JJAN Oct 22 '19 at 17:40
  • Oh yeah, I didn't realise you could just do `if my_list:` and skip the `len()`. Answer updated. – FiddleStix Oct 23 '19 at 09:53
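
For anyone else landing on the `if foo:` point from the comments: empty sequences are falsy in Python, and PEP 8 recommends relying on that for sequences. A tiny sketch with plain lists (the variable names and the sample href are illustrative only):

```python
no_links = []
some_links = ["/data/a-2018.xml"]  # made-up href

# For lists, the truthiness test and the explicit length test agree.
for links in (no_links, some_links):
    assert bool(links) == (len(links) > 0)

# 'if links' is all that is needed to tell the two cases apart.
results = ["found" if links else "missing" for links in (no_links, some_links)]
print(results)
```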