using beautifulsoup to find_all in an array only returning first few results

Question

I've successfully used BeautifulSoup to iterate through a few hundred pages of the bandsintown webpage, viewed here: https://www.bandsintown.com/?came_from=257&page=102

I'm able to iterate through each page to create an array of all event dates, called "uniqueDatesBucket". Printing the array gives me the following (there are many results; I've included a sample below).

print uniqueDatesBucket

Result:

  [[<div class="event-b58f7990"><div class="event-ad736269">JAN</div><div class="event-d7a00339">08</div></div>, <div class="event-b58f7990"><div class="event-ad736269">JAN</div><div class="event-d7a00339">08</div></div>, ............................<div class="event-b58f7990"><div class="event-ad736269">JAN</div><div class="event-d7a00339">31</div></div>]]

This is as expected. I then want to place the Month and Day in separate arrays, in order to start building a database of dates. Here's the code:

#Build empty array for month/date
uniqueMonth = []
uniqueDay = []

for i in uniqueDatesBucket[0]:
    uniqueMonthDay = i.find_all('div')

    uniqueMonth.append(uniqueMonthDay[0].text)
    uniqueDay.append(uniqueMonthDay[1].text)

print uniqueDay

The result is:

[u'08', u'08', u'08', u'08', u'08', u'08', u'08', u'08', u'08', u'09', u'09', u'09', u'09', u'09', u'09', u'09', u'09', u'09']

My question is, why is this only returning 18 results (there are 18 events on the landing page of the bandsintown page, but I thought I solved this using the page iterator described previously)? There are clearly more than 18 results shown in the uniqueDatesBucket element, which is the parent of uniqueMonth array.

Also, what is the "u" before each date in the results?

About second part of the question, those are u strings. See [here](https://stackoverflow.com/questions/599625/python-string-prints-as-ustring). — 0xInfection, Jan 09 '19 at 05:01
interesting- thanks. I'll check into that. It sounds like a formatting issue. Any thoughts on why the returned data would be incomplete? — DiamondJoe12, Jan 09 '19 at 05:20
I don't think that's correct. If you look closely, the array is a list of lists, hence the double [[ ]]. Hence, all data from all pages should be in the second set of brackets, i.e. position 0. — DiamondJoe12, Jan 09 '19 at 05:27
DYZ - thanks for the help - Hopefully I described it decently above. It's structured like so: [[ JAN 08 , etc, etc, etc,.........]] So, it's just one item inside the inner list. Correct me if I'm wrong. — DiamondJoe12, Jan 09 '19 at 05:33
It is impossible to tell from the printout if the list has one item or more. I strongly suggest that you look at `len(uniqueDatesBucket)`. You may be surprised (or maybe not, but better safe than sorry). — DYZ, Jan 09 '19 at 06:53

Nick de Silva · Answer 1 · 2019-01-09T23:01:38.733

I've tried my best to replicate your code, but I'm not getting very far. The link you have provided doesn't give me the same output so I can't try and perfectly replicate it.

Using your list that you provided, I hit no issues when I ran it myself:

x = '<div class="event-b58f7990"><div class="event-ad736269">JAN</div><div class="event-d7a00339">08</div></div>, <div class="event-b58f7990"><div class="event-ad736269">JAN</div><div class="event-d7a00339">08</div></div>, <div class="event-b58f7990"><div class="event-ad736269">JAN</div><div class="event-d7a00339">31</div></div>'.split(', ')
x

That gives me the following:

['<div class="event-b58f7990"><div class="event-ad736269">JAN</div><div class="event-d7a00339">08</div></div>',
 '<div class="event-b58f7990"><div class="event-ad736269">JAN</div><div class="event-d7a00339">08</div></div>',
 '<div class="event-b58f7990"><div class="event-ad736269">JAN</div><div class="event-d7a00339">31</div></div>']

Here's what I did to replicate it:

uniqueDatesBucket = []
uniqueMonth = []
uniqueDay = []

for item in x:
    uniqueDatesBucket.append(BeautifulSoup(item, 'html.parser'))

for i in uniqueDatesBucket:
    uniqueMonthDay = i.find_all('div')
    print('Day:\t' + uniqueMonthDay[2].text + '\tMonth:\t', uniqueMonthDay[1].text)

Here's my output:

Day:    08  Month:   JAN
Day:    08  Month:   JAN
Day:    31  Month:   JAN

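Note that the indexes are different from the ones you were using. Here each item is parsed into its own BeautifulSoup object, so find_all('div') also returns the outer div at index 0, which pushes the month to index 1 and the day to index 2. That difference may be the source of the confusion.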
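The index shift can be checked with a small sketch. It assumes bs4 is installed and uses a hand-written snippet shaped like the sample in the question; the variable names are illustrative, not taken from the original code.

```python
from bs4 import BeautifulSoup

# One event block, copied in shape from the sample printed in the question.
html = ('<div class="event-b58f7990">'
        '<div class="event-ad736269">JAN</div>'
        '<div class="event-d7a00339">08</div>'
        '</div>')

# A BeautifulSoup object *contains* the outer div, so find_all('div')
# returns the outer div followed by its two children.
soup = BeautifulSoup(html, 'html.parser')
from_soup = soup.find_all('div')

# Starting from the outer div Tag itself, find_all only searches descendants,
# so the outer div is not part of the result.
outer = soup.div
from_tag = outer.find_all('div')

print(len(from_soup), from_soup[1].text, from_soup[2].text)
print(len(from_tag), from_tag[0].text, from_tag[1].text)
```

In the question the loop runs over Tag objects, so indexes 0 and 1 are the right ones there; indexes 1 and 2 are only needed when each item is re-parsed from a string as above.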

However, if you're scraping from the site you provided, everything was embedded in a JavaScript section, which makes it much easier to parse through and get the correct values. Here's my code to steal it from the JSON embedded in the script:

import requests
from bs4 import BeautifulSoup
import json
import re # regular expression, I just use it to extract the JSON from the JavaScript

x = requests.get('https://www.bandsintown.com/?came_from=257&page=102')

soup = BeautifulSoup(x.content, 'html.parser')

json_text = soup.find_all('script')[2].text # Gives you the JavaScript that assigns a JSON object to the variable window.__data
json_extracted = re.search(r'^window\.__data=(.+)', json_text).group(1) # Capture the JSON without the variable assignment
json_parsed = json.loads(json_extracted)

# The dates are being hidden in json.homeView.body.popularEvents.events
for item in json_parsed['homeView']['body']['popularEvents']['events']:
    print(item['artistName'])
    print('Playing on', item['dayOfWeek'], item['dayOfMonth'], item['month'], '\n')

Here's the output:

Florence and The Machine 
Playing on FRI 18 JAN 

Maroon 5
Playing on FRI 22 FEB 

Shawn Mendes
Playing on TUE 29 OCT 

John Mayer
Playing on WED 27 MAR 

Amy Shark
Playing on SAT 11 MAY 

Post Malone
Playing on TUE 30 APR 

John Butler Trio
Playing on THU 07 FEB 

Florence and The Machine 
Playing on SAT 19 JAN 

Ocean Alley
Playing on THU 14 MAR 

Bring Me the Horizon
Playing on SAT 13 APR
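Since the live page may have changed, the extraction step can also be tried offline. The following is only a sketch using the standard library: the sample_html string and the artist names are invented stand-ins that mirror the key structure used above, and the regular expression assumes the JSON object is the last thing inside its script tag.

```python
import json
import re

# Invented stand-in for the page source; the real page is far larger and
# its layout is not guaranteed to match this.
sample_html = """
<html><body>
<script>window.__data={"homeView": {"body": {"popularEvents": {"events": [
  {"artistName": "Example Artist", "dayOfWeek": "FRI", "dayOfMonth": "18", "month": "JAN"},
  {"artistName": "Another Artist", "dayOfWeek": "SAT", "dayOfMonth": "19", "month": "JAN"}
]}}}}</script>
</body></html>
"""

# Capture everything from the first "{" after the assignment up to the "}"
# that is immediately followed by the closing script tag.
match = re.search(r'window\.__data=(\{.*?\})</script>', sample_html, re.DOTALL)
if match is None:
    raise ValueError('window.__data assignment not found')

data = json.loads(match.group(1))
events = data['homeView']['body']['popularEvents']['events']

# Build the separate month and day lists the question was after.
months = [event['month'] for event in events]
days = [event['dayOfMonth'] for event in events]

print(months)
print(days)
```

Pulling the values out of the parsed JSON avoids depending on the generated CSS class names such as event-b58f7990.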

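As for the u'xyz' strings: the u marks a Python 2 unicode string, and BeautifulSoup returns text as unicode. The prefix only appears when you print the whole list, because that shows each element's repr; printing an element on its own (print uniqueDay[0]) shows just 08. If you need plain byte strings, encode rather than decode: u'xyz'.encode('utf-8').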
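As a sketch of which call goes in which direction, here is the same idea written for Python 3, where every str is already Unicode text and no u prefix is printed. The values are illustrative.

```python
day = '08'                              # str is Unicode text in Python 3

as_bytes = day.encode('utf-8')          # text -> bytes
round_trip = as_bytes.decode('utf-8')   # bytes -> text

days = ['08', '09']
print(repr(days))        # the list repr shows quotes around each element
print(', '.join(days))   # joining or printing elements shows the bare text
```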

score 0 · Answer 2 · answered Jan 09 '19 at 08:48

From my understanding, your problem is not parsing the HTML but processing the resulting list.

From your code:

for i in uniqueDatesBucket[0]:

it seems you are only looping over the first element of uniqueDatesBucket. Did you mean to loop over all of them?

for udb in uniqueDatesBucket:
    for i in udb:
        uniqueMonthDay = i.find_all('div')

        uniqueMonth.append(uniqueMonthDay[0].text)
        uniqueDay.append(uniqueMonthDay[1].text)
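Whether this applies depends on how the bucket was built, which can be checked with len as suggested in the comments. The sketch below assumes one inner list per scraped page and uses plain strings as hypothetical stand-ins for the div elements; unique_dates_bucket and its contents are made up for illustration.

```python
from itertools import chain

# Hypothetical stand-in: one inner list per page, strings instead of Tags.
unique_dates_bucket = [
    ['JAN 08', 'JAN 08', 'JAN 09'],   # page 1
    ['JAN 10', 'JAN 11'],             # page 2
    ['JAN 31'],                       # page 3
]

# Inspect the shape before looping: number of inner lists, then their sizes.
print(len(unique_dates_bucket))
print([len(page) for page in unique_dates_bucket])

# Indexing [0] only ever sees the first page.
first_only = list(unique_dates_bucket[0])

# Flattening visits every item on every page.
all_items = list(chain.from_iterable(unique_dates_bucket))

print(len(first_only), len(all_items))
```

If len(uniqueDatesBucket) turns out to be 1, the nested loop and the [0] version visit the same items, and the missing dates would have to be explained by how the pages were collected instead.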
