I am trying to combine the box-office tables from several pages into one DataFrame. I can grab the first table, which I think means the scraping part works, but something goes wrong when I try to bring them all together.
I've tried declaring the DataFrame early on and having each loop iteration add its table data to it.
names = {'Iron-Man',
'Incredible-Hulk-The',
'Thor',
'Iron-Man-2',
'Captain-America-The-First-Avenger',
'Avengers-The-(2012)',
'Iron-Man-3',
'Thor-The-Dark-World',
'Captain-America-The-Winter-Soldier',
'Guardians-of-the-Galaxy',
'Avengers-Age-of-Ultron',
'Ant-Man',
'Captain-America-Civil-War',
'Doctor-Strange-(2016)',
'Guardians-of-the-Galaxy-Vol-2',
'Spider-Man-Homecoming',
'Thor-Ragnarok',
'Black-Panther',
'Avengers-Infinity-War',
'Ant-Man-and-the-Wasp',
'Captain-Marvel-(2019)',
'Avengers-Endgame-(2019)'
}
This piece of code works for grabbing a single page's table:
import requests
import pandas as pd
from bs4 import BeautifulSoup

data = requests.get('https://www.the-numbers.com/movie/Iron-Man#tab=box-office')
soup = BeautifulSoup(data.text, 'html.parser')
data = []
div = soup.find('div' , {'id': 'box_office_chart'})
table = div.find('table')
tbody = table.find('tbody')
html = table.encode().decode('utf8')
dfs = pd.read_html(html,header=0)
df = dfs[0]
df
This is the loop that I expect to go through every movie and grab its table:
for name in names:
    print(name)
    data = requests.get('https://www.the-numbers.com/movie/' + name + '#tab=box-office')
    soup = BeautifulSoup(data.text, 'html.parser')
    div = soup.find('div' , {'id': 'box_office_chart'})
    table = div.find('table')
    tbody = table.find('tbody')
    html = table.encode().decode('utf8')
    dfs = pd.read_html(html,header=0)
    df2 = dfs[0]
    df2
    df.append(df2)
    print(name)
df
All of the movies printed out twice, so I know the loop at least visited each page. Here is the output, which doesn't include any of the other movies.
df Output:
            Date  Rank         Gross  % Change  Theaters  Per Theaters   Total Gross  Week     movie
0    May 2, 2008     1  $102,118,668       NaN      4105       $24,877  $102,118,668     1  Iron-Man
1    May 9, 2008     1   $51,190,629      -50%      4111       $12,452  $177,825,024     2  Iron-Man
2   May 16, 2008     2   $31,838,996      -38%      4154        $7,665  $223,124,385     3  Iron-Man
3   May 23, 2008     3   $20,447,253      -36%      3915        $5,223  $252,614,669     4  Iron-Man
4   May 30, 2008     4   $13,541,264      -34%      3650        $3,710  $276,166,336     5  Iron-Man
5    Jun 6, 2008     6    $7,477,439      -45%      2931        $2,551  $288,847,640     6  Iron-Man
6   Jun 13, 2008     7    $5,620,375      -25%      2403        $2,339  $297,918,329     7  Iron-Man
7   Jun 20, 2008     9    $4,030,272      -28%      1912        $2,108  $304,816,141     8  Iron-Man
8   Jun 27, 2008    11    $2,257,113      -44%      1379        $1,637  $309,179,318     9  Iron-Man
9    Jul 4, 2008    12    $1,459,613      -35%      1019        $1,432  $311,708,133    10  Iron-Man
10  Jul 11, 2008    14      $939,134      -36%       710        $1,323  $313,421,025    11  Iron-Man
11  Jul 18, 2008    16      $451,838      -52%       375        $1,205  $314,376,968    12  Iron-Man
12  Jul 25, 2008    22      $310,654      -31%       274        $1,134  $314,925,955    13  Iron-Man
13   Aug 1, 2008    16      $580,179      +87%       407        $1,426  $315,687,768    14  Iron-Man
14   Aug 8, 2008    19      $426,502      -26%        45        $1,236  $316,468,817    15  Iron-Man
15  Aug 15, 2008    23      $341,178      -20%       315        $1,083  $317,058,295    16  Iron-Man
16  Aug 22, 2008    29      $243,342      -29%       257          $947  $317,473,452    17  Iron-Man
17  Aug 29, 2008    33      $223,636       -8%       220        $1,017  $317,794,156    18  Iron-Man
18   Sep 5, 2008    38      $126,734      -43%       205          $618  $318,006,770    19  Iron-Man
19  Sep 12, 2008    39       $94,816      -25%       156          $608  $318,134,740    20  Iron-Man
20  Sep 19, 2008    43       $59,037      -38%       124          $476  $318,219,154    21  Iron-Man
21  Sep 26, 2008    48       $58,364       -1%       121          $482  $318,298,180    22  Iron-Man
I keep expecting the tables from the other pages to be added to df. I'm not sure where I'm going wrong.
EDIT: So I got rid of the first attempt at grabbing data and just used a bunch of elif statements to create all 22 dataframes. Thanks to everyone for the suggestions.
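For anyone landing here later: as far as I can tell, df never changed because DataFrame.append returned a new DataFrame instead of modifying df in place, and the loop threw that result away (the method has also been removed in pandas 2.0). Below is a minimal sketch of the usual pattern, which is to collect each table in a list and call pd.concat once at the end. The names collect_tables, fetch_table and fake_fetch are made up for illustration; fetch_table stands in for the requests/BeautifulSoup/read_html steps from the loop, and fake_fetch only exists so the sketch runs without hitting the site.

```python
import pandas as pd


def collect_tables(names, fetch_table):
    """Fetch one table per name and stack them into a single DataFrame.

    fetch_table is a hypothetical callable: it takes a movie name and
    returns that movie's box-office table as a DataFrame.
    """
    frames = []
    for name in names:
        table = fetch_table(name).copy()
        table['movie'] = name   # remember which page each row came from
        frames.append(table)    # list.append does modify the list in place
    # pd.concat returns a new DataFrame, so the result has to be kept
    return pd.concat(frames, ignore_index=True)


# Stand-in for the scraping code, so the sketch runs without network access.
def fake_fetch(name):
    return pd.DataFrame({'Rank': [1, 2], 'Week': [1, 2]})


combined = collect_tables(['Iron-Man', 'Thor'], fake_fetch)
print(combined)
```

With the real scraping code wrapped in a function and passed in place of fake_fetch, this would probably replace the chain of elif statements, though I have not run it against the live pages.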