How can I get the web-scraped data into one dataframe?

Question

I am trying to combine the tables from several pages into one dataframe. I can grab the first table, so I think the scraping itself works, but something goes wrong when I try to bring all of the tables together.

I've tried to declare the dataframe early on and then have the table data fill it in on every loop iteration.

names = {'Iron-Man',
        'Incredible-Hulk-The',
        'Thor',
        'Iron-Man-2',
        'Captain-America-The-First-Avenger',
        'Avengers-The-(2012)',
        'Iron-Man-3',
        'Thor-The-Dark-World',
        'Captain-America-The-Winter-Soldier',
        'Guardians-of-the-Galaxy',
        'Avengers-Age-of-Ultron',
        'Ant-Man',
        'Captain-America-Civil-War',
        'Doctor-Strange-(2016)',
        'Guardians-of-the-Galaxy-Vol-2',
        'Spider-Man-Homecoming',
        'Thor-Ragnarok',
        'Black-Panther',
        'Avengers-Infinity-War',
        'Ant-Man-and-the-Wasp',
        'Captain-Marvel-(2019)',
        'Avengers-Endgame-(2019)'
         }

This piece of code works for grabbing a single page's table:

    data = requests.get('https://www.the-numbers.com/movie/Iron-Man#tab=box-office')
    soup = BeautifulSoup(data.text, 'html.parser')

    data = []

    div = soup.find('div' , {'id': 'box_office_chart'})
    table = div.find('table')
    tbody = table.find('tbody')
    html = table.encode().decode('utf8')
    dfs = pd.read_html(html,header=0)
    df = dfs[0]
    df

This piece of code is where I'm expecting it to loop through everything and grab it.

for name in names:
    print(name)
    data = requests.get('https://www.the-numbers.com/movie/' + name + '#tab=box-office')
    soup = BeautifulSoup(data.text, 'html.parser')
    div = soup.find('div' , {'id': 'box_office_chart'})
    table = div.find('table')
    tbody = table.find('tbody')
    html = table.encode().decode('utf8')
    dfs = pd.read_html(html,header=0)
    df2 = dfs[0]
    df2
    df.append(df2)
    print(name)
df

All of the movie names printed twice, so I know the loop at least went to each page. Here is the output, which doesn't include any of the other movies.

df Output:

    Date            Rank    Gross           % Change    Theaters    Per Theaters    Total Gross     Week    movie
0   May 2, 2008     1       $102,118,668    NaN         4105        $24,877         $102,118,668    1       Iron-Man
1   May 9, 2008     1       $51,190,629     -50%        4111        $12,452         $177,825,024    2       Iron-Man
2   May 16, 2008    2       $31,838,996     -38%        4154        $7,665          $223,124,385    3       Iron-Man
3   May 23, 2008    3       $20,447,253     -36%        3915        $5,223          $252,614,669    4       Iron-Man
4   May 30, 2008    4       $13,541,264     -34%        3650        $3,710          $276,166,336    5       Iron-Man
5   Jun 6, 2008     6       $7,477,439      -45%        2931        $2,551          $288,847,640    6       Iron-Man
6   Jun 13, 2008    7       $5,620,375      -25%        2403        $2,339          $297,918,329    7       Iron-Man
7   Jun 20, 2008    9       $4,030,272      -28%        1912        $2,108          $304,816,141    8       Iron-Man
8   Jun 27, 2008    11      $2,257,113      -44%        1379        $1,637          $309,179,318    9       Iron-Man
9   Jul 4, 2008     12      $1,459,613      -35%        1019        $1,432          $311,708,133    10      Iron-Man
10  Jul 11, 2008    14      $939,134        -36%        710         $1,323          $313,421,025    11      Iron-Man
11  Jul 18, 2008    16      $451,838        -52%        375         $1,205          $314,376,968    12      Iron-Man
12  Jul 25, 2008    22      $310,654        -31%        274         $1,134          $314,925,955    13      Iron-Man
13  Aug 1, 2008     16      $580,179        +87%        407         $1,426          $315,687,768    14      Iron-Man
14  Aug 8, 2008     19      $426,502        -26%        45          $1,236          $316,468,817    15      Iron-Man
15  Aug 15, 2008    23      $341,178        -20%        315         $1,083          $317,058,295    16      Iron-Man
16  Aug 22, 2008    29      $243,342        -29%        257         $947            $317,473,452    17      Iron-Man
17  Aug 29, 2008    33      $223,636        -8%         220         $1,017          $317,794,156    18      Iron-Man
18  Sep 5, 2008     38      $126,734        -43%        205         $618            $318,006,770    19      Iron-Man
19  Sep 12, 2008    39      $94,816         -25%        156         $608            $318,134,740    20      Iron-Man
20  Sep 19, 2008    43      $59,037         -38%        124         $476            $318,219,154    21      Iron-Man
21  Sep 26, 2008    48      $58,364         -1%         121         $482            $318,298,180    22      Iron-Man

I keep expecting to have all of the tables from the other pages added to df. Not sure where I'm going wrong.

EDIT: So I got rid of the first attempt at grabbing data and just used a bunch of elif statements to create all 22 dataframes. Thanks to everyone for the suggestions.

One thing: the pandas `append` method does not work like `list.append`. It returns a new object instead of modifying `df` in place, so the line `df.append(df2)` should be `df = df.append(df2)`; you need to reassign `df` each time, see [this answer](https://stackoverflow.com/a/37009377/9274732). That is not best practice, though, so you could instead collect all the `df2` frames in a list and then use `concat`, see for example [this answer](https://stackoverflow.com/a/37009561/9274732) — Ben.T, May 02 '19 at 02:00
Possible duplicate of [Using pandas .append within for loop](https://stackoverflow.com/questions/37009287/using-pandas-append-within-for-loop) — Ben.T, May 02 '19 at 02:01
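
The reassignment point in the comment above can be illustrated with a minimal sketch. It uses `pd.concat`, which, like the old `DataFrame.append` (removed in pandas 2.0), returns a new object rather than modifying `df` in place. The two tiny frames are made-up stand-ins for scraped tables, not real box-office data.

```python
import pandas as pd

# Made-up stand-ins for two scraped tables (not real box-office data).
df = pd.DataFrame({"Week": [1, 2], "movie": ["Iron-Man", "Iron-Man"]})
df2 = pd.DataFrame({"Week": [1], "movie": ["Thor"]})

# The result is a new DataFrame; discarding it leaves df unchanged.
pd.concat([df, df2])
rows_without_reassign = len(df)

# Reassigning the name keeps the combined result.
df = pd.concat([df, df2], ignore_index=True)
rows_with_reassign = len(df)

print(rows_without_reassign, rows_with_reassign)  # prints: 2 3
```

This matches the symptom in the question: the loop ran for every movie, but each combined result was thrown away, so `df` still held only the first table.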

score 1 · Answer 1 · answered May 02 '19 at 09:56

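No need for all the elif statements. To append the current df from your loop to a final results df, you need to use `df = df.append(df2)`. (Note that `DataFrame.append` was removed in pandas 2.0; on newer versions, `pd.concat` does the same job.)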

import requests
import pandas as pd
from bs4 import BeautifulSoup

names = {'Iron-Man',
        'Incredible-Hulk-The',
        'Thor',
        'Iron-Man-2',
        'Captain-America-The-First-Avenger',
        'Avengers-The-(2012)',
        'Iron-Man-3',
        'Thor-The-Dark-World',
        'Captain-America-The-Winter-Soldier',
        'Guardians-of-the-Galaxy',
        'Avengers-Age-of-Ultron',
        'Ant-Man',
        'Captain-America-Civil-War',
        'Doctor-Strange-(2016)',
        'Guardians-of-the-Galaxy-Vol-2',
        'Spider-Man-Homecoming',
        'Thor-Ragnarok',
        'Black-Panther',
        'Avengers-Infinity-War',
        'Ant-Man-and-the-Wasp',
        'Captain-Marvel-(2019)',
        'Avengers-Endgame-(2019)'
         }

df = pd.DataFrame()
for name in names:
    print(name)
    url = 'https://www.the-numbers.com/movie/' + name + '#tab=box-office'
    data = requests.get(url)
    soup = BeautifulSoup(data.text, 'html.parser')
    div = soup.find('div' , {'id': 'box_office_chart'})
    table = div.find('table')
    tbody = table.find('tbody')
    html = table.encode().decode('utf8')
    dfs = pd.read_html(html,header=0)
    df2 = dfs[0]
    df2['movie'] = name
    df = df.append(df2)
    print(name)
df = df.reset_index(drop=True)

Output:

print (df)
             Date Rank  ... Week                          movie
0     Mar 8, 2019    1  ...    1          Captain-Marvel-(2019)
1    Mar 15, 2019    1  ...    2          Captain-Marvel-(2019)
2    Mar 22, 2019    2  ...    3          Captain-Marvel-(2019)
3    Mar 29, 2019    3  ...    4          Captain-Marvel-(2019)
4     Apr 5, 2019    5  ...    5          Captain-Marvel-(2019)
5    Apr 12, 2019    6  ...    6          Captain-Marvel-(2019)
6    Apr 19, 2019    4  ...    7          Captain-Marvel-(2019)
7    Apr 26, 2019    2  ...    8          Captain-Marvel-(2019)
8    Apr 27, 2018    1  ...    1          Avengers-Infinity-War
9     May 4, 2018    1  ...    2          Avengers-Infinity-War
10   May 11, 2018    1  ...    3          Avengers-Infinity-War
11   May 18, 2018    2  ...    4          Avengers-Infinity-War
12   May 25, 2018    3  ...    5          Avengers-Infinity-War
13    Jun 1, 2018    4  ...    6          Avengers-Infinity-War
14    Jun 8, 2018    5  ...    7          Avengers-Infinity-War
15   Jun 15, 2018    8  ...    8          Avengers-Infinity-War
16   Jun 22, 2018    9  ...    9          Avengers-Infinity-War
17   Jun 29, 2018   12  ...   10          Avengers-Infinity-War
18    Jul 6, 2018   15  ...   11          Avengers-Infinity-War
19   Jul 13, 2018   16  ...   12          Avengers-Infinity-War
20   Jul 20, 2018   20  ...   13          Avengers-Infinity-War
21   Jul 27, 2018   21  ...   14          Avengers-Infinity-War
22    Aug 3, 2018   24  ...   15          Avengers-Infinity-War
23   Aug 10, 2018   26  ...   16          Avengers-Infinity-War
24   Aug 17, 2018   37  ...   17          Avengers-Infinity-War
25   Aug 24, 2018   42  ...   18          Avengers-Infinity-War
26   Aug 31, 2018   44  ...   19          Avengers-Infinity-War
27    Sep 7, 2018   52  ...   20          Avengers-Infinity-War
28   Apr 26, 2019    1  ...    1        Avengers-Endgame-(2019)
29    May 5, 2017    1  ...    1  Guardians-of-the-Galaxy-Vol-2
..            ...  ...  ...  ...                            ...
367   Aug 1, 2008   16  ...   14                       Iron-Man
368   Aug 8, 2008   19  ...   15                       Iron-Man
369  Aug 15, 2008   23  ...   16                       Iron-Man
370  Aug 22, 2008   29  ...   17                       Iron-Man
371  Aug 29, 2008   33  ...   18                       Iron-Man
372   Sep 5, 2008   38  ...   19                       Iron-Man
373  Sep 12, 2008   39  ...   20                       Iron-Man
374  Sep 19, 2008   43  ...   21                       Iron-Man
375  Sep 26, 2008   48  ...   22                       Iron-Man
376   Jul 7, 2017    1  ...    1          Spider-Man-Homecoming
377  Jul 14, 2017    2  ...    2          Spider-Man-Homecoming
378  Jul 21, 2017    3  ...    3          Spider-Man-Homecoming
379  Jul 28, 2017    5  ...    4          Spider-Man-Homecoming
380   Aug 4, 2017    6  ...    5          Spider-Man-Homecoming
381  Aug 11, 2017    7  ...    6          Spider-Man-Homecoming
382  Aug 18, 2017    7  ...    7          Spider-Man-Homecoming
383  Aug 25, 2017    7  ...    8          Spider-Man-Homecoming
384   Sep 1, 2017    7  ...    9          Spider-Man-Homecoming
385   Sep 8, 2017    7  ...   10          Spider-Man-Homecoming
386  Sep 15, 2017    9  ...   11          Spider-Man-Homecoming
387  Sep 22, 2017   11  ...   12          Spider-Man-Homecoming
388  Sep 29, 2017   18  ...   13          Spider-Man-Homecoming
389   Oct 6, 2017   20  ...   14          Spider-Man-Homecoming
390  Oct 13, 2017   20  ...   15          Spider-Man-Homecoming
391  Oct 20, 2017   27  ...   16          Spider-Man-Homecoming
392  Oct 27, 2017   33  ...   17          Spider-Man-Homecoming
393   Nov 3, 2017   37  ...   18          Spider-Man-Homecoming
394  Nov 10, 2017   42  ...   19          Spider-Man-Homecoming
395  Nov 17, 2017   46  ...   20          Spider-Man-Homecoming
396  Nov 24, 2017   51  ...   21          Spider-Man-Homecoming

[397 rows x 9 columns]
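
For pandas 2.0 and later, where `DataFrame.append` no longer exists, the same loop can be written by collecting the per-movie frames in a list and concatenating once at the end. The sketch below is an illustration under assumptions, not the answerer's code: `fetch_table` is a hypothetical placeholder for the `requests` / `BeautifulSoup` / `pd.read_html` steps and returns made-up rows so the example runs offline. It also uses a list for `names`; the original uses a set, which has no guaranteed iteration order, and that is why the output above does not follow the order in which the movies were written.

```python
import pandas as pd


def fetch_table(name):
    """Hypothetical placeholder for the scraping steps.

    In the real loop this would download the page for `name` and
    return dfs[0] from pd.read_html; here it returns made-up rows.
    """
    return pd.DataFrame({"Week": [1, 2], "Gross": ["$10", "$5"]})


# A list (not a set) keeps the movies in the order written.
names = ["Iron-Man", "Thor"]

frames = []
for name in names:
    df2 = fetch_table(name)
    df2["movie"] = name      # tag each row with its movie
    frames.append(df2)       # list.append does modify the list in place

# One concat at the end; ignore_index=True renumbers rows 0..n-1,
# so no separate reset_index call is needed.
df = pd.concat(frames, ignore_index=True)
print(df)
```

Concatenating once also avoids copying the growing dataframe on every iteration, which is the reason the list-plus-`concat` pattern is usually preferred over appending inside the loop.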

@Spence if the solution was what you needed, be sure to accept the solution above by clicking on the "check" — chitown88, May 30 '19 at 14:24
