1

I have the following code:

from urllib.request import urlopen
from bs4 import BeautifulSoup
import pandas as pd
import requests 

from requests import get

date = []
tourney_round = []
result = []
winner_odds = []
loser_odds = []
surface = []
players_and_tourney = []

response = get('http://www.tennisexplorer.com/player/humbert-e2553/?annual=all')

page_html = BeautifulSoup(response.text, 'html.parser')

results2018_containers = page_html.find_all('div', id = 'matches-2018-1-data')

for container in results2018_containers:
    played_date_2018 = results2018_containers[0].findAll('td', class_ = 'first time')
    for i in played_date_2018:
        date.append(i.text)

string_2018 = '2018'
date = [x + string_2018 for x in date]

for container in results2018_containers:
    rounds_2018 = results2018_containers[0].findAll('td', class_ = 'round')
    for i in rounds_2018:
        tourney_round.append(i.text)

for container in results2018_containers:
    results_2018 = results2018_containers[0].findAll('td', class_ = 'tl')
    for i in results_2018:
        result.append(i.text)

for container in results2018_containers:
    surfaces_2018 = results2018_containers[0].findAll('td', class_ = 's-color')
    for i in surfaces_2018:
        surface.append(i.find('span')['title'])

for container in results2018_containers:
    odds_2018 = results2018_containers[0].findAll('td', class_ = 'course')

winner_odds_2018 = odds_2018[0:][::2]
for i in winner_odds_2018:
    winner_odds.append(i.text)

loser_odds_2018 = odds_2018[1:][::2]
for i in loser_odds_2018:
    loser_odds.append(i.text)

for container in results2018_containers:
    namesandtourney_2018 = results2018_containers[0].findAll('td', class_ = 't-name')
    for i in namesandtourney_2018:
        players_and_tourney.append(i.text)

from itertools import chain, groupby, repeat

chainer = chain.from_iterable

def condition(x):
    return x.startswith('\xa0')

elements = [list(j) for i, j in groupby(players_and_tourney, key=condition) if not i]

# create list of headers
headers = [next(j) for i, j in groupby(players_and_tourney, key=condition) if i]

# chain list of lists, and use repeat for headers
initial_df_2018 = pd.DataFrame({'Date': date,
                'Surface': surface,
                'Players': list(chainer(elements)),
                'Tournament': list(chainer(repeat(i, j) for i, j in \
                         zip(headers, map(len, elements)))),
                'Round': tourney_round,
                'Result': result,
                'Winner Odds': winner_odds,
                'Loser Odds' : loser_odds})

# split "winner - loser" into two columns
initial_df_2018[['Winner', 'Loser']] = initial_df_2018['Players'].str.split(' - ', n=1, expand=True)
del initial_df_2018['Players']

initial_df_2018 = initial_df_2018[['Date','Surface','Tournament','Winner','Loser','Result','Winner Odds','Loser Odds']]

I want to create a loop that runs the code for every year starting from 2005. So basically, the loop would replace 2018 throughout the code with each year between 2005 and 2018. If possible, the code would run first for the year 2018, then 2017, and so on until 2005.

Edit: I added the code that I used to pull data for the year 2018, but I want a loop that will pull data for all the years that can be found on the page.

Sd Junk
  • can you give the `site` and more information ? – Druta Ruslan Jun 10 '18 at 12:03
  • Please add more information. What response you will get from and what the response text will contain etc.? – hygull Jun 10 '18 at 12:03
  • @RishikeshAgrawani I am trying to scrape this page, for example: http://www.tennisexplorer.com/player/humbert-e2553/?annual=all . I know how to scrape each element like date, tournament, rounds and so on for each year (say 2018), but I would want a loop that lets me scrape all years that appear on the page. – Sd Junk Jun 10 '18 at 14:41

4 Answers

2

If I understood you correctly, you want to repeat the 2018 request for every year between 2005 and 2018.

What I did was loop over your code for the years in that range, replacing the id each time and storing each year's data in a dictionary keyed by year.

response = get('http://www.example.com')

page_html = BeautifulSoup(response.text, 'html.parser')
date_dict = {}

for year in range(2019, 1, -1):
    date = []
    string_id = "played-{}-data".format(year)
    results_containers = page_html.find_all('div', id=string_id)

    # find_all returns an empty list (never None) when nothing matches
    if not results_containers:
        continue
    for container in results_containers:
        played_date = container.findAll('td', class_='plays')
        for i in played_date:
            date.append(i.text)
    if year not in date_dict:
        date_dict[year] = []
    date_dict[year] += date
Rohi
  • Loop should run for years from 2018 to 2005 not from 2005 to 2018. – hygull Jun 10 '18 at 12:15
  • @RishikeshAgrawani fixed. – Rohi Jun 10 '18 at 12:16
  • Why create the range and reverse it when it is more efficient to just create it how you want it the first time? Start, stop before and step. Step can be negative. – Zev Jun 10 '18 at 12:26
  • @Zev You are completely correct, it's just that I had already written the code and this was the fastest fix. I also find it more readable (but that might be just me). – Rohi Jun 10 '18 at 12:31
  • It is very readable that way. It does not matter for this case, but for very long ranges would we hit an issue or is Python smart enough to do this as quickly? – Zev Jun 10 '18 at 12:35
  • @Zev O(n), but it really doesn't have much to do with his request, just wanted to make it easy for him to understand :) – Rohi Jun 10 '18 at 12:37
  • @Rohi Thank you for providing the code. It is working for pages when data goes as far back as 2005. However on some pages, data only starts at other years like 2012 or so, and results_containers returns nothing when it is the case. Any way to work around that? Also, I wanted to have played_date_2018 because the dates on the data do not feature the year so I wanted to be able to add the year to each element of each yearly list before putting them together in one big list. – Sd Junk Jun 10 '18 at 14:04
  • @SdJunk You shouldn't have an issue when no data is returned (the loop will not run). If you want to link data to a year, you could use a dict and add the data to it (the key would be the year). – Rohi Jun 10 '18 at 14:06
  • @SdJunk Updated. – Rohi Jun 10 '18 at 14:09
  • @Rohi Actually, I want the loop to run for every year from 2018 going into the furthest year it can find in the past on one page. That's why I chose 2005 because it's the furthest all data on the site goes back to. And on some pages it only goes back to years like 2012 or so. – Sd Junk Jun 10 '18 at 14:19
  • @SdJunk I fixed it so it would keep running until no data is received from the server. (If the container == None). – Rohi Jun 10 '18 at 14:25
  • @Rohi Thank you, but now it still returns nothing. Maybe I will try to find another way to scrape the data. – Sd Junk Jun 10 '18 at 14:38
  • @SdJunk Do you mean the results container is always empty? – Rohi Jun 10 '18 at 15:00
  • @Rohi Yes. The results container is always empty with the last edit you provided. I added the script I used to scrape the data for the year 2018 to the initial post. Maybe you could take a look at it to better understand what I am trying to do. – Sd Junk Jun 10 '18 at 19:08
2

You can store the year as an integer but still use it in a string.

for year in range(2018, 2004, -1):
    print(f"Happy New Year {year}")

Other ways to include a number in a string are "Happy New Year {}".format(year) or "it is now " + str(year) + " more text".
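
Applied to the question, the same idea builds the container id for each year. This is only a sketch: the id pattern is copied from the question's 2018 example (`matches-2018-1-data`), and it is an assumption that the other years on the page follow the same pattern.

```python
# Three equivalent ways to build the per-year id string.
# The pattern 'matches-<year>-1-data' is assumed from the 2018 example.
years = range(2018, 2004, -1)  # 2018 down to 2005

ids_fstring = [f"matches-{year}-1-data" for year in years]
ids_format = ["matches-{}-1-data".format(year) for year in years]
ids_concat = ["matches-" + str(year) + "-1-data" for year in years]

print(ids_fstring[0])    # first id, for 2018
print(ids_fstring[-1])   # last id, for 2005
```

Each of these strings could then be passed as the `id` argument to `find_all`.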

Also, I don't think you do, but if someone finds this and really wants to "iterate a string", Caesar ciphers are a good place to look.

Zev
1

Looping over that is no problem, but you need to decide how you want your results stored. I used a dictionary here, and I've turned your code into a function that I can call with the year as an argument:

def get_data(year):
    date =[]

    response = get('http://www.example.com')

    page_html = BeautifulSoup(response.text, 'html.parser')

    results_containers = page_html.find_all('div', id = 'played-{year}-data'.format(year=year))

    for container in results_containers:
        played_date = results_containers[0].findAll('td', class_ = 'plays')
        for i in played_date:
            date.append(i.text)

    return date

Now all I have to do is create a range of possible years and call the function for each one. This can be done as simply as:

all_data = {year: get_data(year) for year in range(2018, 2004, -1)}
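
Since the dates on the page do not include the year, a dictionary like this also makes it easy to append the year afterwards and flatten everything into one list. The sample values below are hypothetical stand-ins for scraped text, and the day/month format is an assumption.

```python
# Hypothetical per-year results, shaped like the all_data dictionary:
# each key is a year, each value the list of date strings scraped for it.
all_data = {
    2018: ['04.06.', '28.05.'],
    2017: ['13.11.'],
    2016: [],  # a year with no matching container yields an empty list
}

# Append the year to every date and flatten, newest year first.
dates = [
    day_month + str(year)
    for year in sorted(all_data, reverse=True)
    for day_month in all_data[year]
]
print(dates)
```

Years with no data contribute nothing to the flattened list, so they need no special handling.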
Ofer Sadan
1

Just use a for loop over a range. Something like:

date =[]

response = get('http://www.example.com')

page_html = BeautifulSoup(response.text, 'html.parser')

for year in range(2018, 2004, -1):
   year_id = 'played-{}-data'.format(year)
   results_containers = page_html.find_all('div', id=year_id)

   ...
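
If some pages do not go back as far as 2005, the years could be read off the page itself instead of being hard-coded. The following is a rough sketch using a regular expression on the raw HTML. It assumes the ids follow the `matches-<year>-1-data` pattern from the question's 2018 example; `sample_html` and `available_years` are made-up names, with `sample_html` standing in for `response.text`.

```python
import re

# Hypothetical stand-in for response.text; the real page is assumed to use
# ids of the form matches-<year>-1-data, as in the question's 2018 example.
sample_html = """
<div id="matches-2018-1-data"></div>
<div id="matches-2017-1-data"></div>
<div id="matches-2012-1-data"></div>
<div id="other-block"></div>
"""

def available_years(html):
    """Return the years that have a matches container, newest first."""
    found = re.findall(r'id="matches-(\d{4})-1-data"', html)
    return sorted({int(y) for y in found}, reverse=True)

years = available_years(sample_html)
container_ids = ['matches-{}-1-data'.format(y) for y in years]
print(years)
print(container_ids)
```

The loop can then run over `years` rather than a fixed range, so it stops at whatever the oldest year on that page happens to be.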
Gelineau