
I'm new to Python and just got started with it today.
My environment is Python 3.5 with some libraries on Windows 10.

I want to extract football player data from the site below as a CSV file.

Problem: I cannot extract the data from soup.find_all('script')[17] into my expected CSV format. How can I extract the data the way I want?

My code is shown below.

from bs4 import BeautifulSoup
import re
from urllib.request import Request, urlopen

req = Request('http://www.futhead.com/squad-building-challenges/squads/343', headers={'User-Agent': 'Mozilla/5.0'})
webpage = urlopen(req).read()
soup = BeautifulSoup(webpage, 'html.parser')  # not sure if I need to use lxml
soup.find_all('script')[17]  # my target data is in the script tag at index 17

My expected output would be similar to this:

position,slot_position,slug
ST,ST,paulo-henrique
LM,LM,mugdat-celik
nisahc

2 Answers


My understanding is that BeautifulSoup is better suited to HTML parsing, but you are trying to parse JavaScript nested in the HTML.

So you have two options:

  1. Simply create a function that takes the result of soup.find_all('script')[17], loops over it, and searches the string manually for the data to extract. You can even use ast.literal_eval(string_thats_really_a_dictionary) to make it easier, as long as the string is also a valid Python literal. This may not be the best approach, but if you are new to Python you might want to do it this way just for practice.
  2. Use the json library, like in this example, or alternatively this way. This is probably the better way to do it.
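
As a rough sketch of how the two options can be combined, the snippet below pulls the JSON argument out of the script text with a regular expression and hands it to json.loads. The script_text sample is a hypothetical stand-in for what the script tag might contain (modelled on the squad.register_players($.parseJSON('...')) call); the real page may differ.

```python
import json
import re

# Hypothetical stand-in for the text of the <script> tag; the real page's
# content may differ.
script_text = """
var squad = new Squad();
squad.register_players($.parseJSON('[{"position": "ST", "player": 1, "data": {"slot_position": "ST", "slug": "paulo-henrique"}}, {"position": "LM", "player": null, "data": {}}]'));
squad.render();
"""

# Capture everything between $.parseJSON(' and '));
pattern = r"squad\.register_players\(\$\.parseJSON\('(.*)'\)\);"
match = re.search(pattern, script_text)

# Fall back to an empty list if the call is not present in the script.
players = json.loads(match.group(1)) if match else []

# Keep only the slots that actually hold a player.
rows = [
    (p["position"], p["data"]["slot_position"], p["data"]["slug"])
    for p in players
    if p["player"] is not None
]
print(rows)
```

json.loads is used here rather than ast.literal_eval because JSON values such as null, true and false are not Python literals.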
0

As @josiah Swain said, it's not going to be pretty. For this sort of thing, JavaScript is usually recommended, since it can understand what you have.

That said, Python is awesome, and here is your solution!

#Same imports as before
from bs4 import BeautifulSoup
import re
from urllib.request import Request, urlopen

#And one more
import json

# The code you had 
req = Request('http://www.futhead.com/squad-building-challenges/squads/343',
               headers={'User-Agent': 'Mozilla/5.0'})
webpage = urlopen(req).read()
soup = BeautifulSoup(webpage,'html.parser')

# Store the script 
script = soup.find_all('script')[17]

# Extract the one line that stores all that JSON
uncleanJson = [line for line in script.text.split('\n') 
         if line.lstrip().startswith('squad.register_players($.parseJSON') ][0]

# The easiest way to strip away all that yucky JS to get to the JSON
cleanJSON = uncleanJson.lstrip() \
                       .replace('squad.register_players($.parseJSON(\'', '') \
                       .replace('\'));','')

# Extract out that useful info
data = [ [p['position'],p['data']['slot_position'],p['data']['slug']] 
         for p in json.loads(cleanJSON)
         if p['player'] is not None]


print('position,slot_position,slug')
for line in data:
    print(','.join(line))

The result I get from copying and pasting this into Python is:

position,slot_position,slug
ST,ST,paulo-henrique
LM,LM,mugdat-celik
CAM,CAM,soner-aydogdu
RM,RM,petar-grbic
GK,GK,fatih-ozturk
CDM,CDM,eray-ataseven
LB,LB,kadir-keles
CB,CB,caner-osmanpasa
CB,CB,mustafa-yumlu
RM,RM,ioan-adrian-hora
GK,GK,bora-kork
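
Since the question asks for a CSV file rather than printed output, the rows could also be written with the standard csv module. This is only a sketch: the data list below is a two-row sample in the same shape as the list built above, and write_players_csv and players.csv are hypothetical names chosen for illustration.

```python
import csv
import io

# Sample rows in the same [position, slot_position, slug] shape as `data`.
data = [
    ["ST", "ST", "paulo-henrique"],
    ["LM", "LM", "mugdat-celik"],
]

def write_players_csv(rows, fileobj):
    """Write a header line followed by one CSV line per player."""
    writer = csv.writer(fileobj, lineterminator="\n")
    writer.writerow(["position", "slot_position", "slug"])
    writer.writerows(rows)

# Write to an in-memory buffer so the result can be inspected directly.
buffer = io.StringIO()
write_players_csv(data, buffer)
csv_text = buffer.getvalue()
print(csv_text)

# To write a real file instead (hypothetical file name):
# with open("players.csv", "w", newline="", encoding="utf-8") as f:
#     write_players_csv(data, f)
```

Using csv.writer instead of ','.join also takes care of quoting if a value ever contains a comma.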

Edit: On reflection, this is not the easiest code for a beginner to read. Here is an easier-to-read version:

# ... All that previous code
script = soup.find_all('script')[17]

allScriptLines = script.text.split('\n')

uncleanJson = None
for line in allScriptLines:
    # Remove left whitespace (makes it easier to parse)
    cleaner_line = line.lstrip()
    if cleaner_line.startswith('squad.register_players($.parseJSON'):
        uncleanJson = cleaner_line
        break  # stop at the first matching line, like the version above

# Strip the JS wrapper so that only the JSON text remains
cleanJSON = uncleanJson.replace('squad.register_players($.parseJSON(\'', '').replace('\'));','')

print('position,slot_position,slug')
for player in json.loads(cleanJSON):
    if player['player'] is not None:
        # sep=',' gives comma-separated output instead of the default spaces
        print(player['position'], player['data']['slot_position'], player['data']['slug'], sep=',')
Stefan Collier
  • This works out so well. Thank you so much for taking the time to explain how to solve this problem. After reading your code, I can see it's not going to be easy for a beginner who has just started learning Python. – nisahc Oct 11 '17 at 05:55