
I want to extract the email, phone and name values from the code below, which sits in a SCRIPT tag (not in the body), using Beautiful Soup (Python). I understand Beautiful Soup can be used for this kind of extraction.

I tried fetching the page with the following code:

fileDetails = BeautifulSoup(urllib2.urlopen('http://www.example.com').read())
results = fileDetails.find("email:")

This Ajax request code appears only once in the page. Can we also add try/except handling so that no error is thrown if it isn't found in the page?

<script type="text/javascript" language='javascript'> 
$(document).ready( function (){
   
   $('#message').click(function(){
       alert();
   });

    $('#addmessage').click(function(){
        $.ajax({ 
            type: "POST",
            url: 'http://www.example.com',
            data: { 
                email: 'abc@g.com', 
                phone: '9999999999', 
                name: 'XYZ'
            }
        });
    });
});
</script>

Once I have these values, I also want to store them in an Excel file.

Thanks in anticipation.

mrj
Chopra
  • Either provide an actual url, or the relevant `script` tag contents (better url, if possible). – alecxe Aug 04 '14 at 04:28
  • This is the example. I think, it's easy to find the text from HTML file. – Chopra Aug 04 '14 at 04:34
  • I'm sorry for bothering, but this is not clear. How can we help if we cannot see where exactly the desired data is inside the HTML? Please be more specific. The better you ask, the more chances of a good and fast answer you have at the topic. – alecxe Aug 04 '14 at 04:35
  • I am sorry for that. It's in ajax script. Script is posted in the question. I am looking to extract 'abc@g.com','9999999999' and 'XYZ'. – Chopra Aug 04 '14 at 04:38

2 Answers


As an alternative to the regex-based approach, you can parse the JavaScript code with the slimit module, which builds an Abstract Syntax Tree and lets you collect all assignments into a dictionary:

from bs4 import BeautifulSoup
from slimit import ast
from slimit.parser import Parser
from slimit.visitors import nodevisitor


data = """
<html>
    <head>
        <title>My Sample Page</title>
        <script>
        $.ajax({
            type: "POST",
            url: 'http://www.example.com',
            data: {
                email: 'abc@g.com',
                phone: '9999999999',
                name: 'XYZ'
            }
        });
        </script>
    </head>
    <body>
        <h1>What a wonderful world</h1>
    </body>
</html>
"""

# get the script tag contents from the html
soup = BeautifulSoup(data, 'html.parser')
script = soup.find('script')

# parse js
parser = Parser()
tree = parser.parse(script.text)
fields = {getattr(node.left, 'value', ''): getattr(node.right, 'value', '')
          for node in nodevisitor.visit(tree)
          if isinstance(node, ast.Assign)}

print fields

Prints:

{u'name': u"'XYZ'", u'url': u"'http://www.example.com'", u'type': u'"POST"', u'phone': u"'9999999999'", u'data': '', u'email': u"'abc@g.com'"}

Among other fields, the dictionary contains the email, name and phone values you are interested in.
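
Note that the string values in the printed dictionary still carry their JavaScript quotes (for example `"'XYZ'"`). A minimal sketch of cleaning that up, assuming the dictionary looks like the output above; the `unquote` helper is hypothetical (it is not part of slimit) and the snippet is written for Python 3:

```python
# Values copied from the output printed above; the JS quotes are still attached.
fields = {
    "name": "'XYZ'",
    "url": "'http://www.example.com'",
    "type": '"POST"',
    "phone": "'9999999999'",
    "data": "",
    "email": "'abc@g.com'",
}


def unquote(value):
    """Strip one pair of matching surrounding quotes, if present."""
    if len(value) >= 2 and value[0] == value[-1] and value[0] in "'\"":
        return value[1:-1]
    return value


# Keep only the three fields the question asks about.
contact = {key: unquote(fields.get(key, "")) for key in ("email", "phone", "name")}
print(contact)
```

Using `fields.get(key, "")` rather than `fields[key]` means a missing field yields an empty string instead of a `KeyError`.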

Hope that helps.

alecxe

You can get the script tag contents via BeautifulSoup and then apply a regex to get the desired data.

Working example (based on what you've described in the question):

import re
from bs4 import BeautifulSoup

data = """
<html>
    <head>
        <title>My Sample Page</title>
        <script>
        $.ajax({
            type: "POST",
            url: 'http://www.example.com',
            data: {
                email: 'abc@g.com',
                phone: '9999999999',
                name: 'XYZ'
            }
        });
        </script>
    </head>
    <body>
        <h1>What a wonderful world</h1>
    </body>
</html>
"""

soup = BeautifulSoup(data, 'html.parser')
script = soup.find('script')

# raw string so that \w reaches the regex engine unchanged
pattern = re.compile(r"(\w+): '(.*?)'")
fields = dict(pattern.findall(script.text))
print fields['email'], fields['phone'], fields['name']

Prints:

abc@g.com 9999999999 XYZ

I don't really like this solution, since the regex approach is fragile: all sorts of things could break it. I still think there is a better solution and that we are missing the bigger picture here. A link to the specific site would help a lot, but it is what it is.
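
To illustrate the fragility, the pattern above only matches single-quoted values with exactly one space after the colon, and it also picks up `url`. A slightly more tolerant sketch, still regex-based and still not a real JavaScript parser, first isolates the `data: {...}` object and then accepts either quote style. The sample `script_text` is an assumed variation of the script from the question (the phone value is double-quoted on purpose), and the snippet is written for Python 3:

```python
import re

# Assumed stand-in for script.text; phone uses double quotes on purpose.
script_text = """
$.ajax({
    type: "POST",
    url: 'http://www.example.com',
    data: {
        email: 'abc@g.com',
        phone: "9999999999",
        name: 'XYZ'
    }
});
"""

# Step 1: isolate the object literal passed as `data:`.
# Limitation: a value containing "}" would cut the block short.
data_block = re.search(r"data\s*:\s*\{(.*?)\}", script_text, re.DOTALL)

# Step 2: match key/value pairs with either quote style;
# \2 requires the closing quote to equal the opening one.
pair_pattern = re.compile(r"""(\w+)\s*:\s*(['"])(.*?)\2""")

fields = {}
if data_block:
    fields = {
        key: value
        for key, _quote, value in pair_pattern.findall(data_block.group(1))
    }

print(fields)
```

If no `data:` object is found, `fields` simply stays empty, so nothing is raised at this stage.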


UPD (fixing the code OP provided):

soup = BeautifulSoup(data, 'html.parser')
script = soup.html.find_next_sibling('script', text=re.compile(r"\$\(document\)\.ready"))

pattern = re.compile(r"(\w+): '(.*?)'")
fields = dict(pattern.findall(script.text))
print fields['email'], fields['phone'], fields['name']

Prints:

abcd@gmail.com 9999999999 Shamita Shetty
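
The question also asks about storing the values in an Excel file, which the code above does not cover. A minimal sketch, assuming a CSV file (which Excel can open) is acceptable and that the extracted values are already in a dictionary; the `save_contacts` helper and the `contacts.csv` file name are hypothetical, and the snippet is written for Python 3:

```python
import csv
import os
import tempfile

COLUMNS = ["email", "phone", "name"]


def save_contacts(rows, path):
    """Write a list of dicts to a CSV file with a header row."""
    # newline="" lets the csv module control line endings itself.
    with open(path, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=COLUMNS)
        writer.writeheader()
        writer.writerows(rows)


# Example row using the sample values from the question.
rows = [{"email": "abc@g.com", "phone": "9999999999", "name": "XYZ"}]

# Hypothetical output location; any writable path would do.
path = os.path.join(tempfile.gettempdir(), "contacts.csv")
save_contacts(rows, path)
```

A native `.xlsx` file would need a third-party library; CSV is used here only because it requires nothing beyond the standard library.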
alecxe
  • The code is working like a charm. I got a syntax error. This small x is creating a problem in the following code: It says, SyntaxError: Non-ASCII character '\xc3' in file import.py on line 17, but no encoding declared; see http://www.python.org/peps/pep-0263.html for details – Chopra Aug 04 '14 at 05:12
  • Added this: `# -*- coding: utf-8 -*-` `import os`, and it's working fine. – Chopra Aug 04 '14 at 05:14
  • hey @alecxe, I have updated the script. I am getting an error now. It is because – Chopra Aug 04 '14 at 05:22
  • @Chopra what error? I've pasted that script code you've provided into my example and it works as before. – alecxe Aug 04 '14 at 05:24
  • I just added – Chopra Aug 04 '14 at 05:26
  • Could you please help me out with the question? I have searched for this solution but couldn't find it. I would appreciate it. – Chopra Aug 05 '14 at 14:30
  • @Chopra ok, let's revisit this. Sorry, but posting a duplicate question is not the way to go really. Ok, show the html code containing that `script` tag. Ideally, the whole `head` part of the html (or link to the web-site). Thanks. – alecxe Aug 05 '14 at 14:31
  • I am sorry for that. I thought, I'll get a reply of this question. Here is the link of the script. https://www.dropbox.com/s/lpe1nlxli3a0pn4/import.py – Chopra Aug 05 '14 at 14:43
  • @Chopra this is totally a game changer since the `script` tag lives outside the `html`. Though there is a solution. Please see the `UPD` section. Works for me. – alecxe Aug 05 '14 at 14:59
  • This works for me. Thanks @alecxe. Is the following code okay for handling exceptions: try: print fields['email'], fields['phone'], fields['name'] except: pass – Chopra Aug 05 '14 at 15:20
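
On the exception-handling question in the last comment: a bare `except: pass` would also hide unrelated errors. One possible alternative is to avoid raising in the first place. This sketch assumes the same regex as in the answer and that a missing script tag is represented by `None` or an empty string; the `extract_contact` helper name is hypothetical, and the snippet is written for Python 3:

```python
import re

PAIR_PATTERN = re.compile(r"(\w+): '(.*?)'")


def extract_contact(script_text):
    """Return (email, phone, name); any missing piece becomes None."""
    if not script_text:
        # No script tag was found (or it was empty): nothing to extract.
        return None, None, None
    fields = dict(PAIR_PATTERN.findall(script_text))
    # dict.get returns None instead of raising KeyError for absent keys.
    return fields.get("email"), fields.get("phone"), fields.get("name")


print(extract_contact("email: 'abc@g.com', phone: '9999999999', name: 'XYZ'"))
print(extract_contact(None))
```

With Beautiful Soup, the caller would pass `script.text` when `script` is not `None`, and `None` otherwise.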