I have a page parsed with Beautiful Soup, but it contains this JavaScript code:

<script type="text/javascript">   


var utag_data = {
            customer_id   : "_PHL2883198554", 
            customer_type : "New",
            loyalty_id : "N",
            declined_loyalty_interstitial : "false",
            site_version  : "Desktop Site",
            site_currency: "de_DE_EURO",
            site_region: "uk",
            site_language: "en-GB",


            customer_address_zip : "",
            customer_email_hash :  "",
            referral_source :  "",
            page_type : "product",
            product_category_name : ["Lingerie"],
            product_category_id :[jQuery("meta[name=defaultParent]").attr("content")],
            product_id : ["5741462261401"],
            product_image_url : ["http://images.urbanoutfitters.com/is/image/UrbanOutfitters/5741462261401_001_b?$detailmain$"],
            product_brand : ["Pretty Polly"],
            product_selling_price : ["20.0"],
            promo_id : "6",
            product_referral : ["WOMENS-SHAPEWEAR-LINGERIE-SOLUTIONS-EU"],
            product_name : ["Pretty Polly Shape It Up Tummy Shaping Camisole"],
            is_online_only : true,
            is_back_in_stock : false
}
</script>

How can I get values out of this script? Should I treat it as plain text, that is, store it in a variable, split it, and then pick out the data I need?

Thanks

user3761151

1 Answer

Once you have the text of the script, for example via

js_text = soup.find('script', type="text/javascript").text

you can use a regex to find the data. There is probably an easier way to do this, but a regex is not hard either.

import re
# Capture "name : value" on each line; the trailing comma is optional.
regex = re.compile(r'^[ \t]*(\w+)[ \t]*:[ \t]*(.*?),?[ \t]*$', re.MULTILINE)
pairs = re.findall(regex, js_text)  # list of (name, value) tuples
pairs = [(name, value.strip()) for name, value in pairs]  # strip extra whitespace

This returns a list of (name, value) tuples, which you can work with directly or convert to a dictionary with dict(pairs). The values are still raw JavaScript text, so string values keep their surrounding quotes.
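As a rough sketch of the dictionary conversion mentioned above: the snippet below works on a shortened copy of the script text from the question and builds a dict of cleaned values. The names `PAIR_RE` and `clean_value` are made up for this illustration, and the value cleaning is deliberately simple (it assumes double-quoted strings and one pair per line), so treat it as a starting point rather than a general JavaScript parser.

```python
import re

# A shortened copy of the script text from the question (assumed to be in js_text).
js_text = '''
var utag_data = {
            customer_id   : "_PHL2883198554",
            page_type : "product",
            product_id : ["5741462261401"],
            product_selling_price : ["20.0"],
            is_online_only : true,
            is_back_in_stock : false
}
'''

# One "name : value" pair per line; the trailing comma is optional.
PAIR_RE = re.compile(r'^[ \t]*(\w+)[ \t]*:[ \t]*(.*?),?[ \t]*$', re.MULTILINE)


def clean_value(raw):
    """Turn a raw JavaScript literal into a Python value where that is easy."""
    raw = raw.strip()
    if raw in ('true', 'false'):
        return raw == 'true'
    if raw.startswith('[') and raw.endswith(']'):
        # Collect the quoted strings inside the array. This is a simplification:
        # it would misread non-literal entries such as the jQuery call in the question.
        return re.findall(r'"([^"]*)"', raw)
    if len(raw) >= 2 and raw[0] == raw[-1] == '"':
        return raw[1:-1]  # drop the surrounding double quotes
    return raw


utag_data = {name: clean_value(value) for name, value in PAIR_RE.findall(js_text)}
print(utag_data['product_id'])
print(utag_data['is_online_only'])
```

Lines without a `name : value` shape, such as `var utag_data = {` and the closing `}`, simply do not match and are skipped.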

Granitosaurus
  • @user3761151 Add the re.MULTILINE flag; I forgot to mention that. Edited my answer. You can find the full documentation on how to use regex in Python here: https://docs.python.org/3.4/library/re.html – Granitosaurus Jun 22 '14 at 09:19
  • But what if I need to get a string like this: this.products = ko.observableArray([{"productId":537477, ... elements }])? Is it possible to write a regex for it? – user3761151 Jun 22 '14 at 11:08
  • @user3761151 I'm having trouble understanding what you actually need here, but with regex you can extract pretty much anything you want from the string you get. Knowing regex is vital for any string-manipulation work, so I would highly recommend dedicating an evening or two to learning it. – Granitosaurus Jun 22 '14 at 14:20
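For the `ko.observableArray([...])` case raised in the comments, one possible approach is to capture the array with a regex and hand it to `json.loads`. This is only a sketch: it assumes the argument is valid JSON (double-quoted keys, as the fragment in the comment suggests), and every field in the sample string other than `"productId":537477` is invented for illustration.

```python
import json
import re

# Hypothetical script text; only "productId":537477 comes from the comment,
# the "name" field is invented for illustration.
js_text = 'this.products = ko.observableArray([{"productId":537477,"name":"Example"}]);'

# Capture the [...] argument of ko.observableArray(...), stopping at the first "])".
match = re.search(r'ko\.observableArray\((\[.*?\])\)', js_text, re.DOTALL)

# Parse the captured text as JSON; fall back to an empty list if nothing matched.
products = json.loads(match.group(1)) if match else []
print(products[0]['productId'])
```

Because the pattern stops at the first `])`, it would cut the capture short if that sequence also appeared inside the data, for example within a string value.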