I am having trouble selecting this 'div' element with Beautiful Soup and then parsing the data inside it.

First I have to decode the HTML entities, as the tool on this website does (https://mothereff.in/html-entities).

What steps would I take to, for example, programmatically select

(extraLarge:'/s3/fhphotos/CIRD-72K6-H9_SID_1.jpg?width=1000&height=1000&mode=max')

from the code below

<div data-bind="component: { name: &#39;product-detail&#39;, params: {hasVariants:true,name:&#39;BROOKS LOUNGE CHAIR&#39;,hasCategory:true,superCategoryName:&#39;Furniture&#39;,categoryDisplayName:&#39;Living Room&#39;,categorySlug:&#39;living-room&#39;,subcategoryDisplayName:&#39;Chairs&#39;,subcategorySlug:&#39;chairs&#39;,collection:{id:1529,name:&#39;Irondale&#39;,description:&#39;Each piece is a striking conversation-starter. Tables are made from reclaimed doors paired with salvaged architecture or old machine parts. Storage solutions are inspired by libraries of the 1940’s. Cast iron beds with linen panels as well as seating in linen, lush velvet and top-grain leather offer a distinctive found feel.&#39;,isFeatured:true,isNew:false,image:&#39;/FourHandsMarketplace/media/General/Featured%20Collections/IRONDALE.jpg?width=500&#39;,shortDescription:&#39;Moving from Parisian flea market to modern to industrial, understated elegance is a common theme. Waxed leathers and distressed irons mix with fabrics for an intriguing style blend.\r\n&#39;,uri:&#39;/collections/irondale&#39;},attributes:[{id:384,name:&#39;COVER&#39;,displayOrder:30,swatches:true,values:[{id:12710,name:&#39;EBONY&#39;,displayOrder:1,swatchUrl:&#39;/s3/fhphotos/Y C11458-G6_PRM_1.jpg?width=200&amp;height=200&amp;mode=crop&#39;},{id:12711,name:&#39;STONEWASH DARK GREEN&#39;,displayOrder:2,swatchUrl:&#39;/s3/fhphotos/Y C11458-H9_PRM_1.jpg?width=200&amp;height=200&amp;mode=crop&#39;}]},{id:385,name:&#39;FINISH&#39;,displayOrder:40,swatches:true,values:[{id:12712,name:&#39;BLACK WASH WEATHERED&#39;,displayOrder:1,swatchUrl:&#39;/s3/fhphotos/Y C11458-K5_PRM_1.jpg?width=200&amp;height=200&amp;mode=crop&#39;},{id:12713,name:&#39;DISTRESSED WASHED OLD OAK&#39;,displayOrder:2,swatchUrl:&#39;/s3/fhphotos/Y C11458-K6_PRM_1.jpg?width=200&amp;height=200&amp;mode=crop&#39;}]}],products:[{attributeValueIds:[12710,12712],description:&#39;Our take on the classic Adirondack emphasizes comfort with thick, top-grain 
leather cushioning. Wire-brushed oak is finished in black and hand-distressed for a naturally weathered patina.&#39;,dimensions:&#39;W: 27.75&quot; H: 29&quot; D: 34.75&quot;&#39;,availabilityDescription:&#39;&lt;strong>Quantity in Stock: &lt;/strong>&lt;span >88&lt;/span>&lt;br />&lt;strong>More on the Way: &lt;/strong>&lt;span >Yes&lt;/span>&lt;br />&lt;strong>Estimated Arrival Date: &lt;/strong>&lt;span >1 to 2 weeks&lt;/span>&#39;,colors:[&#39;Black Washed Weathered&#39;,&#39;Ebony&#39;],weightPounds:45.0,volumeCubicFeet:18.72,images:[{order:1,thumb:&#39;/s3/fhphotos/CIRD-72K5-G6H6_PRM_1.jpg?width=200&amp;height=200&amp;mode=crop&#39;,large:&#39;/s3/fhphotos/CIRD-72K5-G6H6_PRM_1.jpg?width=600&amp;height=600&amp;mode=max&#39;,extraLarge:&#39;/s3/fhphotos/CIRD-72K5-G6H6_PRM_1.jpg?width=1000&amp;height=1000&amp;mode=max&#39;,full:&#39;http://fhphotos.s3-website-us-east-1.amazonaws.com/CIRD-72K5-G6H6_PRM_1.jpg&#39;},{order:2,thumb:&#39;/s3/fhphotos/CIRD-72K5-G6H6_ROM_1.jpg?width=200&amp;height=200&amp;mode=crop&#39;,large:&#39;/s3/fhphotos/CIRD-72K5-G6H6_ROM_1.jpg?width=600&amp;height=600&amp;mode=max&#39;,extraLarge:&#39;/s3/fhphotos/CIRD-72K5-G6H6_ROM_1.jpg?width=1000&amp;height=1000&amp;mode=max&#39;,full:&#39;http://fhphotos.s3-website-us-east-1.amazonaws.com/CIRD-72K5-G6H6_ROM_1.jpg&#39;},{order:3,thumb:&#39;/s3/fhphotos/CIRD-72K5-G6H6_ROM_2.jpg?width=200&amp;height=200&amp;mode=crop&#39;,large:&#39;/s3/fhphotos/CIRD-72K5-G6H6_ROM_2.jpg?width=600&amp;height=600&amp;mode=max&#39;,extraLarge:&#39;/s3/fhphotos/CIRD-72K5-G6H6_ROM_2.jpg?width=1000&amp;height=1000&amp;mode=max&#39;,full:&#39;http://fhphotos.s3-website-us-east-1.amazonaws.com/CIRD-72K5-G6H6_ROM_2.jpg&#39;},{order:4,thumb:&#39;/s3/fhphotos/CIRD-72K5-G6H6_DET_1.jpg?width=200&amp;height=200&amp;mode=crop&#39;,large:&#39;/s3/fhphotos/CIRD-72K5-G6H6_DET_1.jpg?width=600&amp;height=600&amp;mode=max&#39;,extraLarge:&#39;/s3/fhphotos/CIRD-72K5-G6H6_DET_1.jpg?width=1000&amp;height=1000&amp;mode=max&#39;,full:&#
39;http://fhphotos.s3-website-us-east-1.amazonaws.com/CIRD-72K5-G6H6_DET_1.jpg&#39;},{order:5,thumb:&#39;/s3/fhphotos/CIRD-72K5-G6H6_DET_2.jpg?width=200&amp;height=200&amp;mode=crop&#39;,large:&#39;/s3/fhphotos/CIRD-72K5-G6H6_DET_2.jpg?width=600&amp;height=600&amp;mode=max&#39;,extraLarge:&#39;/s3/fhphotos/CIRD-72K5-G6H6_DET_2.jpg?width=1000&amp;height=1000&amp;mode=max&#39;,full:&#39;http://fhphotos.s3-website-us-east-1.amazonaws.com/CIRD-72K5-G6H6_DET_2.jpg&#39;},{order:6,thumb:&#39;/s3/fhphotos/CIRD-72K5-G6H6_BCK_1.jpg?width=200&amp;height=200&amp;mode=crop&#39;,large:&#39;/s3/fhphotos/CIRD-72K5-G6H6_BCK_1.jpg?width=600&amp;height=600&amp;mode=max&#39;,extraLarge:&#39;/s3/fhphotos/CIRD-72K5-G6H6_BCK_1.jpg?width=1000&amp;height=1000&amp;mode=max&#39;,full:&#39;http://fhphotos.s3-website-us-east-1.amazonaws.com/CIRD-72K5-G6H6_BCK_1.jpg&#39;},{order:7,thumb:&#39;/s3/fhphotos/CIRD-72K5-G6H6_FRT_1.jpg?width=200&amp;height=200&amp;mode=crop&#39;,large:&#39;/s3/fhphotos/CIRD-72K5-G6H6_FRT_1.jpg?width=600&amp;height=600&amp;mode=max&#39;,extraLarge:&#39;/s3/fhphotos/CIRD-72K5-G6H6_FRT_1.jpg?width=1000&amp;height=1000&amp;mode=max&#39;,full:&#39;http://fhphotos.s3-website-us-east-1.amazonaws.com/CIRD-72K5-G6H6_FRT_1.jpg&#39;},{order:8,thumb:&#39;/s3/fhphotos/CIRD-72K5-G6H6_SID_1.jpg?width=200&amp;height=200&amp;mode=crop&#39;,large:&#39;/s3/fhphotos/CIRD-72K5-G6H6_SID_1.jpg?width=600&amp;height=600&amp;mode=max&#39;,extraLarge:&#39;/s3/fhphotos/CIRD-72K5-G6H6_SID_1.jpg?width=1000&amp;height=1000&amp;mode=max&#39;,full:&#39;http://fhphotos.s3-website-us-east-1.amazonaws.com/CIRD-72K5-G6H6_SID_1.jpg&#39;},{order:9,thumb:&#39;/s3/fhphotos/CIRD-72K5-G6H6_ROM_3.jpg?width=200&amp;height=200&amp;mode=crop&#39;,large:&#39;/s3/fhphotos/CIRD-72K5-G6H6_ROM_3.jpg?width=600&amp;height=600&amp;mode=max&#39;,extraLarge:&#39;/s3/fhphotos/CIRD-72K5-G6H6_ROM_3.jpg?width=1000&amp;height=1000&amp;mode=max&#39;,full:&#39;http://fhphotos.s3-website-us-east-1.amazonaws.com/CIRD-72K5-G6H6_ROM_3.
jpg&#39;},{order:10,thumb:&#39;/s3/fhphotos/CIRD-72K5-G6H6_DET_3.jpg?width=200&amp;height=200&amp;mode=crop&#39;,large:&#39;/s3/fhphotos/CIRD-72K5-G6H6_DET_3.jpg?width=600&amp;height=600&amp;mode=max&#39;,extraLarge:&#39;/s3/fhphotos/CIRD-72K5-G6H6_DET_3.jpg?width=1000&amp;height=1000&amp;mode=max&#39;,full:&#39;http://fhphotos.s3-website-us-east-1.amazonaws.com/CIRD-72K5-G6H6_DET_3.jpg&#39;},{order:11,thumb:&#39;/s3/fhphotos/CIRD-72K5-G6H6_ROM_4.jpg?width=200&amp;height=200&amp;mode=crop&#39;,large:&#39;/s3/fhphotos/CIRD-72K5-G6H6_ROM_4.jpg?width=600&amp;height=600&amp;mode=max&#39;,extraLarge:&#39;/s3/fhphotos/CIRD-72K5-G6H6_ROM_4.jpg?width=1000&amp;height=1000&amp;mode=max&#39;,full:&#39;http://fhphotos.s3-website-us-east-1.amazonaws.com/CIRD-72K5-G6H6_ROM_4.jpg&#39;}],priceHtml:&#39;$520.00&#39;,itemNumber:&#39;CIRD-72K5-G6H6&#39;,name:&#39;Brooks Lounge Chair-Ebony, Blk Wsh Weath&#39;,availableForImmediateShipment:true,isNew:false,isCloseout:false},{attributeValueIds:[12711,12713],description:&#39;Our take on the classic Adirondack emphasizes comfort with green, stonewashed cotton canvas cushioning. 
Wire-brushed oak is hand-distressed for a naturally weathered patina.&#39;,dimensions:&#39;W: 27.75&quot; H: 29&quot; D: 34.5&quot;&#39;,availabilityDescription:&#39;&lt;strong>Quantity in Stock: &lt;/strong>&lt;span >147&lt;/span>&lt;br />&lt;strong>More on the Way: &lt;/strong>&lt;span >Yes&lt;/span>&lt;br />&lt;strong>Estimated Arrival Date: &lt;/strong>&lt;span >1 to 2 weeks&lt;/span>&#39;,colors:[&#39;Distressed Washed Old Oak&#39;,&#39;Stonewash Dark Green&#39;],weightPounds:45.0,volumeCubicFeet:18.72,images:[{order:1,thumb:&#39;/s3/fhphotos/CIRD-72K6-H9_PRM_1.jpg?width=200&amp;height=200&amp;mode=crop&#39;,large:&#39;/s3/fhphotos/CIRD-72K6-H9_PRM_1.jpg?width=600&amp;height=600&amp;mode=max&#39;,extraLarge:&#39;/s3/fhphotos/CIRD-72K6-H9_PRM_1.jpg?width=1000&amp;height=1000&amp;mode=max&#39;,full:&#39;http://fhphotos.s3-website-us-east-1.amazonaws.com/CIRD-72K6-H9_PRM_1.jpg&#39;},{order:2,thumb:&#39;/s3/fhphotos/CIRD-72K6-H9_ROM_1.jpg?width=200&amp;height=200&amp;mode=crop&#39;,large:&#39;/s3/fhphotos/CIRD-72K6-H9_ROM_1.jpg?width=600&amp;height=600&amp;mode=max&#39;,extraLarge:&#39;/s3/fhphotos/CIRD-72K6-H9_ROM_1.jpg?width=1000&amp;height=1000&amp;mode=max&#39;,full:&#39;http://fhphotos.s3-website-us-east-1.amazonaws.com/CIRD-72K6-H9_ROM_1.jpg&#39;},{order:3,thumb:&#39;/s3/fhphotos/CIRD-72K6-H9_ROM_2.jpg?width=200&amp;height=200&amp;mode=crop&#39;,large:&#39;/s3/fhphotos/CIRD-72K6-H9_ROM_2.jpg?width=600&amp;height=600&amp;mode=max&#39;,extraLarge:&#39;/s3/fhphotos/CIRD-72K6-H9_ROM_2.jpg?width=1000&amp;height=1000&amp;mode=max&#39;,full:&#39;http://fhphotos.s3-website-us-east-1.amazonaws.com/CIRD-72K6-H9_ROM_2.jpg&#39;},{order:4,thumb:&#39;/s3/fhphotos/CIRD-72K6-H9_DET_1.jpg?width=200&amp;height=200&amp;mode=crop&#39;,large:&#39;/s3/fhphotos/CIRD-72K6-H9_DET_1.jpg?width=600&amp;height=600&amp;mode=max&#39;,extraLarge:&#39;/s3/fhphotos/CIRD-72K6-H9_DET_1.jpg?width=1000&amp;height=1000&amp;mode=max&#39;,full:&#39;http://fhphotos.s3-website-us-east-1.amazonaws.com/
CIRD-72K6-H9_DET_1.jpg&#39;},{order:5,thumb:&#39;/s3/fhphotos/CIRD-72K6-H9_DET_2.jpg?width=200&amp;height=200&amp;mode=crop&#39;,large:&#39;/s3/fhphotos/CIRD-72K6-H9_DET_2.jpg?width=600&amp;height=600&amp;mode=max&#39;,extraLarge:&#39;/s3/fhphotos/CIRD-72K6-H9_DET_2.jpg?width=1000&amp;height=1000&amp;mode=max&#39;,full:&#39;http://fhphotos.s3-website-us-east-1.amazonaws.com/CIRD-72K6-H9_DET_2.jpg&#39;},{order:6,thumb:&#39;/s3/fhphotos/CIRD-72K6-H9_BCK_1.jpg?width=200&amp;height=200&amp;mode=crop&#39;,large:&#39;/s3/fhphotos/CIRD-72K6-H9_BCK_1.jpg?width=600&amp;height=600&amp;mode=max&#39;,extraLarge:&#39;/s3/fhphotos/CIRD-72K6-H9_BCK_1.jpg?width=1000&amp;height=1000&amp;mode=max&#39;,full:&#39;http://fhphotos.s3-website-us-east-1.amazonaws.com/CIRD-72K6-H9_BCK_1.jpg&#39;},{order:7,thumb:&#39;/s3/fhphotos/CIRD-72K6-H9_FRT_1.jpg?width=200&amp;height=200&amp;mode=crop&#39;,large:&#39;/s3/fhphotos/CIRD-72K6-H9_FRT_1.jpg?width=600&amp;height=600&amp;mode=max&#39;,extraLarge:&#39;/s3/fhphotos/CIRD-72K6-H9_FRT_1.jpg?width=1000&amp;height=1000&amp;mode=max&#39;,full:&#39;http://fhphotos.s3-website-us-east-1.amazonaws.com/CIRD-72K6-H9_FRT_1.jpg&#39;},{order:8,thumb:&#39;/s3/fhphotos/CIRD-72K6-H9_SID_1.jpg?width=200&amp;height=200&amp;mode=crop&#39;,large:&#39;/s3/fhphotos/CIRD-72K6-H9_SID_1.jpg?width=600&amp;height=600&amp;mode=max&#39;,extraLarge:&#39;/s3/fhphotos/CIRD-72K6-H9_SID_1.jpg?width=1000&amp;height=1000&amp;mode=max&#39;,full:&#39;http://fhphotos.s3-website-us-east-1.amazonaws.com/CIRD-72K6-H9_SID_1.jpg&#39;},{order:9,thumb:&#39;/s3/fhphotos/CIRD-72K6-H9_ROM_3.jpg?width=200&amp;height=200&amp;mode=crop&#39;,large:&#39;/s3/fhphotos/CIRD-72K6-H9_ROM_3.jpg?width=600&amp;height=600&amp;mode=max&#39;,extraLarge:&#39;/s3/fhphotos/CIRD-72K6-H9_ROM_3.jpg?width=1000&amp;height=1000&amp;mode=max&#39;,full:&#39;http://fhphotos.s3-website-us-east-1.amazonaws.com/CIRD-72K6-H9_ROM_3.jpg&#39;},{order:10,thumb:&#39;/s3/fhphotos/CIRD-72K6-H9_DET_3.jpg?width=200&amp;height=200&amp;
mode=crop&#39;,large:&#39;/s3/fhphotos/CIRD-72K6-H9_DET_3.jpg?width=600&amp;height=600&amp;mode=max&#39;,extraLarge:&#39;/s3/fhphotos/CIRD-72K6-H9_DET_3.jpg?width=1000&amp;height=1000&amp;mode=max&#39;,full:&#39;http://fhphotos.s3-website-us-east-1.amazonaws.com/CIRD-72K6-H9_DET_3.jpg&#39;}],priceHtml:&#39;$290.00&#39;,itemNumber:&#39;CIRD-72K6-H9&#39;,name:&#39;Brooks Lounge Chair-Stonewsh Drk Green&#39;,availableForImmediateShipment:true,isNew:false,isCloseout:false}],activeItemNumber:&#39;CIRD-72K5-G6H6&#39;,priceDescription:&#39;Wholesale Price&#39;} }"></div>

?

Matt Leung

1 Answer

It is not entirely clear where this HTML string comes from or exactly what you want to extract, but for the Beautiful Soup part you simply need:

from bs4 import BeautifulSoup

soup = BeautifulSoup(s, 'html.parser')   # name the parser explicitly
text = soup.div['data-bind']

where s is the string in your question. We first get hold of the 'div' tag before getting the 'data-bind' attribute.
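
On the entity-decoding step from the question: Beautiful Soup decodes character references in attribute values while it parses, so text above should already contain plain quotes and ampersands. If you only have the raw, still-encoded string, the standard library's html.unescape does the same job. A minimal sketch, where raw is a short fragment copied from the question purely for illustration:

```python
from html import unescape

# A short, still-encoded fragment of the data-bind value from the question.
raw = (
    "extraLarge:&#39;/s3/fhphotos/CIRD-72K6-H9_SID_1.jpg"
    "?width=1000&amp;height=1000&amp;mode=max&#39;"
)

# unescape() turns &#39; into ' and &amp; into & in a single pass.
decoded = unescape(raw)
print(decoded)
```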

The format confuses me: it resembles both JSON and a Python dictionary, but neither of those parsers accepted the input. I guess it's JavaScript? I wrote a quick and dirty brace-counting loop inspired by this question:

nest_lvl = 0
lvl_string = ['']               # one text buffer per nesting depth
for char in text:
    if char == '{':
        nest_lvl += 1
        if nest_lvl == len(lvl_string):   # first time at this depth
            lvl_string.append('')

    lvl_string[nest_lvl] += char

    if char == '}':
        # a block just closed: print its depth and text, then reset it
        print(nest_lvl, lvl_string[nest_lvl])
        lvl_string[nest_lvl] = ''
        nest_lvl -= 1

which will hopefully get you started. Again, the parsing part really depends on how general the parser needs to be and what exactly you want to extract.
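
If all you want is the extraLarge URL from the question, a regular expression may be enough and avoids parsing the whole structure. This is a different technique from the loop above, offered only as a sketch. It assumes the value has already been entity-decoded and that the URLs never contain a single quote. The sample text is a shortened stand-in for the real attribute value, and the second image object is abbreviated.

```python
import re

# Shortened, already-decoded stand-in for the data-bind value in the question.
text = (
    "images:[{order:8,"
    "thumb:'/s3/fhphotos/CIRD-72K6-H9_SID_1.jpg?width=200&height=200&mode=crop',"
    "large:'/s3/fhphotos/CIRD-72K6-H9_SID_1.jpg?width=600&height=600&mode=max',"
    "extraLarge:'/s3/fhphotos/CIRD-72K6-H9_SID_1.jpg?width=1000&height=1000&mode=max',"
    "full:'http://fhphotos.s3-website-us-east-1.amazonaws.com/CIRD-72K6-H9_SID_1.jpg'},"
    "{order:9,"
    "extraLarge:'/s3/fhphotos/CIRD-72K6-H9_ROM_3.jpg?width=1000&height=1000&mode=max'}]"
)

# Capture whatever sits between the quotes that follow "extraLarge:".
extra_large_urls = re.findall(r"extraLarge:'([^']*)'", text)

# Keep only the image whose file name matches the one asked about.
wanted = [url for url in extra_large_urls if "CIRD-72K6-H9_SID_1" in url]
print(wanted)
```

The pattern is tied to this exact key name and quoting style, so it would need adjusting if the page changes how it serializes the data.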

oystein