I have HTML containing data that I want to extract. I am working in bash and, since bash cannot parse HTML on its own, I pipe the HTML into pup (as recommended here on Stack Overflow). pup extracts part of the structure, but I am left with a large JSON document full of data I don't need, so I then run sed commands to delete the lines I don't require. I am looking for a way to use jq to select only the data I need, so that the sed step is no longer necessary.

So I run the command:

cat test.html | pup 'div.scene json{}' > out.json

The below is generated.

 [
  {
   "children": [
    {
     "children": [
      {
       "class": "icon-new active",
       "tag": "div"
      },
      {
       "children": [
        {
         "children": [
          {
           "alt": "Album Title - Artist Name - 1",
           "class": "lazy image-under",
           "data-src": "",
           "tag": "img",
           "title": "Album Title"
          },
          {
           "alt": "Album Title - Artist Name - 2",
           "class": "lazy image-under",
           "data-src": "",
           "tag": "img",
           "title": "Album Title"
          },
          {
           "alt": "Album Title - Artist Name - 3",
           "class": "lazy image-under",
           "data-src": "",
           "tag": "img",
           "title": "Album Title"
          },
          {
           "alt": "Album Title - Artist Name - 4",
           "class": "lazy image-under",
           "data-src": "",
           "tag": "img",
           "title": "Album Title"
          },
          {
           "alt": "Album Title - Artist Name - 5",
           "class": "lazy image-under",
           "data-src": "",
           "tag": "img",
           "title": "Album Title"
          },
          {
           "tag": "span"
          },
          {
           "tag": "span"
          },
          {
           "tag": "span"
          },
          {
           "tag": "span"
          },
          {
           "class": "last",
           "tag": "span"
          }
         ],
         "class": "sample-picker clearfix",
         "data-trackid": "bhangra-tracking-id",
         "href": "/bhangra/album/view/2842847/title-of-album/",
         "tag": "a",
         "title": "Album Title"
        }
       ],
       "class": "card-overlay",
       "tag": "div"
      },
      {
       "children": [
        {
         "alt": "Album Title",
         "class": "lazy card-main-img",
         "data-src": "",
         "tag": "img",
         "title": "Album Title"
        }
       ],
       "data-trackid": "bhangra-tracking-id  ",
       "href": "/bhangra/album/view/2842847/title-of-album/",
       "tag": "a",
       "title": "Album Title"
      }
     ],
     "class": "card-image",
     "tag": "div"
    },
    {
     "children": [
      {
       "children": [
        {
         "data-trackid": "scene-card-info-title Album Title ",
         "href": "/bhangra/album/view/2842847/title-of-album/",
         "tag": "a",
         "text": "Album Title",
         "title": "Album Title"
        }
       ],
       "class": "scene-card-title",
       "tag": "div"
      },
      {
       "children": [
        {
         "data-trackid": "scene-card-model name Artist Name modelid=1111 ",
         "href": "/bhangra/profile/view/2842847/artist-name/",
         "tag": "a",
         "text": "Artist Name",
         "title": "Artist Name"
        }
       ],
       "class": "model-names",
       "tag": "div"
      },
      {
       "tag": "time",
       "text": "September 08, 2018"
      },
      {
       "children": [
        {
         "children": [
          {
           "class": "label-left-box",
           "tag": "span",
           "text": "Website Name"
          },
          {
           "class": "label-text",
           "tag": "span",
           "text": "Website URL"
          }
         ],
         "class": "collection label-small",
         "data-trackid": "scene-card-collection",
         "href": "/bhangra/main/id/url/",
         "tag": "a",
         "title": "Website URL"
        },
        {
         "class": "label-hd ",
         "tag": "span"
        },
        {
         "children": [
          {
           "children": [
            {
             "class": "icons like-icon",
             "tag": "span"
            },
            {
             "class": "like-amount",
             "tag": "var",
             "text": "0"
            }
           ],
           "class": "likes",
           "tag": "span"
          },
          {
           "children": [
            {
             "class": "icons dislike-icon",
             "tag": "span"
            },
            {
             "class": "dislike-amount",
             "tag": "var",
             "text": "0"
            }
           ],
           "class": "dislikes",
           "tag": "span"
          }
         ],
         "class": "label-rating",
         "tag": "span"
        }
       ],
       "class": "bhangra-information",
       "tag": "div"
      }
     ],
     "class": "scene-card-info",
     "tag": "div"
    }
   ],
   "class": "bhangra-card scene ",
   "tag": "div"
  }
 ]

I am then using JQ to return some details I want.

 cat out.json | jq '.[] | {"1": .children[1].children[0].children, "2": .children[1].children[1].children, "date": .children[1].children[2].text}'

This returns the following:

 {
   "1": [
     {
       "data-trackid": "scene-card-info-title Album Title ",
       "href": "/bhangra/album/view/2842847/title-of-album/",
       "tag": "a",
       "text": "Album Title",
       "title": "Album Title"
     }
   ],
   "2": [
     {
       "data-trackid": "scene-card-model name Artist Name modelid=1111 ",
       "href": "/bhangra/profile/view/2842847/artist-name/",
       "tag": "a",
       "text": "Artist Name",
       "title": "Artist Name"
     }
   ],
   "date": "September 08, 2018"
 }

With the above, the object for the next album also has the keys "1" and "2" followed by "date". The combined output is therefore not valid as a single JSON document, and I cannot target the data I want because the keys are the same in every object.

To work around this, I am running a series of sed commands to remove the lines I don't need from the above.

Below is what I would like my initial jq query to return, but I am unsure how to get this specific data back.

 { 
   "1" : {
            "album": "Album Title",
            "href": "/bhangra/album/view/2842847/title-of-album/",
            "artist": "Artist Name",
            "date": "September 08, 2018"
   },
   "2" : {
            "album": "Album1 Title",
            "href": "/bhangra/album/view/2842847/title-of-album/",
            "artist": "Artist1 Name",
            "date": "September 08, 2018"
   },
   "3" : {
            "album": "Album2 Title",
            "href": "/bhangra/album/view/2842847/title-of-album/",
            "artist": "Artist2 Name",
            "date": "September 09, 2018"
   }
 }

UPDATE EDIT 11/09/2018

I have made some slight progress on this. Using the query below I have managed to pull back the data I require, but the results are still separate objects.

 cat out.json | jq '.[] | .children[1].children[0].children[], .children[1].children[1].children[], .children[1].children[2] | {WTF: .title, href, text}'

This outputs the following, which is slightly closer to what I want (the last example above).

 {
   "WTF": "Album Title",
   "href": "/bhangra/album/view/2842847/title-of-album/",
   "text": "Album Title"
 }
 {
   "WTF": "Artist Name",
   "href": "/bhangra/profile/view/2842847/artist-name/",
   "text": "Artist Name"
 }
 {
   "WTF": null,
   "href": null,
   "text": "September 08, 2018"
 }
Sukh
  • by converting it to json you are making things more complicated – oguz ismail Sep 10 '18 at 20:34
  • Hmm the original request in html is [here](https://stackoverflow.com/questions/52239216/bash-read-html-find-div-based-on-two-different-variables) which could not be done in html using bash, so i've tried using pup/jq/ – Sukh Sep 10 '18 at 20:56
  • Inspired by this answer, here is how to see the thumbnail given a youtube url `curl $(curl https://www.youtube.com/watch\?v\=86CQq3pKSUw | pup 'meta[property="og:image"] json{}' | jq -r '.[].content') | feh -` – Lucas Alonso May 16 '23 at 00:55

1 Answer

The connection between the input JSON and the JSON that is said to be the desired output seems tenuous, but one way to solve the problem of tagging the objects with sequentially-numbered keys is to use the following function:

def tag(s):
  reduce s as $x ({n:0, o:{}} ;
    .n += 1
    | .o += { (.n|tostring): $x})
  | .o;

Here, s should be a stream of JSON entities, and the result is a single object with keys "1", "2", etc.
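
As a quick check of how tag behaves, the sketch below runs it on a hand-written stream of three strings. The stream and the shell variable name result are illustrative only; -n tells jq not to read any input and -c asks for compact output.

```shell
# Minimal demonstration of tag/1 on an illustrative stream of three strings.
# The comma expression is a single argument (jq separates arguments with ";").
result=$(jq -nc '
  def tag(s):
    reduce s as $x ({n:0, o:{}};
      .n += 1
      | .o += { (.n|tostring): $x })
    | .o;
  tag("a", "b", "c")
')
echo "$result"
```

Each item of the stream ends up under the next counter value, so the number of keys in the result equals the number of items the stream produced.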

So the task now is to produce a stream of the desired objects. Since it's unclear what you want, the following may be taken as illustrative.

{date: first(.. | objects | select(.tag == "time" and has("text")) | .text)} as $date
| tag(.. 
      | objects
      | select(has("title") and (has("children")|not) and .title == "Album Title")
      + $date )

Output

{
  "1": {
    "alt": "Album Title - Artist Name - 1",
    "class": "lazy image-under",
    "data-src": "",
    "tag": "img",
    "title": "Album Title",
    "date": "September 08, 2018"
  },
  "2": {
    "alt": "Album Title - Artist Name - 2",
    "class": "lazy image-under",
    "data-src": "",
    "tag": "img",
    "title": "Album Title",
    "date": "September 08, 2018"
  },
  "3": {
    "alt": "Album Title - Artist Name - 3",
    "class": "lazy image-under",
    "data-src": "",
    "tag": "img",
    "title": "Album Title",
    "date": "September 08, 2018"
  },
  "4": {
    "alt": "Album Title - Artist Name - 4",
    "class": "lazy image-under",
    "data-src": "",
    "tag": "img",
    "title": "Album Title",
    "date": "September 08, 2018"
  },
  "5": {
    "alt": "Album Title - Artist Name - 5",
    "class": "lazy image-under",
    "data-src": "",
    "tag": "img",
    "title": "Album Title",
    "date": "September 08, 2018"
  },
  "6": {
    "alt": "Album Title",
    "class": "lazy card-main-img",
    "data-src": "",
    "tag": "img",
    "title": "Album Title",
    "date": "September 08, 2018"
  },
  "7": {
    "data-trackid": "scene-card-info-title Album Title ",
    "href": "/bhangra/album/view/2842847/title-of-album/",
    "tag": "a",
    "text": "Album Title",
    "title": "Album Title",
    "date": "September 08, 2018"
  }
}
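
If the goal is the album/href/artist/date shape shown in the question, one possible sketch follows. It assumes every card contains a div with class scene-card-info whose descendants include scene-card-title, model-names and a time element, as in the sample. The by_class helper and the cards and result variables are hypothetical names introduced here, the input is a trimmed stand-in for out.json, and the second card is made up (it borrows the names from the question's desired output) just to show the numbering.

```shell
# Trimmed stand-in for out.json: two cards, keeping only the fields used below.
cards='[
  {"tag":"div","class":"bhangra-card scene ","children":[
    {"tag":"div","class":"scene-card-info","children":[
      {"tag":"div","class":"scene-card-title","children":[
        {"tag":"a","href":"/bhangra/album/view/2842847/title-of-album/","text":"Album Title"}]},
      {"tag":"div","class":"model-names","children":[
        {"tag":"a","href":"/bhangra/profile/view/2842847/artist-name/","text":"Artist Name"}]},
      {"tag":"time","text":"September 08, 2018"}]}]},
  {"tag":"div","class":"bhangra-card scene ","children":[
    {"tag":"div","class":"scene-card-info","children":[
      {"tag":"div","class":"scene-card-title","children":[
        {"tag":"a","href":"/bhangra/album/view/2842847/title-of-album/","text":"Album1 Title"}]},
      {"tag":"div","class":"model-names","children":[
        {"tag":"a","href":"/bhangra/profile/view/2842847/artist-name/","text":"Artist1 Name"}]},
      {"tag":"time","text":"September 08, 2018"}]}]}
]'

result=$(printf '%s' "$cards" | jq -c '
  def tag(s):
    reduce s as $x ({n:0, o:{}}; .n += 1 | .o += { (.n|tostring): $x }) | .o;

  # Hypothetical helper: first descendant object whose class equals $c exactly.
  def by_class($c): first(.. | objects | select(.class == $c));

  # One flat object per card, then number the objects with tag.
  tag(.[]
      | by_class("scene-card-info")
      | { album:  (by_class("scene-card-title").children[0].text),
          href:   (by_class("scene-card-title").children[0].href),
          artist: (by_class("model-names").children[0].text),
          date:   first(.children[] | select(.tag == "time") | .text) })
')
echo "$result"
```

Selecting by class rather than by position such as .children[1].children[0] should keep the filter working if pup emits extra sibling elements, though it still depends on the class strings matching exactly.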
peak