I have HTML containing data that I am trying to extract. I am working in bash, and since bash cannot parse HTML on its own, I am piping the HTML into pup (as recommended here on Stack Overflow). With pup I extract part of the structure, but I am left with a large JSON document containing data I don't need, so I then run sed commands to delete the lines I do not require. I am trying to find a way to use jq to select only the data I need, so that I no longer have to run sed to delete unwanted lines.
I run this command:
cat test.html | pup 'div.scene json{}' > out.json
It generates the following:
[
{
"children": [
{
"children": [
{
"class": "icon-new active",
"tag": "div"
},
{
"children": [
{
"children": [
{
"alt": "Album Title - Artist Name - 1",
"class": "lazy image-under",
"data-src": "",
"tag": "img",
"title": "Album Title"
},
{
"alt": "Album Title - Artist Name - 2",
"class": "lazy image-under",
"data-src": "",
"tag": "img",
"title": "Album Title"
},
{
"alt": "Album Title - Artist Name - 3",
"class": "lazy image-under",
"data-src": "",
"tag": "img",
"title": "Album Title"
},
{
"alt": "Album Title - Artist Name - 4",
"class": "lazy image-under",
"data-src": "",
"tag": "img",
"title": "Album Title"
},
{
"alt": "Album Title - Artist Name - 5",
"class": "lazy image-under",
"data-src": "",
"tag": "img",
"title": "Album Title"
},
{
"tag": "span"
},
{
"tag": "span"
},
{
"tag": "span"
},
{
"tag": "span"
},
{
"class": "last",
"tag": "span"
}
],
"class": "sample-picker clearfix",
"data-trackid": "bhangra-tracking-id",
"href": "/bhangra/album/view/2842847/title-of-album/",
"tag": "a",
"title": "Album Title"
}
],
"class": "card-overlay",
"tag": "div"
},
{
"children": [
{
"alt": "Album Title",
"class": "lazy card-main-img",
"data-src": "",
"tag": "img",
"title": "Album Title"
}
],
"data-trackid": "bhangra-tracking-id ",
"href": "/bhangra/album/view/2842847/title-of-album/",
"tag": "a",
"title": "Album Title"
}
],
"class": "card-image",
"tag": "div"
},
{
"children": [
{
"children": [
{
"data-trackid": "scene-card-info-title Album Title ",
"href": "/bhangra/album/view/2842847/title-of-album/",
"tag": "a",
"text": "Album Title",
"title": "Album Title"
}
],
"class": "scene-card-title",
"tag": "div"
},
{
"children": [
{
"data-trackid": "scene-card-model name Artist Name modelid=1111 ",
"href": "/bhangra/profile/view/2842847/artist-name/",
"tag": "a",
"text": "Artist Name",
"title": "Artist Name"
}
],
"class": "model-names",
"tag": "div"
},
{
"tag": "time",
"text": "September 08, 2018"
},
{
"children": [
{
"children": [
{
"class": "label-left-box",
"tag": "span",
"text": "Website Name"
},
{
"class": "label-text",
"tag": "span",
"text": "Website URL"
}
],
"class": "collection label-small",
"data-trackid": "scene-card-collection",
"href": "/bhangra/main/id/url/",
"tag": "a",
"title": "Website URL"
},
{
"class": "label-hd ",
"tag": "span"
},
{
"children": [
{
"children": [
{
"class": "icons like-icon",
"tag": "span"
},
{
"class": "like-amount",
"tag": "var",
"text": "0"
}
],
"class": "likes",
"tag": "span"
},
{
"children": [
{
"class": "icons dislike-icon",
"tag": "span"
},
{
"class": "dislike-amount",
"tag": "var",
"text": "0"
}
],
"class": "dislikes",
"tag": "span"
}
],
"class": "label-rating",
"tag": "span"
}
],
"class": "bhangra-information",
"tag": "div"
}
],
"class": "scene-card-info",
"tag": "div"
}
],
"class": "bhangra-card scene ",
"tag": "div"
}
]
I then use jq to return the details I want:
cat out.json | jq '.[] | {"1": .children[1].children[0].children, "2": .children[1].children[1].children, "date": .children[1].children[2].text}'
This returns the following:
{
"1": [
{
"data-trackid": "scene-card-info-title Album Title ",
"href": "/bhangra/album/view/2842847/title-of-album/",
"tag": "a",
"text": "Album Title",
"title": "Album Title"
}
],
"2": [
{
"data-trackid": "scene-card-model name Artist Name modelid=1111 ",
"href": "/bhangra/profile/view/2842847/artist-name/",
"tag": "a",
"text": "Artist Name",
"title": "Artist Name"
}
],
"date": "September 08, 2018"
}
With the above, the next album (Album2) also gets the keys "1" and "2" followed by "date". Because jq prints one separate object per album and every object uses the same keys, the combined output is not a single valid JSON document, and I cannot target the data I want.
To work around this, I run a series of sed commands to remove the lines I don't need from the above.
Below is what I would like my initial jq query to return, but I am unsure how to get this specific data:
{
"1" : {
"album": "Album Title",
"href": "/bhangra/album/view/2842847/title-of-album/",
"artist": "Artist Name",
"date": "September 08, 2018"
},
"2" : {
"album": "Album1 Title",
"href": "/bhangra/album/view/2842847/title-of-album/",
"artist": "Artist1 Name",
"date": "September 08, 2018"
},
"3" : {
"album": "Album2 Title",
"href": "/bhangra/album/view/2842847/title-of-album/",
"artist": "Artist2 Name",
"date": "September 09, 2018"
}
}
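One possible way to get this shape, sketched under the assumption that every card in out.json has exactly the same layout as the sample above (title link at `.children[1].children[0].children[0]`, artist link at `.children[1].children[1].children[0]`, date at `.children[1].children[2]`). The function name `extract_albums` is made up for illustration. `to_entries` on an array yields the zero-based index as `key`, which is shifted by one and turned into a string so that `from_entries` can build the "1", "2", "3" keys:

```shell
#!/usr/bin/env bash
# Sketch only: reshape pup's JSON (read from stdin) into one object keyed "1", "2", ...
# Assumes every card has the same positional layout as the sample in the question.
extract_albums() {
  jq '
    to_entries                        # [{key: 0, value: <card>}, {key: 1, ...}, ...]
    | map({
        key: (.key + 1 | tostring),   # 0-based index -> "1", "2", ...
        value: (.value.children[1] | {
          album:  .children[0].children[0].text,
          href:   .children[0].children[0].href,
          artist: .children[1].children[0].text,
          date:   .children[2].text
        })
      })
    | from_entries                    # back to a single object
  '
}
```

It could then be used as `extract_albums < out.json`, or directly in the pipeline as `pup 'div.scene json{}' < test.html | extract_albums`, which would make the sed step unnecessary if the layout assumption holds.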
UPDATE EDIT 11/09/2018
I have made some slight progress on this. Using the query below I have managed to pull back the data I require, but the results are still separate objects.
cat out.json | jq '.[] | .children[1].children[0].children[], .children[1].children[1].children[], .children[1].children[2] | {WTF: .title, href, text}'
This outputs the following, which gets me slightly closer to the desired result shown in the previous example.
{
"WTF": "Album Title",
"href": "/bhangra/album/view/2842847/title-of-album/",
"text": "Album Title"
}
{
"WTF": "Artist Name",
"href": "/bhangra/profile/view/2842847/artist-name/",
"text": "Artist Name"
}
{
"WTF": null,
"href": null,
"text": "September 08, 2018"
}
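The three separate objects above could be merged into one object per card by building each field inside a single object constructor. The sketch below also looks the nodes up by `class` (and the date by its `time` tag) instead of by position, which may be less fragile if the nesting changes. It assumes the class names `scene-card-title` and `model-names` appear exactly as in the sample, with no extra whitespace. The names `extract_by_class` and `first_by_class` are hypothetical helpers, not part of jq or pup:

```shell
#!/usr/bin/env bash
# Sketch only: build one flat object per card, finding nodes by class/tag.
# Reads pup's JSON array from stdin and prints a JSON array.
extract_by_class() {
  jq '
    # First descendant object whose "class" is exactly $c (hypothetical helper).
    def first_by_class($c):
      first(.. | objects | select(.class? == $c));

    # Note: a card that lacks any of these nodes produces no object at all.
    map({
      album:  (first_by_class("scene-card-title") | .children[0].text),
      href:   (first_by_class("scene-card-title") | .children[0].href),
      artist: (first_by_class("model-names")      | .children[0].text),
      date:   (first(.. | objects | select(.tag? == "time")) | .text)
    })
  '
}
```

Running `extract_by_class < out.json` would give an array with one `{album, href, artist, date}` object per card. Since the result is a single array, it is one valid JSON document rather than a stream of objects.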