EDIT 1: I'd like to extract video urls and titles from "https://ok.ru/video/c1404844" results using the CLI.

Here's what I've done so far:

The PCRE pattern for each video's relative URL is /video/\d+ (\d needs grep -P; it is not valid ERE), and the absolute URL has the form https://ok.ru$videoRelativeURL

I can use this command to extract the video urls (I use uniq because many video IDs appear 3 times) :

$ curl -s https://ok.ru/video/c1404844 | grep -oP "/video/\d+" | uniq | sed "s|^|https://ok.ru|" | head -5
https://ok.ru/video/1896971373228
https://ok.ru/video/1896971438764
https://ok.ru/video/1896971569836
https://ok.ru/video/1896971635372
https://ok.ru/video/1898415590060
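
Note that `uniq` only collapses adjacent duplicate lines, so it works here only because the repeated IDs happen to be consecutive. Below is a minimal sketch of an order-preserving alternative. The `sample_html` fragment and the `extract_urls` name are illustrative assumptions and are not taken from the real page:

```shell
# Hypothetical fragment standing in for the page source; the real markup differs.
sample_html='<a href="/video/111?st=x">a</a>
<a href="/video/222">b</a>
<img data-url="/video/111">
<a href="/video/222?st=y">b</a>'

# Print each absolute video URL once, in order of first appearance.
extract_urls() {
  grep -oE '/video/[0-9]+' |   # ERE spelling of the \d+ pattern
    awk '!seen[$0]++' |        # drop repeats even when they are not adjacent
    sed 's|^|https://ok.ru|'   # turn relative URLs into absolute ones
}

printf '%s\n' "$sample_html" | extract_urls
```

Against the real page, `curl -s https://ok.ru/video/c1404844 | extract_urls` should behave like the command above, without relying on the duplicates being next to each other.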

Then I tried extracting the video relativeURLs + title with pup.

EDIT 3: I replaced the class name video-card_n ellip with video-card_n.ellip. However, pup only outputs the attribute for the second selector (.video-card_n.ellip), which is strange:

$ curl -s https://ok.ru/video/c1404844 | pup '.video-card_lk attr{href}, .video-card_n.ellip attr{title}' | head -5
Death.in.Paradise.S02E05.WEBRip.x264-ION10
Death.in.Paradise.S02E02.WEBRip.x264-ION10
Death.in.Paradise.S02E04.WEBRip.x264-ION10
Death.in.Paradise.S02E03.WEBRip.x264-ION10
Death.in.Paradise.S02E06.WEBRip.x264-ION10

That didn't work, so I converted the expanded HTML to JSON with this command:

$ curl -s https://ok.ru/video/c1404844 | pup 'json{}' > c1404844.json

Now I want to extract the title from video-card_n ellip and the href from video-card_lk in the resulting JSON file with jq, but I don't know jq well enough.

I'd like jq (or pup) to output a flat file: the URL as the first column and the title as the second column.

EDIT 2 : A big thank you to @peak for his help on jq !

DONE :

$ curl -s https://ok.ru/video/c1404844 | pup 'json{}' | jq -r 'recurse | arrays[] | select(.class == "video-card_lk").href,select(.class == "video-card_n ellip").title' | awk '{url="https://ok.ru"gensub(/[?].*$/,"",1); getline title; print url" # "title}' | head
https://ok.ru/video/1898417425068 # Death.in.Paradise.S02E05.WEBRip.x264-ION10
https://ok.ru/video/1898417359532 # Death.in.Paradise.S02E02.WEBRip.x264-ION10
https://ok.ru/video/1898417293996 # Death.in.Paradise.S02E04.WEBRip.x264-ION10
https://ok.ru/video/1898417228460 # Death.in.Paradise.S02E03.WEBRip.x264-ION10
https://ok.ru/video/1898417162924 # Death.in.Paradise.S02E06.WEBRip.x264-ION10
https://ok.ru/video/1898417097388 # Death.in.Paradise.S02E07.WEBRip.x264-ION10
https://ok.ru/video/1898417031852 # Death.in.Paradise.S02E08.WEBRip.x264-ION10
https://ok.ru/video/1898416966316 # Death.in.Paradise.S02E01.WEBRip.x264-ION10
https://ok.ru/video/1898416769708 # Death.in.Paradise.S07E02.The.Stakes.Are.High.WEBRip.x264-ION10
https://ok.ru/video/1898416704172 # Death.in.Paradise.S07E03.Written.in.Murder.WEBRip.x264-ION10
...
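
The `awk` stage pairs each href line with the title line that follows it. Here is a more portable sketch of that pairing step, using POSIX `sub` instead of the gawk-only `gensub`. The input lines are made-up stand-ins for the `jq` output, and `pair_lines` is an illustrative name:

```shell
# Join alternating href/title lines into "URL # title" records.
pair_lines() {
  awk '{
    sub(/[?].*$/, "")            # strip the query string from the href line
    url = "https://ok.ru" $0     # build the absolute URL
    if ((getline title) > 0)     # the next input line holds the title
      print url " # " title
  }'
}

printf '%s\n' \
  '/video/111?st._aid=VideoState_open_top' 'First sample title' \
  '/video/222?st._aid=VideoState_open_top' 'Second sample title' |
  pair_lines
```

This assumes the href and title lines strictly alternate; if one card lacks a title, every later pair shifts by one line.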
SebMa
  • It may be possible to answer the `jq` portion of your question if you provide a short sample of `my_results__expanded_HTML.json` file. See [minimal reproducible example](https://stackoverflow.com/help/minimal-reproducible-example) for guidance. – jq170727 Apr 17 '20 at 01:06
  • It seems that neither pup nor xidel supports spaces in class name selectors :-( – peak Apr 17 '20 at 06:36
  • @peak `pup` does if you replace spaces in class names by `.` – SebMa Apr 17 '20 at 21:06
  • @jq170727 I do understand, but the file is too big (~12M). Do you know how I can share it? – SebMa Apr 17 '20 at 21:08
  • @SebMa - I think you're basically making my point, but a character such as _ that does not have special significance in pup might be better. – peak Apr 17 '20 at 22:17
  • @jq170727 Please see my EDIT 1 – SebMa Apr 18 '20 at 00:17

2 Answers


After using pup to convert the HTML of the top-level page to JSON, the following jq filter produces 24 pairs, the first two of which are shown under "Output" below:

[ [ .. | arrays[] | select(.class == "video-card_n ellip").title],
  [ .. | arrays[] | select(.class == "video-card_lk").href]]
| transpose

Output


[
  [
    "Замечательная пара, красивая песня и чудесное исполнение! Золотые голоса!",
    "/video/2406311403450?st._aid=VideoState_open_top"
  ],
  [
    "#СидимДома",
    "/video/1675421949619?st._aid=VideoState_open_top"
  ],
  ...
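
To get the flat, URL-first layout the question asks for, the transposed pairs can be emitted as tab-separated lines. The following is a minimal sketch, assuming jq 1.5 or later. The `sample_json` document is a made-up miniature of what `pup 'json{}'` produces (the real output is far larger and may be shaped differently), and `flat_pairs` is an illustrative name. In the filter, `..` emits every value in the document, `arrays` keeps only the arrays, and `[]` iterates over their elements.

```shell
# Hypothetical miniature of the pup JSON; not the real page structure.
sample_json='[{"tag":"div","children":[
  {"tag":"a","class":"video-card_lk","href":"/video/111?st._aid=x"},
  {"tag":"a","class":"video-card_n ellip","title":"First sample title"},
  {"tag":"div","children":[
    {"tag":"a","class":"video-card_lk","href":"/video/222?st._aid=y"},
    {"tag":"a","class":"video-card_n ellip","title":"Second sample title"}]}]}]'

# Collect hrefs and titles separately, pair them by position, print as TSV.
flat_pairs() {
  jq -r '
    [ [ .. | arrays[] | select(.class == "video-card_lk").href
        | "https://ok.ru" + sub("[?].*$"; "") ],
      [ .. | arrays[] | select(.class == "video-card_n ellip").title ] ]
    | transpose[]
    | @tsv'
}

printf '%s\n' "$sample_json" | flat_pairs
```

Pairing by position only works when every card has both an href and a title; otherwise the two lists fall out of step.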
peak
  • If you download the html file after scrolling down to the bottom of the results and save the expanded html with [Save Page WE](https://chrome.google.com/webstore/detail/save-page-we/dhhpefjklgkmgeafimnjhojgjamoafof), you should obtain 585 pairs – SebMa Apr 17 '20 at 21:01
  • Can you explain what is the meaning of `.. | arrays[]` ? – SebMa Apr 17 '20 at 21:40
  • @SebMa - Please check the jq manual. Also, the page scrolls indefinitely. It might be better to post a sample or synopsis of the HTML you think is typical. – peak Apr 17 '20 at 22:18
  • The `array` word is all over the place in the [manual](https://stedolan.github.io/jq/manual/). And I couldn't find this `arrays[]` notation in the [manual](https://stedolan.github.io/jq/manual/) so I tried to replace `arrays[]` by `dummy[]` to test if `arrays` was a keyword or a variable name. I'm lost here, can you help ? – SebMa Apr 17 '20 at 22:31
  • `arrays` appears in the title of a subsection describing the builtins: https://stedolan.github.io/jq/manual/#Builtinoperatorsandfunctions. You can use `builtins` to view the built-in functions. `arrays[]` is equivalent to `arrays | .[]` – peak Apr 17 '20 at 22:39
  • And `.. | arrays[]` is equivalent to `recurse(arrays[])` ? – SebMa Apr 17 '20 at 23:08
  • No. `..` is equivalent to `recurse` (i.e. `recurse/0`). You can check the defs of builtins such as `arrays` by viewing builtin.jq – peak Apr 17 '20 at 23:28

If you want to scrape specific information from an HTML source, there's no need for 5 different tools! Please have a look at xidel. It can do it all.

$ xidel -s https://ok.ru/video/c1404844 -e '
  //div[@data-id]/join(
    (
      div[@class="video-card_img-w"]/a/resolve-uri(substring-before(@href,"?")),
      div[@class="video-card_n-w"]/a
    ),
    " # "
  )
'
https://ok.ru/video/1898417425068 # Death.in.Paradise.S02E05.WEBRip.x264-ION10
https://ok.ru/video/1898417359532 # Death.in.Paradise.S02E02.WEBRip.x264-ION10
https://ok.ru/video/1898417293996 # Death.in.Paradise.S02E04.WEBRip.x264-ION10
https://ok.ru/video/1898417228460 # Death.in.Paradise.S02E03.WEBRip.x264-ION10
https://ok.ru/video/1898417162924 # Death.in.Paradise.S02E06.WEBRip.x264-ION10
https://ok.ru/video/1898417097388 # Death.in.Paradise.S02E07.WEBRip.x264-ION10
https://ok.ru/video/1898417031852 # Death.in.Paradise.S02E08.WEBRip.x264-ION10
https://ok.ru/video/1898416966316 # Death.in.Paradise.S02E01.WEBRip.x264-ION10
https://ok.ru/video/1898416769708 # Death.in.Paradise.S07E02.The.Stakes.Are.High.WEBRip.x264-ION10
https://ok.ru/video/1898416704172 # Death.in.Paradise.S07E03.Written.in.Murder.WEBRip.x264-ION10
[...]
Reino