I'm trying to further parse an output file I generated using an additional grep command. The code that I'm currently using is:

#!/bin/bash

# fetches the links of the movie's imdb pages for a given actor

# fullname="USER INPUT"
read -p "Enter fullname: " fullname

if [ "$fullname" = "Charlie Chaplin" ]; then
  code="nm0000122"
else
  code="nm0000050"
fi


curl "https://www.imdb.com/name/$code/#actor" |
  grep -Eo 'href="/title/[^"]*' |
  sed 's#^.*href=\"/#https://www.imdb.com/#g' |
  sort -u > imdb_links.txt

# Parse each link in the links file and fetch the details for each movie,
# followed by a cleaning step.
for i in $(cat imdb_links.txt) 
do 
   curl $i | 
   html2text | 
   sed -n '/Sign_In/,$p'|  
   sed -n '/YOUR RATING/q;p' | 
   head -n-1 | 
   tail -n+2 
done > imdb_all.txt

The sample generated output is:

EN
⁰
    * Fully supported
    * English (United States)
    * Partially_supported
    * Français (Canada)
    * Français (France)
    * Deutsch (Deutschland)
    * हिंदी (भारत)
    * Italiano (Italia)
    * Português (Brasil)
    * Español (España)
    * Español (México)
****** Duck Soup ******
    * 19331933
    * Not_RatedNot Rated
    * 1h 9m
IMDb RATING
7.8/10

How do I change the code to further parse the output so that it keeps only the data from the movie title up to the IMDb rating (in this case, from the line containing the title 'Duck Soup' to the end)?

vishal_P
  • Run your code through http://shellcheck.net -- it has more problems than just the one you ask about. See [DontReadLinesWithFor](https://mywiki.wooledge.org/DontReadLinesWithFor) also. – Charles Duffy Apr 11 '22 at 18:55
  • Your command with `uniq` does not work properly; you should use `sort -u` instead. The whole line can be rewritten like this: `curl "https://www.imdb.com/name/$code/#actor" | grep -Eo 'href="/title/[^"]*' | sed 's#^.*href=\"/#https://www.imdb.com/#g' | sort -u > imdb_links.txt` – SergA Apr 11 '22 at 19:11
  • @SergA, thank you for the edit. Can you also help me with editing the sed line that uses 'Sign in', I'm trying to filter out the lines before the line that has the movie title in it. – vishal_P Apr 11 '22 at 19:21
  • (f/e, the string comparison should be `if [ "$fullname" = "Charlie Chaplin" ]; then` -- either the square brackets or the `test` command needs to be used, and the parameter expansion needs to be quoted, and there need to be spaces around the `=`, and for best compatibility it should be `=` not `==`) – Charles Duffy Apr 11 '22 at 19:29
  • @CharlesDuffy Can you please check if I've corrected it properly? also can you help me with the sed line that uses 'Sign in', I'm trying to filter out lines before the line that has the movie title in it. – vishal_P Apr 11 '22 at 19:38
  • It's just `[`, not `$[`, and you need the spaces between the `[` and `]` and arguments. `[` is a **command**, not a piece of syntax; you need spaces between it and its arguments just like you need spaces between any other shell command and its arguments. – Charles Duffy Apr 11 '22 at 19:41
  • As for the sed line, if I were going to help with that I'd be adding an answer rather than comments. I categorically disagree with using syntax-unaware tools to parse HTML, so I'm not willing to help someone do it. If you want to do it _right_, you should be using an HTML-aware toolchain. – Charles Duffy Apr 11 '22 at 19:42
  • (Python's `lxml.html` is a great choice; whereas if you're trying to stick with shell, there's `xmllint --html --xmlout` to convert into a format where XML-centric shell tools -- xmlstarlet, etc -- work. Trying to use `sed` to parse HTML or JSON is innately fragile -- next time IMDB reformats their HTML just a little your code is liable to break, even if the new file is semantically identical to the old one). – Charles Duffy Apr 11 '22 at 19:53
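
The two shell points raised in the comments above (`[` is a command that needs spaces around its arguments, and lines should be read with `while read` rather than `for`) can be sketched roughly as follows. The function names `lookup_code` and `show_lines` are made up for illustration and are not part of the original script; the two IDs are the ones used in the question.

```shell
# Hypothetical helper: map a full name to one of the two IMDb name IDs
# from the question. Note the spaces around "[", "=", and "]", and the
# quoted expansion.
lookup_code() {
  local fullname=$1
  if [ "$fullname" = "Charlie Chaplin" ]; then
    printf '%s\n' "nm0000122"
  else
    printf '%s\n' "nm0000050"
  fi
}

# Hypothetical helper: read stdin line by line. Unlike `for i in $(cat file)`,
# this does not split lines on whitespace or expand glob characters.
show_lines() {
  local line
  while IFS= read -r line; do
    printf '<%s>\n' "$line"
  done
}

lookup_code "Charlie Chaplin"
printf '%s\n' 'a b' '*' | show_lines
```

With the input above, `show_lines` should print `<a b>` and `<*>` on separate lines, whereas a `for` loop over `$(cat ...)` would split the first line in two and expand the `*` against the current directory.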

2 Answers

Here is the code:

#!/bin/bash

# fullname="USER INPUT"
read -p "Enter fullname: " fullname

if [ "$fullname" = "Charlie Chaplin" ]; then
  code="nm0000122"
else
  code="nm0000050"
fi

rm -f imdb_links.txt

curl "https://www.imdb.com/name/$code/#actor" |
  grep -Eo 'href="/title/[^"]*' |
  sed 's#^href="#https://www.imdb.com#g' |
  sort -u |
while IFS= read -r link; do
   # uncomment the next line to save links into file:
   #echo "$link" >>imdb_links.txt

   curl "$link" |
     html2text -utf8 |
     sed -n '/Sign_In/,/YOUR RATING/ p' |
     sed -n '$d; /^\*\{6\}.*\*\{6\}$/,$ p'
done >imdb_all.txt
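
To try the filtering step without hitting IMDb, here is a minimal sketch that runs a similar pair of sed range filters over a pasted sample. The function name `extract_movie_block` and the shortened sample text are assumptions for illustration. It also assumes, as the script above does, that html2text renders the title as a line wrapped in six asterisks and that a "YOUR RATING" line follows the rating.

```shell
# Hypothetical helper: keep the lines from the "****** Title ******" heading
# up to, but not including, the first "YOUR RATING" line.
extract_movie_block() {
  # First sed: print from the six-asterisk title line to the end of input.
  # Second sed: drop everything from "YOUR RATING" onwards (if present).
  sed -n '/^\*\{6\}.*\*\{6\}$/,$ p' | sed '/YOUR RATING/,$ d'
}

# Shortened stand-in for the html2text output shown in the question.
sample='EN
    * Fully supported
****** Duck Soup ******
    * 19331933
    * 1h 9m
IMDb RATING
7.8/10
YOUR RATING
Rate'

printf '%s\n' "$sample" | extract_movie_block
```

For this sample the output should start at `****** Duck Soup ******` and end at `7.8/10`. As the comments on the question point out, this kind of text matching is likely to break whenever the page layout changes.
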
SergA

Please(!) have a look at the following urls on why it's a really bad idea to parse HTML with sed:

The thing you're trying to do can be done with xidel, an HTML/XML/JSON parser, and with just 1 call!
In this example I'll use the IMDb page of Charlie Chaplin as the source.

Extract all 94 "Actor" IMDB movie urls:

$ xidel -s "https://www.imdb.com/name/nm0000122" -e '
  //div[@id="filmo-head-actor"]/following-sibling::div[1]//a/@href
'
/title/tt0061523/?ref_=nm_flmg_act_1
/title/tt0050598/?ref_=nm_flmg_act_2
/title/tt0044837/?ref_=nm_flmg_act_3
[...]
/title/tt0004288/?ref_=nm_flmg_act_94

There's no need to save these to a text-file. Just use -f (--follow) instead of -e and xidel will open all of them.


For the individual movie urls you could parse the HTML to get the text-nodes you want...

$ xidel -s "https://www.imdb.com/title/tt0061523/?ref_=nm_flmg_act_1" -e '
  //h1,
  //div[@class="sc-94726ce4-3 eSKKHi"]/ul/li[1]/span,
  //div[@class="sc-94726ce4-3 eSKKHi"]/ul/li[3],
  (//div[@class="sc-7ab21ed2-2 kYEdvH"])[1]
'
A Countess from Hong Kong
1967
2h
6.0/10

...but with those class-names I'd say that's a rather fragile endeavor. Instead I'd recommend parsing the JSON at the top of the HTML source, inside the <script> node:

$ xidel -s "https://www.imdb.com/title/tt0061523/?ref_=nm_flmg_act_1" -e '
  parse-json(//script[@type="application/ld+json"])/(
    name,
    datePublished,
    duration,
    aggregateRating/ratingValue
  )
'
A Countess from Hong Kong
1967-03-15
PT2H
6

...or to get a similar output as above:

$ xidel -s "https://www.imdb.com/title/tt0061523/?ref_=nm_flmg_act_1" -e '
  parse-json(//script[@type="application/ld+json"])/(
    name,
    year-from-date(date(datePublished)),
    substring(lower-case(duration),3),
    format-number(aggregateRating/ratingValue,"#.0")||"/10"
  )
'
A Countess from Hong Kong
1967
2h
6.0/10
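
If you'd rather keep the raw JSON values and tidy them up in the shell afterwards, the year and duration conversions above could be approximated in plain bash. This is only a sketch: the function names `iso_duration_to_short` and `year_of` are made up, and it assumes the values always look like `PT2H` / `PT1H50M` and `1967-03-15`, as in the output shown earlier.

```shell
# Hypothetical helper: turn an ISO 8601 duration such as "PT1H50M" into
# "1h50m" by stripping the leading "PT" and lower-casing the rest.
iso_duration_to_short() {
  local d=${1#PT}
  printf '%s\n' "$d" | tr '[:upper:]' '[:lower:]'
}

# Hypothetical helper: keep only the year of a YYYY-MM-DD date.
year_of() {
  printf '%s\n' "${1%%-*}"
}

iso_duration_to_short "PT2H"
year_of "1967-03-15"
```

For the two sample values this should print `2h` and `1967`, the same as the xidel expressions above.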

All combined:

$ xidel -s "https://www.imdb.com/name/nm0000122" \
  -f '//div[@id="filmo-head-actor"]/following-sibling::div[1]//a/@href' \
  -e '
    parse-json(//script[@type="application/ld+json"])/(
      name,
      year-from-date(date(datePublished)),
      substring(lower-case(duration),3),
      format-number(aggregateRating/ratingValue,"#.0")||"/10"
    )
  '
A Countess from Hong Kong
1967
2h
6.0/10
A King in New York
1957
1h50m
7.0/10
Limelight
1952
2h17m
8.0/10
[...]
Making a Living
1914
11m
5.5/10
Reino