Shell script is not correctly scraping body text from web page

Question

I am working on a shell script where a user can input the IMDb numeric code of a movie, taken from the movie's page URL on the site (e.g., 0076759 corresponds to "Star Wars: A New Hope"). My intention is that if the user executes the script with bash search_movie 0076759, the output is as follows:

Star Wars: Episode IV - A New Hope (1977)
    Luke Skywalker joins forces with a...[Rest of Plot Summary Text here]

This is my current script below:

#!/usr/bin/bash

# moviedata--Given a movie or TV title, returns a list of matches. If the user
# specifies an IMDb numeric index number, however, returns the synopsis of
# the film instead.

# Remember to install lynx with command: sudo yum install lynx

titleurl="http://www.imdb.com/title/tt"
imdburl="http://www.imdb.com/find?s=tt&exact=true&ref_=fn_tt_ex&q="
tempout="/tmp/moviedata.$$"

# Produce a synopsis of the film.
summarize_film() {    
    grep "<title>" $tempout | sed 's/<[^>]*>//g;s/(more)//'
    grep --color=never -A2 '<h5>Plot:' $tempout | tail -1 | \
    cut -d\< -f1 | fmt | sed 's/^/ /'
    exit 0
}

trap "rm -f $tempout" 0 1 15

if [ $# -eq 0 ] ; then
 echo "Usage: $0 {movie title | movie ID}" >&2
 exit 1
fi

# Checks whether we're asking for a title by IMDb title number
nodigits="$(echo $1 | sed 's/[[:digit:]]*//g')"
if [ $# -eq 1 -a -z "$nodigits" ] ; then
 lynx -source "$titleurl$1/combined" > $tempout
 summarize_film
 exit 0
fi

# It's not an IMDb title number, search for titles.

fixedname="$(echo $@ | tr ' ' '+')" # for the URL
url="$imdburl$fixedname"
lynx -source $imdburl$fixedname > $tempout

# No results:

fail="$(grep --color=never '<h1 class="findHeader">No ' $tempout)"

# If more than one matching title found:

if [ ! -z "$fail" ] ; then
    echo "Failed: no results found for $1"
    exit 1
elif [ ! -z "$(grep '<h1 class="findHeader">Displaying' $tempout)" ] ; then
    grep --color=never '/title/tt' $tempout | \
    sed 's/</\
</g' | \
    grep -vE '(.png|.jpg|>[ ]*$)' | \
    grep -A 1 "a href=" | \
    grep -v '^--$' | \
    sed 's/<a href="\/title\/tt//g;s/<\/a> //' | \
    awk '(NR % 2 == 1) { title=$0 } (NR % 2 == 0) { print title " " $0 }' | \
    sed 's/\/.*>/: /' | \
    sort
fi

exit 0

When I execute the script, it reaches the relevant movie page successfully, but it does not return the plot summary, and it also outputs a mess of website tracker info.

I would greatly appreciate it if I could get some insight into what I'm doing wrong in my script.

which part of `html` page do you want to parse? just `title` or some others ? — Shakiba Moshiri, Mar 15 '21 at 06:37
To parse html use html aware tool. Use xmllint or xmlstarlet. [Do not use regex to parse html](https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454) — KamilCuk, Mar 15 '21 at 08:24

Shakiba Moshiri · Accepted Answer · 2021-03-15T07:22:48.087

First of all, parsing an HTML page with regular expressions is not the right way to do it; using an appropriate HTML parser is a better choice.
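As a rough illustration of the parser route, here is a minimal sketch using xmllint (mentioned in the comments on the question). It assumes xmllint from libxml2 is installed and that the plot sits in a div whose class is exactly "summary_text", as in the page parsed further down. The sample file is a hypothetical stand-in for the saved IMDb page, so nothing is fetched from the network.

```shell
#!/usr/bin/bash
# Hypothetical sample page standing in for the saved IMDb HTML.
sample="$(mktemp)"
cat > "$sample" <<'EOF'
<html><head><title>Star Wars: Episode IV - A New Hope (1977) - IMDb</title></head>
<body><div class="summary_text">
    Luke Skywalker joins forces with a Jedi Knight.
</div></body></html>
EOF

# --html uses libxml2's lenient HTML parser; --xpath evaluates an XPath expression.
# 2>/dev/null hides the parser warnings that real-world markup tends to trigger.
title="$(xmllint --html --xpath 'string(//title)' "$sample" 2>/dev/null)"
# normalize-space() trims the surrounding whitespace and collapses inner runs.
summary="$(xmllint --html --xpath 'normalize-space(//div[@class="summary_text"])' "$sample" 2>/dev/null)"

printf '%s\n    %s\n' "$title" "$summary"
rm -f "$sample"
```

The XPath expressions keep working if attributes are reordered or tags are split across lines, which is where line-based grep and sed pipelines usually break.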

Second, your script can be much simpler:

Have a list of the tags you want to parse.
Loop over those tags to extract the text.
Do whatever you want with the text you have saved.

Here is a simple one-liner to parse <title> and <script>:

for tag in title script; do lynx -source "http://www.imdb.com/title/tt0076759" | perl -lne "/(?<=<$tag>).*?(?=<)/ && print $&"; done

output:

Star Wars: Episode IV - A New Hope (1977) - IMDb
(function(t){ (t.events = t.events || {})["csm_head_pre_title"] = new Date().getTime(); })(IMDbTimer);
(function(t){ (t.events = t.events || {})["csm_head_post_title"] = new Date().getTime(); })(IMDbTimer);
(function(t){ (t.events = t.events || {})["csm_head_pre_icon"] = new Date().getTime(); })(IMDbTimer);
(function(t){ (t.events = t.events || {})["csm_head_post_icon"] = new Date().getTime(); })(IMDbTimer);
(function(t){ (t.events = t.events || {})["csm_head_pre_css"] = new Date().getTime(); })(IMDbTimer);
(function(t){ (t.events = t.events || {})["csm_head_post_css"] = new Date().getTime(); })(IMDbTimer);
(function(t){ (t.events = t.events || {})["csm_head_pre_ads"] = new Date().getTime(); })(IMDbTimer);

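Or, using an array:

#!/usr/bin/bash
# Tags whose text we want to extract
html_tag=(title script)
for tag in "${html_tag[@]}"; do
    # $tag is expanded by the shell; $& is left alone and is perl's "last match"
    lynx -source "http://www.imdb.com/title/tt0076759" | \
         perl -lne "/(?<=<$tag>).*?(?=<)/ && print $&"
done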

I used perl here because its regex engine has more features, such as the lookbehind and lookahead used above.

NOTE: if you save the page to disk first and then parse the file, the parsing becomes much simpler. Here is a simple example:

# save on disk
lynx -source "http://www.imdb.com/title/tt0076759" > html
# match those two parts you want
perl -lne '$/=undef; print $& while /(?:(?<=<title>)|(?<="summary_text">))[^<]+/g' html

output:

Star Wars: Episode IV - A New Hope (1977) - IMDb

                Luke Skywalker joins forces with a Jedi Knight, a cocky pilot, a Wookiee and two droids to save the galaxy from the Empire's world-destroying battle station, while also attempting to rescue Princess Leia from the mysterious Darth Vader.
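
To get closer to the layout asked for in the question (title on one line, indented plot below it), the save-then-parse approach could be wrapped in a function like the sketch below. It assumes the page still marks the plot with "summary_text" and the title ends in " - IMDb", as in the output above. The function name is borrowed from the question's script, and the trimming rules are my own choice, not something IMDb guarantees.

```shell
#!/usr/bin/bash
# Print the title and an indented, whitespace-trimmed plot summary
# from a saved HTML file given as the first argument.
summarize_film() {
    local file="$1"
    # -0777 slurps the whole file so a match may span several lines.
    perl -0777 -ne '
        if (/<title>([^<]+)/) {
            (my $title = $1) =~ s/\s+-\s+IMDb\s*$//;   # drop the site suffix
            print "$title\n";
        }
        if (/"summary_text">\s*([^<]+?)\s*</) {
            (my $plot = $1) =~ s/\s+/ /g;              # collapse inner whitespace
            print "    $plot\n";
        }
    ' "$file"
}
```

After saving the page as shown above, calling summarize_film html prints the two parts. If either pattern no longer matches the page, that part is simply skipped, so empty output is a hint that the markup has changed.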
