I'm writing a crawler that produces a .csv file listing the title and URL of every page on a domain. It seems to crawl the site fine, but when it tries to parse the titles I get an error:

line 38: syntax error near unexpected token `done'

Can anyone help me to eliminate the cause of this error?

Here's the script I'm using:

#!/bin/bash  
# Crawls a domain   
# Retrieves all visible URLs and their page titles
# Saves to CSV   

# USAGE:
# Save this script as, say, "spiderbot.sh".
# Then "chmod +x spiderbot.sh".
# Then run the script, passing in the URL of the site you want to crawl, for example ` ./spiderbot.sh http://www.someSite.com `

# Text color variables
txtund=$(tput sgr 0 1) # Underline
txtbld=$(tput bold) # Bold
bldred=${txtbld}$(tput setaf 1) # red   
bldblu=${txtbld}$(tput setaf 4) # blue   
bldgreen=${txtbld}$(tput setaf 2) # green   
bldwht=${txtbld}$(tput setaf 7) # white   
txtrst=$(tput sgr0) # Reset   
info=${bldwht}*${txtrst} # Feedback   
pass=${bldblu}*${txtrst}   
warn=${bldred}*${txtrst}   
ques=${bldblu}?${txtrst}           
printf "%s=== Crawling $1 ===  %s" "$bldgreen" "$txtrst"          

# wget in Spider mode, outputs to wglog file   
# -R switch to ignore specific file types (images, javascript etc.)   
wget --reject-regex "(.*)\?(.*)" --no-check-certificate --spider -r -l inf -w .25 -nc -nd $1 -R bmp,css,gif,ico,jpg,jpeg,js,mp3,mp4,pdf,png,swf,txt,xml,xls,zip 2>&1 | tee wglog v          
printf " %s========================================== " "$bldgreen"   
printf "%s=== Crawl Finished... ===%s " "$bldgreen" "$txtrst"   
printf "%s=== Begin retrieving page titles... ===%s " "$bldgreen" "$txtrst"
printf "%s==========================================  " "$bldgreen"           
printf "%s** Run tail -f $1.csv for progress%s  " "$bldred" "$txtrst"   

# from wglog, grab URLs   
# curl each URL and grep title   
cat wglog | grep '^--' | awk '{print $3}' | sort | uniq | while read url; 
  do {  printf "%s* Retreiving title for: %s$url%s " "$bldgreen" "$txtrst$txtbld" "$txtrst"   printf ""${url}","`curl -# ${url} | sed -n -e 's!.*<title>(.*)</title>.*!1!p'`" " >> $1.csv  printf " "}; 
  done           

# clean up log file   
rm wglog  
exit
Dylan Kinnett
  • The fact that you link out to your script means that this question will become useless for others if you move your script. It's better to post the problem code here. – user1675642 Oct 12 '17 at 20:55
  • (or, rather, to post *the smallest possible code that generates the same problem*, as described in the Help Center article on building a [mcve]) – Charles Duffy Oct 12 '17 at 21:06
  • For me, that is the smallest possible example because I'm new to shell scripts and unsure what I can remove without causing further errors. – Dylan Kinnett Oct 12 '17 at 21:08
  • @DylanKinnett, ...well, it's pretty clear that you don't need any of your `printf`s, for example -- if you take them out, the error message stays the same. And you clearly can't need any code that happens *after* the bug takes place. And if a variable is only used by the `printf`s, then you can't need its assignments. If in doubt as to whether you've added new errors during pruning, consider comparing output from http://shellcheck.net/ before and after. – Charles Duffy Oct 12 '17 at 21:08
  • BTW, `awk '/^--/ { print $3 }'` does all the work of `cat` *and* `grep` *and* your preexisting `awk`. And you might also consider `sort -u`, to avoid the need for a separate `uniq` pipeline component. – Charles Duffy Oct 12 '17 at 21:12
  • ...all that said, if you're trying to extract text from XML, better to use tools that know how to parse it (such as XMLStarlet) rather than trying to roll-your-own with `sed`. – Charles Duffy Oct 12 '17 at 21:13
  • I'd also strongly suggest `$()` instead of backticks for command substitution -- much less prone to causing syntax errors when nesting. – Charles Duffy Oct 12 '17 at 21:13
  • Thanks for the useful advice, everyone. I clearly have more to learn, but I was able to get it to work well enough. The problems, for me, were that there were missing line breaks in my source, and that the version of `sed` that Macs use differs in some important ways from the one Linux uses. – Dylan Kinnett Oct 13 '17 at 19:04
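
The pipeline simplification suggested in the comments above could look roughly like this. It is only a sketch: `list_urls` is a hypothetical helper name, and it assumes, as the original pipeline does, that wget's request lines start with `--` and carry the URL in the third whitespace-separated field.

```shell
#!/bin/bash
# Print the unique URLs found in a wget log file.
# Assumption: request lines start with "--" and hold the URL in field 3,
# which is what the original cat | grep | awk | sort | uniq chain relied on.
list_urls() {
  # awk filters the lines and picks the field; sort -u sorts and de-duplicates.
  awk '/^--/ { print $3 }' "$1" | sort -u
}
```

In the script, `list_urls wglog | while read -r url; do ...; done` would then take the place of the five-stage pipeline.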

1 Answer

You need a semicolon before the closing brace of your `do { ... }` block,

i.e. `do { ...; }`
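
A minimal sketch of that rule, assuming bash: when a brace group is written on one line, the last command inside it must be terminated by `;` (or a newline) before the closing `}`. The function name `echo_each` is made up for illustration.

```shell
#!/bin/bash
# Echo each line read from stdin with a prefix.
# The brace group sits on one line, so the printf is terminated with ";"
# before "}", and the group itself is followed by ";" before "done".
echo_each() {
  while read -r item; do { printf 'got: %s\n' "$item"; }; done
}
```

Leaving out the `;` before `}` makes bash treat `}` as an argument to `printf`, so the group is never closed and the parser fails when it reaches `done`. The braces are optional here; `do printf ...; done` behaves the same way.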

user1675642
  • I tried two variants, but neither worked. They were `do { ...; }` and also `do { ...;};` – Dylan Kinnett Oct 12 '17 at 21:07
  • The `;` *is* definitely needed before the `}`, even if that's not the only thing that's wrong. There's other syntax inside that `do` block that looks pretty clearly wrong -- I'm wondering if maybe that code was copied from somewhere in a way that lost all the line breaks present in the original? – Charles Duffy Oct 12 '17 at 21:11
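
Putting the suggestions from this thread together, the title loop might be restructured along these lines. This is a sketch under assumptions, not the original author's final script: `extract_title` and `write_rows` are hypothetical names, the `fetch_cmd` parameter is an invented hook so the loop can be tried without network access, and `sed -E` is assumed to be available (it is accepted by both the BSD `sed` shipped with macOS and current GNU `sed`). It only finds a `<title>` that sits on a single line; a real HTML or XML parser is more robust, as noted in the comments on the question.

```shell
#!/bin/bash
# Print the text of a <title> element that sits on one line of stdin.
# -E selects extended regular expressions, so the ( ) group needs no backslashes.
extract_title() {
  sed -n -E 's!.*<title>(.*)</title>.*!\1!p'
}

# Append one "url","title" CSV row to the file $1 for each URL read from stdin.
# $2 is a hypothetical hook: the command used to fetch a page (default: curl -s).
write_rows() {
  local out="$1"
  local fetch_cmd="${2:-curl -s}"
  local url title
  while read -r url; do
    # $() instead of backticks; each command is on its own line.
    title=$($fetch_cmd "$url" | extract_title)
    printf '"%s","%s"\n' "$url" "$title" >> "$out"
  done
}
```

With one command per line, no brace group is needed inside the loop at all, which sidesteps the missing `;` problem.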