This largely depends on your shell and whether it can read up to a delimiter (';' in this case). If you have bash or a similar shell, the solution is trivial with read -d and substring removal:
#!/bin/bash

while read -d ';' -r line; do
    ((${#line} >= 12)) && [[ $line =~ http ]] &&
        printf "%s\n" "${line##*/}"
done < "$1"
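To see what the substring removal itself does, here is a minimal sketch (the sample URL is made up): ${var##*/} removes the longest prefix matching */, leaving only the filename after the last slash.

```shell
#!/bin/sh
# Hypothetical sample URL; ${var##*/} strips the longest prefix
# matching '*/', i.e. everything through the last slash.
url='http://staging.example.nl/media/catalog/product/10_vilten_tas_2b_2.jpg'
printf '%s\n' "${url##*/}"    # -> 10_vilten_tas_2b_2.jpg
```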
If you are limited to POSIX shell, the solution takes quite a bit more work, and you will rely on sed to parse each URL once it is isolated. Isolating each URL from a line containing multiple URLs separated by semicolons is a bit tricky: you basically have to inchworm down each line, reading one character at a time, which is slow on large files. (Note that expr length, expr index and expr substr are GNU extensions, so this script needs GNU expr rather than a strictly POSIX one.) The following validates that each candidate URL begins with http:
#!/bin/sh

url=""
while read -r line; do
    len=`expr length "$line"`
    urlstart=`expr index "$line" "h"`
    line=`expr substr "$line" "$urlstart" "$len"`
    while [ "$len" -gt 0 ]; do
        c=`expr substr "$line" 1 1`
        if [ "x$c" = "x;" ]; then
            if [ `expr length "$url"` -ge 12 ]; then
                # printf "url: %s\n" "$url"
                ulen=`expr length "$url"`
                urlstart=`expr index "$url" "h"`
                if [ "$urlstart" -gt 0 ]; then
                    if [ "$urlstart" -gt 1 ]; then
                        url=`expr substr "$url" "$urlstart" "$ulen"`
                    fi
                    urlflag=0
                    while [ `expr substr "$url" 1 4` != http ]; do
                        url=`expr substr "$url" 2 "$ulen"`
                        urlstart=`expr index "$url" "h"`
                        if [ "$urlstart" -eq 0 ]; then
                            urlflag=1
                            break
                        fi
                        url=`expr substr "$url" "$urlstart" "$ulen"`
                        ulen=`expr length "$url"`
                        if [ "$ulen" -le 12 ]; then
                            urlflag=1
                            break
                        fi
                    done
                    if [ "$urlflag" -ne 1 ]; then
                        if [ `expr substr "$url" 1 4` = http ]; then
                            echo "$url" | sed -e 's/http.*\///'
                        fi
                    fi
                fi
            fi
            url=""
        else
            url="$url$c"
        fi
        line=`expr substr "$line" 2 "$len"`
        len=`expr length "$line"`
    done
done <"$1"
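For completeness, the same job can be done in strictly POSIX shell without expr at all: parameter expansion (${var%%;*}, ${var#*;}) handles both the semicolon splitting and the trimming, so nothing is spawned per character. This is only a sketch, not part of the original answer — the function name is mine, and it assumes the same >= 12 length cutoff:

```shell
#!/bin/sh
# Pure-POSIX sketch using only parameter expansion.
# extract_jpg_names is a hypothetical name; it reads the file on stdin.
extract_jpg_names() {
    while IFS= read -r line; do
        rest=$line
        while [ -n "$rest" ]; do
            field=${rest%%;*}               # text up to the first ';'
            case $rest in
                *\;*) rest=${rest#*;} ;;    # drop the field just taken
                *)    rest='' ;;            # last field on the line
            esac
            case $field in
                *http*)                     # keep from the first 'http' on
                    field=http${field#*http}
                    [ "${#field}" -ge 12 ] && printf '%s\n' "${field##*/}"
                    ;;
            esac
        done
    done
}
```

Run it as extract_jpg_names < dat/httppaths.txt; on the sample file it should print the same three filenames.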
If you can ensure that only fields longer than some constant are URLs, then you can dramatically improve the performance of the POSIX solution by not searching for, and validating, an http at the start of each string. To parse the URLs based on length alone, something similar to the following will work:
#!/bin/sh

while read -r line; do
    # printf "\n%s\n\n" "$line"    # uncomment to echo each input line
    len=`expr length "$line"`
    sidx=`expr index "$line" ";"`
    while [ "$len" -gt 0 ]; do
        if [ "$sidx" -gt 0 ]; then
            end=$((sidx - 1))    # 'let' is bash/ksh; $(( )) is POSIX
            str=`expr substr "$line" 1 "$end"`
            slen=`expr length "$str"`
            if [ "$slen" -gt 12 ]; then
                echo "$str" | sed -e 's/^.*\///'
            fi
        else
            if [ "$len" -gt 12 ]; then
                echo "$line" | sed -e 's/^.*\///'
            fi
            break
        fi
        start=$((sidx + 1))
        line=`expr substr "$line" "$start" "$len"`
        len=`expr length "$line"`
        sidx=`expr index "$line" ";"`
    done
done <"$1"
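If the length heuristic alone is acceptable, the same split-and-filter can also be written as a single awk pass (awk is POSIX as well). This sketch is my own variant, wrapped in a function reading stdin, with the same cutoff of 12:

```shell
#!/bin/sh
# Length-based variant in one awk pass (a sketch; the function name is mine).
# Fields longer than 12 characters are assumed to be URLs, as above.
extract_by_length() {
    awk -F';' '{
        for (i = 1; i <= NF; i++)
            if (length($i) > 12) { sub(/^.*\//, "", $i); print $i }
    }'
}
```

Usage would be extract_by_length < dat/httppaths.txt.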
The length of 12 was simply arrived at as the length of the shortest possible URL for a jpeg file (e.g. http://a.jpg).
In all cases, the results are the same for the example file you have given:
Input
$ cat dat/httppaths.txt
;56.95;Vilten tas met twee hanen in kleur http://staging.tassenmagazijn.nl/\
media/catalog/product/cache/0/image/a94bc919ee025799dd7ec1f1b7884918/1/0/10_vilte\
n_tas_2b_2.jpg;http://staging.tassenmagazijn.nl/media/catalog/product/2/0/10_vilte\
n_tas_2c_1.jpg;http://staging.tassenmagazijn.nl/media/catalog/product/2/0/10_vilte\
n_tas_2d_1.jpg;;;;;;;Fixed;New;Send;True;
Use/Output
$ sh parsehttppath.sh dat/httppaths.txt
10_vilten_tas_2b_2.jpg
10_vilten_tas_2c_1.jpg
10_vilten_tas_2d_1.jpg