
I was asked to use wget to download multiple URLs saved in a file and store them in another folder, so I used this command:

wget -E -i url.txt -P ~/Desktop/ProjectM2/data/crawl

but problem number 1: the files have to be named as follows:

1.html
2.html
3.html
..

and I tried many things and I still can't do it.

Problem number 2: I don't know how to convert all these files from .html to .txt in one command using html2text -utf8, while also keeping the numbers:

1.txt
2.txt
3.txt
..

thank you

  • What do you mean by `I tried many things`? What did you try exactly? Can you show an example of what URLs are in url.txt? – marcell Dec 16 '17 at 10:02
  • I tried to use a `for ... in ... do` loop – Dreem AT Dec 16 '17 at 10:36
  • url like http://www.lefigaro.fr/culture/2010/12/31/03004-20101231ARTFIG00461-le-bonheur-est-dans-la-cuisine.php https://www.universalis.fr/encyclopedie/egypte-antique-histoire-l-egypte-pharaonique/ – Dreem AT Dec 16 '17 at 10:37
  • How did you try the `for` loop? Is the order of URLs in url.txt important for the new names? I mean, should 1.html be the first in url.txt, 2.html the second, and so on? The resulting filenames from your wget command would be unpredictable like this; if the order is important, you should go through the url.txt file line by line – marcell Dec 16 '17 at 10:52
  • Can you explain more, please? How can I go from one line to another? – Dreem AT Dec 16 '17 at 12:39

1 Answer


If in your case the order of URLs in url.txt is important, that is, 1.html should contain the data of the first URL, 2.html should correspond to the second URL, and so on, then you can process the URLs one by one.

The following script performs the desired actions for each URL:

#!/bin/bash

# the input file (e.g. url.txt) is the first argument of the script
infile="$1"

# use $HOME instead of ~ here: the tilde is not expanded inside quotes
dest_dir="$HOME/Desktop/ProjectM2/data/crawl"

# create html and txt dirs inside dest_dir
mkdir -p "$dest_dir"/{html,txt}

c=1
while IFS='' read -r url || [[ -n "$url" ]]; do

    echo "Fetch $url into $c.html"
    wget -q -O "$dest_dir/html/$c.html" "$url"

    echo "Convert $c.html to $c.txt"
    html2text -o "$dest_dir/txt/$c.txt" "$dest_dir/html/$c.html"

    c=$(( c + 1 ))

done < "$infile"

The script takes an input file as its first argument, in this case url.txt. It creates two directories (html, txt) under your destination directory ~/Desktop/ProjectM2/data/crawl in order to better organize the resulting files. The URLs are read from url.txt line by line with the help of a while loop (see Read a file line by line). With wget you can specify the desired output filename with the -O option, so you can name each file as you wish, in your case with a sequence number. The -q option suppresses wget's messages on the command line. With html2text you can specify the output file using -o.
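To use it, save the script into a file, say fetch.sh (the name is just an example), make it executable, and run it with the path of your URL list as the first argument:

chmod +x fetch.sh
./fetch.sh path/to/url.txt

The downloads then end up as ~/Desktop/ProjectM2/data/crawl/html/1.html, 2.html, ... and the converted files as ~/Desktop/ProjectM2/data/crawl/txt/1.txt, 2.txt, ...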

  • I can't understand: where did you put the path of the file url.txt? – Dreem AT Dec 16 '17 at 16:53
  • You have to save the code from above into a file, for example fetch.sh. Then make it executable. Then you can run the script: ./fetch.sh path/to/url.txt. The infile="$1" expression in the code does the magic, that is, it takes the first argument of the call and uses it as the infile variable. – marcell Dec 16 '17 at 17:25
  • I tried your script but I found a lot of errors, so I tried this: `#!/bin/bash c=1; while read line; do wget -q -i ~/Bureau/ProjetM2_DreemAT/data/input/url.txt -O ~/Bureau/ProjetM2_DreemAT/data/crawl/$c.html; html2text -o ~/Bureau/ProjetM2_DreemAT/data/crawl/$c.html ~/Bureau/ProjetM2_DreemAT/data/txt/$c.txt; c=$(( c + 1 )); done < ~/Bureau/ProjetM2_DreemAT/data/input/url.txt` Some files were created (1.html, 2.html) but with nothing inside, and the .txt files were not created – Dreem AT Dec 16 '17 at 17:31
  • I posted my answer because it is tested and works. What were the errors you found? – marcell Dec 16 '17 at 17:49
  • Also, what you did in your comment is wrong: you use the `-i` flag with the input list in `wget`, and at the same time you iterate through that same file. – marcell Dec 16 '17 at 18:26
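For reference, a minimal sketch of how the one-liner from that comment could be fixed, keeping the ~/Bureau/ProjetM2_DreemAT paths from the comment (an illustration, not tested against that setup): drop wget's -i flag and fetch only the current line, and pass the .html file to html2text as input with the .txt file as the -o output:

#!/bin/bash

# make sure the output directories exist (paths taken from the comment above)
mkdir -p ~/Bureau/ProjetM2_DreemAT/data/crawl ~/Bureau/ProjetM2_DreemAT/data/txt

c=1
while read -r line; do
    # fetch only the current URL; -i would re-download the whole list on every iteration
    wget -q -O ~/Bureau/ProjetM2_DreemAT/data/crawl/$c.html "$line"
    # html2text takes the output file after -o and the input file last
    html2text -o ~/Bureau/ProjetM2_DreemAT/data/txt/$c.txt ~/Bureau/ProjetM2_DreemAT/data/crawl/$c.html
    c=$(( c + 1 ))
done < ~/Bureau/ProjetM2_DreemAT/data/input/url.txt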