2

I want to convert the output of any website fetched with curl to UTF-8 before inserting it into a database.

Usage example:

html="$(curl -iL --compressed "$link")"

##code needed to convert nonUTF8 $html to utf8, preferably without writing to file

## escape characters for insert
html_q="${html//'\'/\\\\}"
html_q="${html_q//"'"/\'}"

## the insert statement
sqlHtml='INSERT INTO `'"${tableHtml}"'` (`html`) VALUES ('"'${html_q}'"');'
mysql -u"$dbUser" -p"$dbPass" -h"$dbHost" -P"$dbPort" -D"$dbName" --default-character-set=utf8 -A <<ENDofMESSAGE
${sqlHtml}
ENDofMESSAGE
Stefan Rogin
  • How do you expect to determine what the character set of the data in `$html` is? You can't convert random garbage to UTF-8 and expect the result to be sensible, you need to know what character set you are converting from. – cdhowie Jul 17 '12 at 14:36
  • probably using the header from curl, like http://stackoverflow.com/questions/2510868/php-convert-curl-exec-output-to-utf8 – Stefan Rogin Jul 17 '12 at 14:38
  • Well, this is bash and not PHP. I'm not sure that the curl command-line client gives you easy access to this header. You might consider writing this in Python -- see [this answer](http://stackoverflow.com/a/3683863/501250) for a possible solution. – cdhowie Jul 17 '12 at 14:56
  • Not sure what your SQL table looks like, but if you plan to search them by content in the page unindexed, it is no better than grep. – pizza Jul 18 '12 at 22:53
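On the question raised in the comments of whether the curl command-line client exposes the header: curl's `--write-out` option has a `%{content_type}` variable, so the charset can be pulled out of that string. A minimal sketch, assuming the header looks like `text/html; charset=...`; the function name `charset_from_content_type` is made up for illustration:

```shell
# Hypothetical helper: print the charset named in a Content-Type value,
# or print nothing if the value does not name one.
charset_from_content_type() {
    printf '%s\n' "$1" |
        sed -n 's/.*[Cc][Hh][Aa][Rr][Ss][Ee][Tt][[:space:]]*=[[:space:]]*"\{0,1\}\([A-Za-z0-9._:-]*\).*/\1/p'
}

# Usage (needs network access, so left as a comment):
#   ctype=$(curl -sL --compressed -o /dev/null -w '%{content_type}' "$link")
#   charset=$(charset_from_content_type "$ctype")
```

Many servers send no charset parameter at all, in which case the function prints nothing and the page itself still has to be inspected.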

3 Answers

9

Short question, short answer:

man iconv

Now you have one more problem: determining the source encoding of your web page. (Tip: search Google for charsetdetector.)
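Once the source encoding is known, iconv works as a filter, so no temporary file is needed. A minimal sketch; the wrapper name `to_utf8` is only illustrative, and the charset name must be one that the local iconv understands:

```shell
# Hypothetical wrapper: convert stdin from the charset given in $1 to UTF-8.
to_utf8() {
    iconv -f "$1" -t UTF-8
}

# Usage with a shell variable (no file involved):
#   html_utf8=$(printf '%s' "$html" | to_utf8 ISO-8859-1)
```

Keep in mind that command substitution strips trailing newlines and that bash variables cannot hold NUL bytes, so this is only suitable for text pages.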

Scharron
0

In the general case, this cannot be done correctly without a parser; scripting won't cut it. If your goal is to store the page, treat it as binary: compress it and convert it to a printable form.
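One possible reading of "compress and convert to a printable form" is gzip followed by base64. This is a sketch under that assumption; the names `pack_page` and `unpack_page` are invented here, and the `base64` tool with `--decode` is assumed to be available (as in GNU coreutils):

```shell
# Hypothetical helpers: bytes on stdin -> single-line base64 of the gzipped data,
# and back again. The original bytes are never interpreted as text.
pack_page() {
    gzip -c | base64 | tr -d '\n'
}

unpack_page() {
    base64 --decode | gzip -dc
}

# Usage:
#   packed=$(curl -sL --compressed "$link" | pack_page)
#   printf '%s' "$packed" | unpack_page
```

The packed string contains only `A-Z`, `a-z`, `0-9`, `+`, `/` and `=`, so it needs no escaping in an SQL string literal, but it cannot be searched without unpacking it first.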

pizza
  • define `parser`, give an example. In my problem I would like to mimic what the browser outputs, use that algorithm to specify the encoding and convert it to utf8. – Stefan Rogin Jul 18 '12 at 06:57
  • also if I store it as binary how would I search through it if I don't know the encoding? – Stefan Rogin Jul 18 '12 at 07:53
  • Software that understands HTML syntax and can separate HTML into its different elements properly; a web browser has one. You don't know the encoding of the page unless you look at the HTTP Content-Type header and/or the meta tag of the HTML. – pizza Jul 18 '12 at 08:14
  • I don't know what your goal is, if you just want the text, use a browser like lynx and dump it out as formatted text instead of curl. – pizza Jul 18 '12 at 08:15
0

Here is the solution I went for:

#!/bin/bash 

result="$( { stdout="$(curl -Lsv --compressed "$1")" ; } 2>&1; echo "--SePaRaToR--"; echo "$stdout")"
echo '
found:'
echo "$result" | grep -o '\(charset\|encoding\)[ ]*=[ ]*["]*[a-zA-Z0-9_: -]*'
echo ' '
status=1
charset="ISO_8859-1" #set default
# 1: HTTP Content-Type: header 
# 2: <meta> element in the page 
# 3: <xml> element in the page
regex='.*(charset|encoding)\s*=\s*["]*([a-zA-Z0-9_: -]*)'
if [[ "$result" =~ $regex  ]]
    then
        charset="${BASH_REMATCH[2]}"    
        status=2
        echo "match success: $charset"
    else 
        echo "match fail: $charset : ${BASH_REMATCH[2]}" 
fi


if [[ "$charset" == *utf-8* || "$charset" == *UTF-8* ]]
    then
        # already UTF-8: keep the data unchanged
        charset='NotModified'
        html="$result"
    else
        echo "iconv '$charset' to UTF-8//TRANSLIT"
        html=$(echo "$result" | iconv -f"$charset" -t'UTF-8//TRANSLIT')
        if [ $? -ne 0 ]
            then
                # set before the inner check so that status=4 is not overwritten
                status=3
                echo "translit failed : iconv '$charset' to UTF-8//IGNORE"
                html=$(echo "$result" | iconv -f"$charset" -t"UTF-8//IGNORE")
                if [ $? -ne 0 ]
                    then
                        # last resort: retry with the default charset
                        charset="ISO_8859-1"
                        echo "ignore failed : iconv '$charset' to UTF-8//IGNORE"
                        html=$(echo "$result" | iconv -f"$charset" -t"UTF-8//IGNORE")
                        status=4
                fi
        fi
fi
echo "charset: '$charset' , status: '$status'"

The default charset (ISO_8859-1) is the W3C recommendation.
It's not 100% accurate, but it's fast and it will do its job 99% of the time.
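The script's comments list the sources in order (HTTP header, then the page itself), but a regex that starts with a greedy `.*` tends to capture the last declaration in the combined text rather than the first. The following is a sketch of one way to make the order explicit, assuming the header value and the body are available separately; `detect_charset` and its variable names are hypothetical, and the body lookup simply takes the first `charset=` or `encoding=` it finds without distinguishing a meta tag from an XML declaration:

```shell
# Hypothetical helper: choose a charset by priority.
#   $1: value of the HTTP Content-Type header (may be empty)
#   $2: page body, or just its first few kilobytes
detect_charset() {
    pattern="(charset|encoding)[[:space:]]*=[[:space:]]*[\"']?[A-Za-z0-9._:-]+"
    strip="s/.*=[[:space:]]*[\"']*//"

    # 1: the HTTP header wins if it names a charset
    found=$(printf '%s\n' "$1" | grep -Eio "$pattern" | head -n 1 | sed "$strip")

    # 2 and 3: otherwise use the first declaration found in the body
    if [ -z "$found" ]; then
        found=$(printf '%s\n' "$2" | grep -Eio "$pattern" | head -n 1 | sed "$strip")
    fi

    # fall back to the same default as the script above
    printf '%s\n' "${found:-ISO-8859-1}"
}
```

Like the original regex, this is pattern matching rather than HTML parsing, so a `charset=` string inside a script or comment can still produce a false match.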

Hope it helps someone in the same situation.
Thanks also to everyone who answered.

Stefan Rogin
  • If there is no 1, 2 or 3 in the page, and there is data content, i.e. "", then the iconv command will fail. Don't you want to limit $result to an acceptable valid subset? – pizza Jul 18 '12 at 17:37
  • I could add something like: `IF [ "${BASH_REMATCH[2]}" in $(iconv --list) ] then charset="${BASH_REMATCH[2]}" ; else leave default or try previous match; fi` but that would slow down the process, and I need to scan over 100k domains/day. I will handle these errors in post processing and re-scan them afterwards with a heavier algorithm. – Stefan Rogin Jul 19 '12 at 08:08
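The validation idea in the last comment can be approximated without parsing `iconv --list`, by letting iconv itself reject unknown names on empty input. This is a different technique from the one in the comment, and only a sketch: `is_known_charset` is a made-up name, and it still costs one process per check, so caching the results would matter at 100k domains/day.

```shell
# Hypothetical helper: succeed if the local iconv accepts $1 as a source charset.
# Empty input is converted, so only the charset lookup can fail.
is_known_charset() {
    iconv -f "$1" -t UTF-8 </dev/null >/dev/null 2>&1
}

# Usage: keep the default unless the matched name is usable.
#   if is_known_charset "${BASH_REMATCH[2]}"; then
#       charset="${BASH_REMATCH[2]}"
#   fi
```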