
I have a long text file which apparently uses different encodings (ISO-8859-1 or UTF-8) in successive blocks of text. It is the result of appending text with `>> file.bib`, copied and pasted from different sources (web pages).

The blocks can in principle be distinguished, since they are BibTeX entries:

 @article{key, author={lastname, firstname}, ...}

I would like to convert it to a coherent UTF-8 file, since the mixed encodings seem to crash my BibTeX viewer (KBibTeX). I know that I can use iconv to convert the encoding of an entire file, but I would like to know whether there is a way to fix my file without corrupting some of the entries.

highsciguy
  • Give much more detail; see [Questions about converting a mixed-encoding file to UTF8 in Perl](http://stackoverflow.com/questions/6897982/questions-about-converting-a-mixed-encoding-file-to-utf8-in-perl) for a comparison of what information is useful. – daxim May 21 '12 at 14:54
  • You should start by splitting the file into the individual HTML documents. Then you can check each document for a BOM and for a charset in the HEAD element. – ikegami May 21 '12 at 16:29

2 Answers


If you can assume a uniform encoding within each line AND you know the alternate encoding:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Encode;

    while (my $octets = <>) {
        my $line;
        eval {
            # FB_CROAK makes decode die on malformed UTF-8 instead of
            # silently substituting replacement characters
            $line = Encode::decode('UTF-8', $octets, Encode::FB_CROAK | Encode::LEAVE_SRC);
        };
        if ($@) {
            $line = Encode::decode('iso-8859-1', $octets);  # not UTF-8
        }
        # Now $line is a Perl (Unicode) string. Do something with it,
        # e.g. re-encode as UTF-8 and print:
        print Encode::encode('UTF-8', $line);
    }

You can do the same word by word if the lines mix encodings, as long as you still know what the alternate encoding is (a sketch follows below). If you do not know the alternate encoding, or if there is more than one, you need an encoding-guessing library, which may well guess wrong.
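
For illustration, a word-by-word variant of the same idea could look roughly like this (a sketch that assumes ISO-8859-1 is the only alternative encoding and that multi-byte UTF-8 sequences are never split across whitespace):

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Encode;

    while (my $octets = <>) {
        # keep the whitespace separators so the line can be reassembled as-is
        for my $chunk (split /(\s+)/, $octets) {
            my $text = eval {
                Encode::decode('UTF-8', $chunk, Encode::FB_CROAK | Encode::LEAVE_SRC);
            };
            $text = Encode::decode('iso-8859-1', $chunk) unless defined $text;
            print Encode::encode('UTF-8', $text);
        }
    }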

Alien Life Form
  • If it's between UTF-8 and ISO-8859-1, use the `fix_latin` tool that comes with [Encoding::FixLatin](http://search.cpan.org/perldoc?Encoding::FixLatin) instead of Alien Life Form's code. – ikegami May 21 '12 at 16:30
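
A minimal sketch of using that module directly from Perl (Encoding::FixLatin exports a fix_latin() function; check its documentation for the exact options and for the bundled fix_latin command-line script):

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Encoding::FixLatin qw(fix_latin);

    binmode STDOUT, ':encoding(UTF-8)';
    while (my $octets = <>) {
        # valid UTF-8 passes through untouched; anything else is treated
        # as Latin-1/CP1252 and converted
        print fix_latin($octets);
    }

Invoked as, say, perl fixbib.pl file.bib > file-utf8.bib (the script name is only an example).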

I use vim for this, but I guess it can be done in any editor.

  • Select (shift+v) a block of text that you want to change encoding on.

  • type :!enca -L lang (replace 'lang' with your language; I use enca -L cs). The enca utility should then tell you the most probable encoding of the selected block.

  • press u (to undo enca's output, which replaced the selected text)

  • select the block again, this time running :!iconv -f determined_encoding -t UTF-8

Note that vim automatically expands a pressed : to :'<,'> when you are in visual mode, which is exactly what you want for running programs on text blocks.
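
For example, if enca reports ISO-8859-2 for the selected block (that encoding name is only an illustration), the two passes look like this:

    :'<,'>!enca -L cs
    " undo enca's output with u, reselect the block, then:
    :'<,'>!iconv -f ISO-8859-2 -t UTF-8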

exa