Saturday, July 30, 2011

Perl: Detecting the Encoding of a Text File

Original post: http://www.perlmonks.org/?node_id=256728


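ActivePerl on Windows does not seem to include File::BOM or File::MMagic.
That leaves the last method from the original thread: reading the first two bytes of the file to determine its encoding.
Example: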

use strict;
use warnings;

# Read the first two bytes in raw mode so no I/O layer alters them.
open my $FH_i, "<:raw", "unicode.txt" or die "Cannot open unicode.txt: $!";
my $buf = '';
read $FH_i, $buf, 2;
close $FH_i;

if ($buf eq "\xFF\xFE") {
  print "This is a Unicode (UTF-16) little-endian file.\n";
} elsif ($buf eq "\xFE\xFF") {
  print "This is a Unicode (UTF-16) big-endian file.\n";
} else {
  # No UTF-16 BOM: the file may be ASCII or any other encoding without a BOM.
  print "No UTF-16 BOM found; this is probably an ASCII file.\n";
}

How do I determine encoding format of a file?

Perl 5.8 has a module called "Encode::Guess", which might work well if you know the language involved and/or can provide some hints as to the likely candidates. (I haven't tried it yet, but it is admittedly limited and speculative at present.)
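As a rough illustration of the Encode::Guess suggestion above, here is a minimal sketch. The helper name guess_name is made up for this example, and any extra suspect encodings passed to it (for instance 'euc-jp' or 'shiftjis') are assumptions the caller has to supply. When the data is ambiguous, guess_encoding returns an error string rather than an object, which the helper turns into undef.

```perl
use strict;
use warnings;
use Encode::Guess;    # exports guess_encoding by default

# Hypothetical helper (not from the original thread): return the name of
# the guessed encoding for a byte string, or undef if no single guess fits.
sub guess_name {
    my ($octets, @suspects) = @_;
    my $decoder = guess_encoding($octets, @suspects);
    # On success $decoder is an encoding object; on failure it is a
    # plain string describing why the guess failed.
    return ref($decoder) ? $decoder->name : undef;
}
```

A caller would read the file in raw mode, pass its contents to guess_name, and treat undef as "could not decide".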
Answer: How do I determine encoding format of a file?
contributed by idsfa
File::BOM provides get_encoding_from_filehandle and get_encoding_from_stream to identify the encoding of Unicode files. Example:
use File::BOM qw( :all );
open my $fh, '<', $filename or die "Cannot open $filename: $!";
my ($encoding) = get_encoding_from_filehandle($fh);
Answer: How do I determine encoding format of a file?
contributed by particle
Have a look at File::MMagic. It guesses the file type given a filename or a filehandle, and is quite configurable (you can add more file type descriptions based on regular expressions). It's a handy little module.
Answer: How do I determine encoding format of a file?
contributed by donno20
Read the first two bytes of the file. The corresponding encodings and hex codes are as follows:
Unicode (UTF-16) Little Endian = "\xFF\xFE"
Unicode (UTF-16) Big Endian = "\xFE\xFF"
UTF-8 = "\xEF\xBB" (the first two bytes of the three-byte BOM "\xEF\xBB\xBF")
ASCII = no marker; the content starts immediately
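The byte signatures listed above can be folded into one lookup. The sketch below is only an illustration: the names detect_bom and detect_file_bom are invented for this example, and the UTF-32 checks go beyond what the thread mentions. It reads four bytes instead of two so that the full three-byte UTF-8 mark can be matched.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original thread): map the leading
# bytes of a file to an encoding name; returns undef when no BOM is found.
sub detect_bom {
    my ($head) = @_;
    # Test the 4-byte UTF-32 marks first, because the UTF-32LE BOM starts
    # with the same two bytes as the UTF-16LE BOM. (A UTF-16LE file whose
    # first character is NUL would be misread as UTF-32LE.)
    return 'UTF-32LE' if $head =~ /\A\xFF\xFE\x00\x00/;
    return 'UTF-32BE' if $head =~ /\A\x00\x00\xFE\xFF/;
    return 'UTF-8'    if $head =~ /\A\xEF\xBB\xBF/;
    return 'UTF-16LE' if $head =~ /\A\xFF\xFE/;
    return 'UTF-16BE' if $head =~ /\A\xFE\xFF/;
    return;    # no BOM: ASCII, or any other encoding without a marker
}

# Hypothetical wrapper: read up to four raw bytes from a file and classify them.
sub detect_file_bom {
    my ($path) = @_;
    open my $fh, '<:raw', $path or die "Cannot open $path: $!";
    my $head = '';
    read $fh, $head, 4;
    close $fh;
    return detect_bom($head);
}
```

Calling detect_file_bom on a file name returns the encoding name, or undef when the file has no BOM. A missing BOM does not prove the file is ASCII; it only means the encoding has to be determined some other way.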
