
I18N::Charset(3pm)    User Contributed Perl Documentation   I18N::Charset(3pm)

NAME
       I18N::Charset - IANA Character Set Registry names and Unicode::MapUTF8
       (et al.) conversion scheme names

SYNOPSIS
         use I18N::Charset;

         $sCharset = iana_charset_name('WinCyrillic');
         # $sCharset is now 'windows-1251'
         $sCharset = umap_charset_name('Adobe DingBats');
         # $sCharset is now 'ADOBE-DINGBATS' which can be passed to Unicode::Map->new()
         $sCharset = map8_charset_name('windows-1251');
         # $sCharset is now 'cp1251' which can be passed to Unicode::Map8->new()
         $sCharset = umu8_charset_name('x-sjis');
         # $sCharset is now 'sjis' which can be passed to Unicode::MapUTF8->new()
         $sCharset = libi_charset_name('x-sjis');
         # $sCharset is now 'MS_KANJI' which can be passed to `iconv -f $sCharset ...`
         $sCharset = enco_charset_name('Shift-JIS');
         # $sCharset is now 'shiftjis' which can be passed to Encode::from_to()

         I18N::Charset::add_iana_alias('my-japanese' => 'iso-2022-jp');
         I18N::Charset::add_map8_alias('my-arabic' => 'arabic7');
         I18N::Charset::add_umap_alias('my-hebrew' => 'ISO-8859-8');
         I18N::Charset::add_libi_alias('my-sjis' => 'x-sjis');
         I18N::Charset::add_enco_alias('my-japanese' => 'shiftjis');

DESCRIPTION
       The "I18N::Charset" module provides access to the IANA Character Set
       Registry names for identifying character encoding schemes.  It also
       maps those names to the character set names used by the Unicode::Map,
       Unicode::Map8, and Unicode::MapUTF8 modules, the iconv library, and
       the Encode module.

       So, for example, if you get an HTML document with a META CHARSET="..."
       tag, you can fairly quickly determine what Unicode::MapXXX module can
       be used to convert it to Unicode.
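
       As a rough sketch of that workflow, the code below pulls the charset
       label out of a META tag and looks it up.  The helper
       charset_label_from_html() and the sample markup are illustrative
       assumptions, not part of this module, and the regex only
       approximates real HTML parsing.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of I18N::Charset): pull the charset label
# out of an HTML META tag.  A regex is only a rough sketch; real documents
# are better handled with an HTML parser.
sub charset_label_from_html {
    my ($html) = @_;
    return $1
        if $html =~ /<meta\b[^>]*\bcharset\s*=\s*["']?([A-Za-z0-9._:+-]+)/i;
    return;
}

# Sample markup, made up for this example.
my $html  = '<meta http-equiv="Content-Type" content="text/html; charset=WinCyrillic">';
my $label = charset_label_from_html($html);

# Only consult the registry if the module is actually installed.
if (defined $label && eval { require I18N::Charset; 1 }) {
    my $iana = I18N::Charset::iana_charset_name($label);
    my $map8 = I18N::Charset::map8_charset_name($label);
    print "IANA name: ", (defined $iana ? $iana : '(unknown)'), "\n";
    print "Map8 name: ", (defined $map8 ? $map8 : '(unavailable)'), "\n";
}
```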

       If you don't have the module Unicode::Map installed, the umap_
       functions will always return undef.  If you don't have the module
       Unicode::Map8 installed, the map8_ functions will always return undef.
       If you don't have the module Unicode::MapUTF8 installed, the umu8_
       functions will always return undef.  If you don't have the iconv
       library installed, the libi_ functions will always return undef.  If
       you don't have the Encode module installed, the enco_ functions will
       always return undef.
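
       Because any of these function families may return undef, a caller
       may want to try several of them in turn.  A minimal sketch, assuming
       a hypothetical helper first_available_name() and an arbitrary
       preference order:

```perl
use strict;
use warnings;

# Hypothetical helper: call each lookup in turn and return the label of the
# first one that yields a defined name, together with that name.
sub first_available_name {
    my ($charset, @lookups) = @_;    # @lookups holds [label, coderef] pairs
    for my $pair (@lookups) {
        my ($label, $code) = @$pair;
        my $name = $code->($charset);
        return ($label, $name) if defined $name;
    }
    return;                          # no lookup produced a name
}

if (eval { require I18N::Charset; 1 }) {
    # The order below is just one possible preference.
    my ($backend, $name) = first_available_name(
        'windows-1251',
        [ 'Encode'  => \&I18N::Charset::enco_charset_name ],
        [ 'MapUTF8' => \&I18N::Charset::umu8_charset_name ],
        [ 'Map8'    => \&I18N::Charset::map8_charset_name ],
        [ 'iconv'   => \&I18N::Charset::libi_charset_name ],
    );
    if (defined $backend) {
        print "use $backend with '$name'\n";
    } else {
        print "no conversion backend found\n";
    }
}
```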

CONVERSION ROUTINES
       There are four main conversion routines: "iana_charset_name()",
       "map8_charset_name()", "umap_charset_name()", and
       "umu8_charset_name()".  The other routines documented in this section
       cover MIME, Encode, iconv, and MIBenum lookups.

       iana_charset_name()
           This function takes a string containing the name of a character set
           and returns a string which contains the official IANA name of the
           character set identified. If no valid character set name can be
           identified, then "undef" will be returned.  The case and
           punctuation within the string are not important.

               $sCharset = iana_charset_name('WinCyrillic');

       mime_charset_name()
           This function takes a string containing the name of a character set
           and returns a string which contains the preferred MIME name of the
           character set identified. If no valid character set name can be
           identified, then "undef" will be returned.  The case and
           punctuation within the string are not important.

               $sCharset = mime_charset_name('Extended_UNIX_Code_Packed_Format_for_Japanese');

       enco_charset_name()
           This function takes a string containing the name of a character set
           and returns a string which contains a name of the character set
           suitable to be passed to the Encode module.  If no valid character
           set name can be identified, or if Encode is not installed, then
           "undef" will be returned.  The case and punctuation within the
           string are not important.

               $sCharset = enco_charset_name('Extended_UNIX_Code_Packed_Format_for_Japanese');
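
           The returned name is meant to be handed to Encode.  A minimal
           sketch of that hand-off, assuming a hypothetical helper
           decode_labelled() that takes the lookup routine as a code
           reference; the sample octets are made up for illustration:

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: resolve $label through $lookup (for instance
# \&I18N::Charset::enco_charset_name) and decode $octets with Encode.
# Returns undef when the label cannot be resolved.
sub decode_labelled {
    my ($lookup, $label, $octets) = @_;
    my $enc = $lookup->($label);
    return unless defined $enc;
    return Encode::decode($enc, $octets);
}

if (eval { require I18N::Charset; 1 }) {
    my $text = decode_labelled(\&I18N::Charset::enco_charset_name,
                               'Shift-JIS', "\x82\xA0");
    if (defined $text) {
        print "decoded ", length($text), " character(s)\n";
    } else {
        print "encoding not recognised\n";
    }
}
```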

       libi_charset_name()
           This function takes a string containing the name of a character set
           and returns a string which contains a name of the character set
           suitable to be passed to iconv.  If no valid character set name can
           be identified, then "undef" will be returned.  The case and
           punctuation within the string are not important.

               $sCharset = libi_charset_name('Extended_UNIX_Code_Packed_Format_for_Korean');
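
           One way to use the result is to build an argument list for the
           iconv command-line tool.  The helper iconv_argv(), the target
           encoding, and the file name below are assumptions made for this
           sketch:

```perl
use strict;
use warnings;

# Hypothetical helper: build an argument list for the iconv command-line
# tool.  $lookup is expected to behave like libi_charset_name().
sub iconv_argv {
    my ($lookup, $from_label, $to_name, $file) = @_;
    my $from = $lookup->($from_label);
    return unless defined $from;     # no iconv name for this label
    return ('iconv', '-f', $from, '-t', $to_name, $file);
}

if (eval { require I18N::Charset; 1 }) {
    # 'input.txt' and the UTF-8 target are placeholder choices.
    my @argv = iconv_argv(\&I18N::Charset::libi_charset_name,
                          'x-sjis', 'UTF-8', 'input.txt');
    if (@argv) {
        print "would run: @argv\n";
    } else {
        print "no iconv name for that label\n";
    }
}
```

           Passing such a list to system() in its list form avoids shell
           quoting problems with unusual charset labels.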

       mib_to_charset_name
           This function takes a string containing the MIBenum of a character
           set and returns a string which contains a name for the character
           set.  If the given MIBenum does not correspond to any character
           set, then "undef" will be returned.

               $sCharset = mib_to_charset_name('3');

       mib_charset_name
           This is a synonym for mib_to_charset_name.

       charset_name_to_mib
           This function takes a string containing the name of a character set
           in almost any format and returns a MIBenum for the character set.
           For IANA-registered character sets, this is the IANA-registered
           MIB.  For non-IANA character sets, this is an unambiguous unique
           string whose only use is to pass to other functions in this module.
           If no valid character set name can be identified, then "undef" will
           be returned.

               $iMIB = charset_name_to_mib('US-ASCII');
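
           Since every label for a character set resolves to one
           identifier, the MIBenum can be used to test whether two labels
           name the same set.  A minimal sketch, assuming a hypothetical
           helper same_charset():

```perl
use strict;
use warnings;

# Hypothetical helper: two labels name the same character set when they
# resolve to the same MIBenum.  $to_mib is expected to behave like
# charset_name_to_mib(), returning undef for unknown labels.
sub same_charset {
    my ($to_mib, $first, $second) = @_;
    my $mib1 = $to_mib->($first);
    my $mib2 = $to_mib->($second);
    return 0 unless defined $mib1 && defined $mib2;
    # Compare as strings: non-IANA sets use opaque string identifiers.
    return $mib1 eq $mib2 ? 1 : 0;
}

if (eval { require I18N::Charset; 1 }) {
    my $to_mib = \&I18N::Charset::charset_name_to_mib;
    for my $other ('ascii', 'windows-1251') {
        my $verdict = same_charset($to_mib, 'US-ASCII', $other)
                    ? 'same set' : 'different or unknown';
        print "US-ASCII vs $other: $verdict\n";
    }
}
```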

       map8_charset_name()
           This function takes a string containing the name of a character set
           (in almost any format) and returns a string which contains a name
           for the character set that can be passed to Unicode::Map8::new().
           Note: the returned string will be capitalized just like the name of
           the .bin file in the Unicode::Map8::MAPS_DIR directory.  If no
           valid character set name can be identified, then "undef" will be
           returned.  The case and punctuation within the argument string are
           not important.

               $sCharset = map8_charset_name('windows-1251');

       umap_charset_name()
           This function takes a string containing the name of a character set
           (in almost any format) and returns a string which contains a name
           for the character set that can be passed to Unicode::Map::new(). If
           no valid character set name can be identified, then "undef" will be
           returned.  The case and punctuation within the argument string are
           not important.

               $sCharset = umap_charset_name('hebrew');

       umu8_charset_name()
           This function takes a string containing the name of a character set
           (in almost any format) and returns a string which contains a name
           for the character set that can be passed to
           Unicode::MapUTF8::new(). If no valid character set name can be
           identified, then "undef" will be returned.  The case and
           punctuation within the argument string are not important.

               $sCharset = umu8_charset_name('windows-1251');

QUERY ROUTINES
       There is one function which can be used to obtain a list of all IANA-
       registered character set names.

       "all_iana_charset_names()"
           Returns a list of all registered IANA character set names.  The
           names are not in any particular order.
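
           Because the order is unspecified, callers that display or search
           the list will usually sort it themselves.  A minimal sketch,
           assuming a hypothetical helper matching_names() and an example
           pattern:

```perl
use strict;
use warnings;

# Hypothetical helper: case-insensitive filter plus sort, since the
# registry list comes back in no particular order.
sub matching_names {
    my ($pattern, @names) = @_;
    return sort { lc $a cmp lc $b } grep { /$pattern/i } @names;
}

if (eval { require I18N::Charset; 1 }) {
    my @all = I18N::Charset::all_iana_charset_names();
    print scalar(@all), " registered names\n";
    print "$_\n" for matching_names(qr/^ISO-8859-/, @all);
}
```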

CHARACTER SET NAME ALIASING
       This module supports several semi-private routines for specifying
       character set name aliases.

       add_iana_alias()
           This function takes two strings: a new alias, and a target IANA
           Character Set Name (or another alias).  It defines the new alias to
           refer to that character set name (or to the character set name to
           which the second alias refers).

           Returns the target character set name of the successfully installed
           alias.  Returns 'undef' if the target character set name (or the
           character set name to which the second alias refers) is not
           registered.

             I18N::Charset::add_iana_alias('my-alias1' => 'Shift_JIS');

           With this code, "my-alias1" becomes an alias for the existing IANA
           character set name 'Shift_JIS'.

             I18N::Charset::add_iana_alias('my-alias2' => 'sjis');

           With this code, "my-alias2" becomes an alias for the IANA character
           set name referred to by the existing alias 'sjis' (which happens to
           be 'Shift_JIS').
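
           Since a rejected alias is signalled only by an undef return
           value, it is worth checking each call.  A minimal sketch,
           assuming a hypothetical helper install_aliases() and made-up
           alias names:

```perl
use strict;
use warnings;

# Hypothetical helper: install several aliases through $adder (a coderef
# with the calling convention of add_iana_alias) and report the ones that
# were rejected, i.e. those for which $adder returned undef.
sub install_aliases {
    my ($adder, %target_for) = @_;
    my @rejected;
    for my $alias (sort keys %target_for) {
        my $installed = $adder->($alias, $target_for{$alias});
        push @rejected, $alias unless defined $installed;
    }
    return @rejected;
}

if (eval { require I18N::Charset; 1 }) {
    my @bad = install_aliases(
        \&I18N::Charset::add_iana_alias,
        'my-alias1' => 'Shift_JIS',
        'my-alias3' => 'no-such-charset-name',   # deliberately bogus target
    );
    warn "alias not installed: $_\n" for @bad;
}
```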

       add_map8_alias()
           This function takes two strings: a new alias, and a target
           Unicode::Map8 Character Set Name (or an existing alias to a Map8
           name).  It defines the new alias to refer to that mapping name (or
           to the mapping name to which the second alias refers).

           If the first argument is a registered IANA character set name, then
           all aliases of that IANA character set name will end up pointing to
           the target Map8 mapping name.

           Returns the target mapping name of the successfully installed
           alias.  Returns 'undef' if the target mapping name (or the mapping
           name to which the second alias refers) is not registered.

             I18N::Charset::add_map8_alias('normal' => 'ANSI_X3.4-1968');

           With the above statement, "normal" becomes an alias for the
           existing Unicode::Map8 mapping name 'ANSI_X3.4-1968'.

             I18N::Charset::add_map8_alias('normal' => 'US-ASCII');

           With the above statement, "normal" becomes an alias for the
           existing Unicode::Map8 mapping name 'ANSI_X3.4-1968' (which is what
           "US-ASCII" is an alias for).

             I18N::Charset::add_map8_alias('IBM297' => 'EBCDIC-CA-FR');

           With the above statement, "IBM297" becomes an alias for the
           existing Unicode::Map8 mapping name 'EBCDIC-CA-FR'.  As a side
           effect, all the aliases for 'IBM297' (i.e. 'cp297' and
           'ebcdic-cp-fr') also become aliases for 'EBCDIC-CA-FR'.

       add_umap_alias()
           This function works identically to add_map8_alias() above, but
           operates on Unicode::Map encoding tables.

       add_libi_alias()
           This function takes two strings: a new alias, and a target iconv
           Character Set Name (or existing iconv alias).  It defines the new
           alias to refer to that character set name (or to the character set
           name to which the existing alias refers).

           Returns the target conversion scheme name of the successfully
           installed alias.  Returns 'undef' if there is no such target
           conversion scheme or alias.

           Examples:

             I18N::Charset::add_libi_alias('my-chinese1' => 'CN-GB');

           With this code, "my-chinese1" becomes an alias for the existing
           iconv conversion scheme 'CN-GB'.

             I18N::Charset::add_libi_alias('my-chinese2' => 'EUC-CN');

           With this code, "my-chinese2" becomes an alias for the iconv
           conversion scheme referred to by the existing alias 'EUC-CN' (which
           happens to be 'CN-GB').

       add_enco_alias()
           This function takes two strings: a new alias, and a target Encode
           encoding name (or existing Encode alias).  It defines the new alias
           to refer to that encoding name (or to the encoding to which the
           existing alias refers).

           Returns the target encoding name of the successfully installed
           alias.  Returns 'undef' if there is no such encoding or alias.

           Examples:

             I18N::Charset::add_enco_alias('my-japanese1' => 'jis0201-raw');

           With this code, "my-japanese1" becomes an alias for the existing
           encoding 'jis0201-raw'.

             I18N::Charset::add_enco_alias('my-japanese2' => 'my-japanese1');

           With this code, "my-japanese2" becomes an alias for the encoding
           referred to by the existing alias 'my-japanese1' (which happens to
           be 'jis0201-raw' after the previous call).

KNOWN BUGS AND LIMITATIONS
       •   There could probably be many more aliases added (for convenience)
           to all the IANA names.  If you have some specific recommendations,
           please email the author!

       •   The only character set names which have a corresponding mapping in
           the Unicode::Map8 module are the character sets that Unicode::Map8
           can convert.

           Similarly, the only character set names which have a corresponding
           mapping in the Unicode::Map module are the character sets that
           Unicode::Map can convert.

       •   In the current implementation, all tables are read in and
           initialized when the module is loaded, and then held in memory
           until the program exits.  A "lazy" implementation (or a less-
           portable tied hash) might lead to a shorter startup time.
           Suggestions, patches, comments are always welcome!

SEE ALSO
       Unicode::Map
           Convert strings from various multi-byte character encodings to and
           from Unicode.

       Unicode::Map8
           Convert strings from various 8-bit character encodings to and from
           Unicode.

       Jcode
           Convert strings among various Japanese character encodings and
           Unicode.

       Unicode::MapUTF8
           A wrapper around all three of these character set conversion
           distributions.

AUTHOR
       Martin 'Kingpin' Thurn, "mthurn at cpan.org",
       <http://tinyurl.com/nn67z>.

LICENSE
       This module is free software; you can redistribute it and/or modify it
       under the same terms as Perl itself.

perl v5.32.1                      2021-02-27                I18N::Charset(3pm)
