While many languages can represent every necessary character with a
one-to-one mapping to an 8-bit value, several languages require so many
characters for written communication that they cannot all fit within
the range a single byte can encode. Multibyte character encoding
schemes were developed to express more than 256 characters in the
regular bytewise coding system.
When you manipulate (trim, split, splice, etc.) strings encoded in a
multibyte encoding, you need to use special functions, because two or
more consecutive bytes may represent a single character in such encoding
schemes. If you apply a non-multibyte-aware string function to such a
string, it will probably fail to detect where a multibyte character
begins or ends, and the result is a corrupted string that has most
likely lost its original meaning.
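The difference can be seen in a minimal sketch, assuming the mbstring extension is loaded and the sample text is UTF-8. The variable names are illustrative only, and the text is written with Unicode escapes so that the encoding of the script file itself does not matter.

```php
<?php
// "日本語" as Unicode escapes: three characters, three bytes each in UTF-8.
$text = "\u{65E5}\u{672C}\u{8A9E}";

// Byte-oriented functions count and cut bytes, not characters.
$byteLength = strlen($text);        // 9
$brokenCut  = substr($text, 0, 4);  // stops inside the second character

// Multibyte-aware counterparts count and cut characters.
$charLength = mb_strlen($text, 'UTF-8');        // 3
$cleanCut   = mb_substr($text, 0, 2, 'UTF-8');  // first two characters, intact

var_dump(mb_check_encoding($brokenCut, 'UTF-8')); // bool(false)
var_dump(mb_check_encoding($cleanCut, 'UTF-8'));  // bool(true)
```

The byte-wise cut leaves a dangling lead byte at the end of the string, which is why the result is no longer valid UTF-8.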
mbstring provides multibyte-specific string functions that help you
deal with multibyte encodings in PHP, whose standard string functions
are essentially designed for single-byte encodings.
In addition, mbstring handles character encoding conversion between
the possible encoding pairs.
Although mbstring was originally developed for use in Japanese web
pages, it is also designed to handle Unicode-based encodings such as
UTF-8 and UCS-2 and, for convenience, many single-byte encodings
(listed below).
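As a rough illustration of encoding conversion, the following sketch converts a short string between UTF-8 and ISO-8859-1 with mb_convert_encoding(). It assumes the mbstring extension is loaded; the sample string and variable names are made up for the example.

```php
<?php
// "café": the "é" (U+00E9) occupies two bytes in UTF-8.
$utf8 = "caf\u{E9}";

// mb_convert_encoding(string, to_encoding, from_encoding)
$latin1 = mb_convert_encoding($utf8, 'ISO-8859-1', 'UTF-8');

echo bin2hex($utf8), "\n";   // 636166c3a9
echo bin2hex($latin1), "\n"; // 636166e9

// Converting back restores the original byte sequence.
$roundTrip = mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1');
var_dump($roundTrip === $utf8); // bool(true)
```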
Encodings of the following types can be used safely with PHP:

- A single-byte encoding.
- A multibyte encoding
  - that has ASCII-compatible mappings for the characters in the range
    00h to 7fh,
  - that does not use ISO 2022 escape sequences, and
  - that does not use a value from 00h to 7fh in any of the bytes that
    together represent a single character.
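The last condition can be illustrated with a single character. The sketch below, which assumes the mbstring extension is loaded, encodes the katakana "ソ" (U+30BD) in SJIS and in EUC-JP and inspects the resulting bytes. The helper hasOnlyHighBytes() is hypothetical and exists only for this example.

```php
<?php
// Katakana "ソ" (U+30BD), written as an escape so the file encoding is irrelevant.
$char = "\u{30BD}";

$sjis  = mb_convert_encoding($char, 'SJIS', 'UTF-8');
$eucJp = mb_convert_encoding($char, 'EUC-JP', 'UTF-8');

echo bin2hex($sjis), "\n";  // 835c: the second byte is 5Ch, the ASCII backslash
echo bin2hex($eucJp), "\n"; // a5bd: both bytes are above 7Fh

// Hypothetical helper: true if no byte of the string falls in 00h-7Fh.
function hasOnlyHighBytes(string $bytes): bool
{
    return preg_match('/[\x00-\x7F]/', $bytes) === 0;
}

var_dump(hasOnlyHighBytes($sjis));  // bool(false)
var_dump(hasOnlyHighBytes($eucJp)); // bool(true)
```

A byte-oriented parser that sees the SJIS form may treat the trailing 5Ch as an escape character, which is the kind of problem the conditions above are meant to rule out.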
These are examples of character encodings that are unlikely to work
with PHP.
Although PHP scripts written in any of those encodings might not work,
especially when encoded strings appear as identifiers or literals in
the script, you can largely avoid using these encodings by setting up
mbstring's transparent encoding filter function for incoming HTTP
queries.
Note:
Using SJIS, BIG5, CP936, CP949 or GB18030 as the internal encoding is
strongly discouraged unless you are familiar with the parser, the
scanner and the character encoding.
Note:
If you connect a database to PHP, it is recommended that you use the
same character encoding for both the database and the internal
encoding, for ease of use and better performance.
If you are using PostgreSQL, the character encoding used in the
database and the one used in PHP may differ, as PostgreSQL supports
automatic character set conversion between the backend and the frontend.
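A minimal sketch of setting the internal encoding follows. It assumes the mbstring extension is loaded and that the database connection is configured for UTF-8 elsewhere; the sample value is made up.

```php
<?php
// Assumption: the database side already uses UTF-8.
mb_internal_encoding('UTF-8');

// With the internal encoding set, the encoding argument can be omitted.
$name = "Z\u{FC}rich"; // "Zürich"

echo mb_internal_encoding(), "\n"; // UTF-8
echo mb_strlen($name), "\n";       // 6
echo mb_strtoupper($name), "\n";   // ZÜRICH
```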
Summaries of supported encodings
Name in the IANA character set registry: ISO-10646-UCS-4
Underlying character set: ISO 10646
Description:
The Universal Character Set with 31-bit code space, standardized as UCS-4
by ISO/IEC 10646. It is kept synchronized with the latest version of the
Unicode code map.
Additional note:
If this name is used in the encoding conversion facility,
the converter attempts to identify from the preceding BOM
(byte order mark) the byte order in which the subsequent bytes
are represented.
Name in the IANA character set registry: ISO-10646-UCS-4
Underlying character set: UCS-4
Description:
See above.
Additional note:
In contrast to UCS-4, strings are always assumed
to be in big endian form.
Name in the IANA character set registry: ISO-10646-UCS-4
Underlying character set: UCS-4
Description:
See above.
Additional note:
In contrast to UCS-4, strings are always assumed
to be in little endian form.
Name in the IANA character set registry: ISO-10646-UCS-2
Underlying character set: UCS-2
Description:
The Universal Character Set with 16-bit code space, standardized as UCS-2
by ISO/IEC 10646. It is kept synchronized with the latest version of the
Unicode code map.
Additional note:
If this name is used in the encoding conversion facility,
the converter attempts to identify from the preceding BOM
(byte order mark) the byte order in which the subsequent bytes
are represented.
Name in the IANA character set registry: ISO-10646-UCS-2
Underlying character set: UCS-2
Description:
See above.
Additional note:
In contrast to UCS-2, strings are always assumed
to be in big endian form.
Name in the IANA character set registry: ISO-10646-UCS-2
Underlying character set: UCS-2
Description:
See above.
Additional note:
In contrast to UCS-2, strings are always assumed
to be in little endian form.
Name in the IANA character set registry: UTF-32
Underlying character set: Unicode
Description:
Unicode Transformation Format of 32-bit unit width, whose encoding space
refers to the Unicode codeset standard. This encoding scheme is not
identical to UCS-4 because the code space of Unicode is limited to
a 21-bit value.
Additional note:
If this name is used in the encoding conversion facility,
the converter attempts to identify from the preceding BOM
(byte order mark) the byte order in which the subsequent bytes
are represented.
Name in the IANA character set registry: UTF-32BE
Underlying character set: Unicode
Description: See above
Additional note:
In contrast to UTF-32, strings are always assumed
to be in big endian form.
Name in the IANA character set registry: UTF-32LE
Underlying character set: Unicode
Description: See above
Additional note:
In contrast to UTF-32, strings are always assumed
to be in little endian form.
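The fixed unit width and the two byte orders can be seen in a short sketch, assuming the mbstring extension is loaded. The sample string is made up and mixes characters that need 1, 2, 3 and 4 bytes in UTF-8.

```php
<?php
// "A", "é", "日" and U+1F600.
$text = "A\u{E9}\u{65E5}\u{1F600}";

$utf32be = mb_convert_encoding($text, 'UTF-32BE', 'UTF-8');
$utf32le = mb_convert_encoding($text, 'UTF-32LE', 'UTF-8');

echo strlen($text), "\n";    // 10
echo strlen($utf32be), "\n"; // 16: four characters, four bytes each

// The first character, "A" (U+0041), in each byte order.
echo bin2hex(substr($utf32be, 0, 4)), "\n"; // 00000041
echo bin2hex(substr($utf32le, 0, 4)), "\n"; // 41000000
```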
Name in the IANA character set registry: UTF-16
Underlying character set: Unicode
Description:
Unicode Transformation Format of 16-bit unit width. Note that UTF-16
is no longer the same specification as UCS-2: the surrogate mechanism
was introduced in Unicode 2.0, and UTF-16 now refers to a 21-bit code
space.
Additional note:
If this name is used in the encoding conversion facility,
the converter attempts to identify from the preceding BOM
(byte order mark) the byte order in which the subsequent bytes
are represented.
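The surrogate mechanism mentioned in the description can be sketched as follows. The helper toSurrogatePair() is hypothetical and only spells out the standard arithmetic; the comparison with mb_convert_encoding() assumes the mbstring extension is loaded.

```php
<?php
// Hypothetical helper: split a code point above U+FFFF into its
// UTF-16 high and low surrogate code units.
function toSurrogatePair(int $codePoint): array
{
    $offset = $codePoint - 0x10000;        // 20 significant bits remain
    $high   = 0xD800 + ($offset >> 10);    // upper 10 bits
    $low    = 0xDC00 + ($offset & 0x3FF);  // lower 10 bits
    return [$high, $low];
}

[$high, $low] = toSurrogatePair(0x1F600);
printf("%04x %04x\n", $high, $low); // d83d de00

// mbstring produces the same two code units for U+1F600 ...
echo bin2hex(mb_convert_encoding("\u{1F600}", 'UTF-16BE', 'UTF-8')), "\n"; // d83dde00
// ... while a character inside the Basic Multilingual Plane needs only one.
echo bin2hex(mb_convert_encoding("\u{65E5}", 'UTF-16BE', 'UTF-8')), "\n";  // 65e5
```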
Name in the IANA character set registry: UTF-16BE
Underlying character set: Unicode
Description:
See above.
Additional note:
In contrast to UTF-16, strings are always assumed
to be in big endian form.
Name in the IANA character set registry: UTF-16LE
Underlying character set: Unicode
Description:
See above.
Additional note:
In contrast to UTF-16, strings are always assumed
to be in little endian form.
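The relationship between the BOM-detecting UTF-16 name and the fixed byte-order variants UTF-16BE and UTF-16LE can be sketched as below. This assumes the mbstring extension is loaded; the BOM bytes are prepended by hand, and the variable names are illustrative.

```php
<?php
// The same character in each fixed byte order.
$be = mb_convert_encoding('A', 'UTF-16BE', 'UTF-8');
$le = mb_convert_encoding('A', 'UTF-16LE', 'UTF-8');

echo bin2hex($be), "\n"; // 0041
echo bin2hex($le), "\n"; // 4100

// Prepend the matching byte order mark and decode as plain "UTF-16",
// letting the converter pick the byte order from the BOM.
$fromBeBom = mb_convert_encoding("\xFE\xFF" . $be, 'UTF-8', 'UTF-16');
$fromLeBom = mb_convert_encoding("\xFF\xFE" . $le, 'UTF-8', 'UTF-16');

var_dump($fromBeBom === $fromLeBom);
```

When the input carries no BOM, naming UTF-16BE or UTF-16LE explicitly avoids relying on the converter's default.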
Name in the IANA character set registry: UTF-8
Underlying character set: Unicode / UCS
Description:
Unicode Transformation Format of 8-bit unit width.
Additional note: none
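Because the unit is a single byte, one character may take between one and four units in UTF-8. A minimal sketch, assuming the mbstring extension is loaded and PHP 7.4 or later (for mb_str_split()):

```php
<?php
// "A", "é", "日" and U+1F600.
$text = "A\u{E9}\u{65E5}\u{1F600}";

// Split into characters, then measure each one in bytes.
$chars  = mb_str_split($text, 1, 'UTF-8');
$widths = array_map('strlen', $chars);

echo count($chars), "\n";         // 4
echo implode(',', $widths), "\n"; // 1,2,3,4
```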
Name in the IANA character set registry: UTF-7
Underlying character set: Unicode
Description:
A mail-safe transformation format of Unicode, specified in
RFC2152.
Additional note: none
Name in the IANA character set registry: (none)
Underlying character set: Unicode
Description:
A variant of UTF-7 which is specialized for use in the
IMAP protocol.
Additional note: none
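As an illustration of this variant, the sketch below encodes a mailbox name with mbstring's UTF7-IMAP encoding name. It assumes the mbstring extension is loaded; the folder name is just a sample.

```php
<?php
// "Entwürfe" (German for "Drafts"), a typical non-ASCII mailbox name.
$folder = "Entw\u{FC}rfe";

$encoded = mb_convert_encoding($folder, 'UTF7-IMAP', 'UTF-8');
echo $encoded, "\n"; // Entw&APw-rfe

// Decoding restores the original name.
$decoded = mb_convert_encoding($encoded, 'UTF-8', 'UTF7-IMAP');
var_dump($decoded === $folder); // bool(true)
```

Unlike plain UTF-7, this variant uses "&" rather than "+" as the shift character.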
Name in the IANA character set registry:
US-ASCII (preferred MIME name) / iso-ir-6 / ANSI_X3.4-1986 /
ISO_646.irv:1991 / ASCII / ISO646-US / us / IBM367 / CP367 / csASCII
Underlying character set: ASCII / ISO 646
Description:
American Standard Code for Information Interchange is a commonly used
7-bit encoding. It is also standardized internationally as ISO 646.
Additional note: (none)
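Whether a string stays within the 7-bit range can be checked with mb_check_encoding(), as in this small sketch. It assumes the mbstring extension is loaded; the sample strings are made up.

```php
<?php
$plain    = 'plain text';
$accented = "caf\u{E9}"; // contains U+00E9, outside the 7-bit range

var_dump(mb_check_encoding($plain, 'ASCII'));    // bool(true)
var_dump(mb_check_encoding($accented, 'ASCII')); // bool(false)

// Any valid ASCII string is also valid UTF-8, byte for byte.
var_dump(mb_check_encoding($plain, 'UTF-8'));    // bool(true)
```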
Name in the IANA character set registry:
EUC-JP (preferred MIME name) /
Extended_UNIX_Code_Packed_Format_for_Japanese / csEUCPkdFmtJapanese
Underlying character set:
Compound of US-ASCII / JIS X0201:1997 (hankaku kana part) /
JIS X0208:1990 / JIS X0212:1990
Description:
As the name, an abbreviation of Extended UNIX Code Packed Format for
Japanese, suggests, this encoding is mostly used on UNIX and UNIX-like
platforms. The original encoding scheme, Extended UNIX Code, is
designed on the basis of ISO 2022.
Additional note:
The character set referred to by EUC-JP is different to IBM932 / CP932,
which are used by OS/2