Unicode is an international character set standard that is capable of representing characters in all written languages. The Unicode standard specifies a numeric value (code point) and a name for each of its characters. The most frequently used characters have code point values that fit into a single 16-bit word in memory and on disk. (A word is the native unit of storage on a particular machine.) Characters whose code point values are larger than 0xFFFF require two consecutive 16-bit words. These characters are called supplementary characters, and each pair of consecutive 16-bit words is called a surrogate pair.
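The split into two 16-bit words follows a fixed arithmetic rule, which the following minimal sketch illustrates. The helper name surrogate_pair and the sample code point U+1F600 are chosen here for illustration and do not come from any module.

```perl
use strict;
use warnings;

# Hypothetical helper: split a supplementary code point (above 0xFFFF)
# into the two 16-bit words of its UTF-16 surrogate pair.
sub surrogate_pair {
    my ($cp) = @_;
    die "not a supplementary code point\n"
        if $cp <= 0xFFFF || $cp > 0x10FFFF;

    my $v    = $cp - 0x10000;             # leaves a 20-bit value
    my $high = 0xD800 + ($v >> 10);       # top 10 bits -> high surrogate
    my $low  = 0xDC00 + ($v & 0x3FF);     # bottom 10 bits -> low surrogate
    return ($high, $low);
}

# U+1F600 is used purely as a sample supplementary character.
my ($high, $low) = surrogate_pair(0x1F600);
printf "U+%04X -> 0x%04X 0x%04X\n", 0x1F600, $high, $low;
```

Each word of the pair falls in a range reserved for surrogates (0xD800 to 0xDBFF for the first word, 0xDC00 to 0xDFFF for the second), which is how software that understands UTF-16 can tell a pair apart from two ordinary characters.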
There are several Unicode encoding forms, for example, UTF-8, UTF-16, UTF-32, and UCS-2. An encoding form defines how a Unicode code point is stored as a sequence of bytes.
UCS-2 is a subset of UTF-16. UCS-2 is identical to UTF-16 except that UTF-16 also supports supplementary characters.
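The relationship between the two encoding forms can be checked from Perl with the core Encode module. This is only a sketch: the two sample characters (U+20AC and U+1F600) and the little-endian variants of the encodings are assumptions made for the illustration.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $bmp  = "\x{20AC}";     # sample character inside the Basic Multilingual Plane
my $supp = "\x{1F600}";    # sample supplementary character

my $bmp_utf16  = encode('UTF-16LE', $bmp);
my $bmp_ucs2   = encode('UCS-2LE',  $bmp);
my $supp_utf16 = encode('UTF-16LE', $supp);

# For a BMP character, both encoding forms produce the same bytes.
printf "BMP character, UTF-16 vs UCS-2: %s\n",
    $bmp_utf16 eq $bmp_ucs2 ? "identical" : "different";

# Count 16-bit words: two bytes per word.
printf "BMP character: %d 16-bit word(s)\n", length($bmp_utf16) / 2;
printf "Supplementary character: %d 16-bit word(s)\n", length($supp_utf16) / 2;

# Show the individual 16-bit words of the supplementary character.
printf "Words: %s\n",
    join ' ', map { sprintf '0x%04X', $_ } unpack('v*', $supp_utf16);
```

The supplementary character is deliberately not encoded as UCS-2 here, because UCS-2 has no representation for it.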
ODBC supports Unicode in the form of Unicode data types and Unicode versions of the ODBC API. The encoding form that ODBC expects for data used with Unicode API functions is UCS-2.
The Perl DBD::ODBC module, when built with Unicode support (perl Makefile.PL -u), is an ODBC application that uses the Unicode ODBC APIs and data types and passes UTF-16 encoded data to these APIs. As mentioned, for characters whose code point values are 0xFFFF or smaller (the Basic Multilingual Plane), UTF-16 is identical to UCS-2.
Because Perl DBD::ODBC uses UTF-16, it's possible to use the module to insert supplementary characters. However, this does not mean that the target database treats each one as a single Unicode character.
SQL Server 2008 and earlier don't treat a supplementary character as a single Unicode character. For example, the SQL Server length function, LEN, returns 2 for a supplementary character. SQL Server 2012 introduced a set of supplementary character collations (which have the suffix _SC) that support supplementary characters.
To illustrate this, the following Perl script uses the Collation and Unicode support