Some common Unicode problems and solutions demonstrated with Perl DBD::ODBC and Microsoft SQL Server

Introduction

In this article, I have tried to collect some useful information to help you use Unicode in Microsoft SQL Server from Perl DBD::ODBC. Many of the problems listed here come from questions on the dbi-user list, perlmonks or emails direct to me.

I have tried very hard to make all the examples work on Windows and UNIX, but note the following caveat:

When writing this article, I discovered a bug in DBD::ODBC (when inserting into CHAR or VARCHAR columns) that affects all versions of DBD::ODBC that support Unicode up until 1.46_1, when it was fixed. I've tried to highlight the issue in the following examples.

Terminology

Throughout this article, a "Unicode DBD::ODBC" means a DBD::ODBC built to use the Unicode ODBC API (the SQLxxxW functions) and a "non-Unicode DBD::ODBC" means one built to use the ANSI ODBC API (the SQLxxxA functions). A codepage is a single-byte character set such as Windows-1252.

Some DBD::ODBC background and history

DBD::ODBC has not always supported Unicode. It all started with a patch from Alexander Foken. Around version 1.14, Alexander's original patch was adapted to include optional Unicode support for UNIX.

For DBD::ODBC, building without Unicode support really means building DBD::ODBC as it worked before Unicode support was added, so as to maintain backwards compatibility.

The initial Unicode support was for Windows only and allowed you to send and retrieve NCHAR and NVARCHAR columns as SQL_WCHAR data.

Since then, Unicode support has grown to include Unicode parameters, SQL statements, column names, and connection strings.

For full documentation on Unicode in DBD::ODBC, refer to the Unicode section in the DBD::ODBC documentation.

ODBC and Unicode

The ODBC API has two main sets of APIs: the ANSI API (the SQLxxxA functions) and the Unicode API (the SQLxxxW functions). By default, DBD::ODBC uses the Unicode API on Windows and the ANSI API on non-Windows platforms. There are good historical reasons for this beyond the scope of this article. If you want to use DBD::ODBC to work with Unicode data in SQL Server from non-Windows platforms, you need a DBD::ODBC built with Unicode support, an ODBC driver manager (such as unixODBC) that supports the Unicode API, and an ODBC driver that supports the Unicode API.

How can you tell what support for the Unicode ODBC API you have already got?
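
From Perl, you can at least check the DBD::ODBC part using its documented odbc_has_unicode attribute, which reports whether DBD::ODBC itself was built with Unicode support. (It tells you nothing about the driver manager or the ODBC driver; for those, check their documentation or build options.) A minimal sketch:

# check_unicode.pl - was this DBD::ODBC built with Unicode support?
# assumes DBI_DSN points at a DBD::ODBC data source
use strict;
use warnings;
use DBI;

my $h = DBI->connect or die $DBI::errstr;
print "DBD::ODBC version: $DBD::ODBC::VERSION\n";
print "Unicode build: ", ($h->{odbc_has_unicode} ? "yes" : "no"), "\n";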

What happens if I try to use Unicode from Perl DBD::ODBC and a component in the chain does not support the Unicode APIs?

The simple answer is that you won't be able to insert, update, delete, or select Unicode data from Microsoft SQL Server:

Example 1: Simple INSERT and SELECT with a non-Unicode DBD::ODBC

# ex1.pl
use 5.008001;
use strict;
use warnings;
use DBI qw{:utils};

my $unicode = "\x{20ac}";       # Unicode euro symbol
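# with no arguments, DBI->connect uses the DBI_DSN, DBI_USER and
# DBI_PASS environment variables (all the examples here do this)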
my $h = DBI->connect or die $DBI::errstr;
$h->{RaiseError} = 1;

eval {$h->do(q/drop table unicode_test/)};
$h->do(q/create table unicode_test (a nvarchar(20))/);

my $s = $h->prepare(q/insert into unicode_test (a) values(?)/);
$s->execute($unicode);

my $r = $h->selectrow_arrayref(q/select a from unicode_test/);
my $data = $r->[0];
print "DBI describes data as: ", data_string_desc($data), "\n";
print "Data Length: ", length($data), "\n";
print "hex ords: ";
foreach my $c(split(//, $data)) {
    print sprintf("%x,", ord($c));
}
print "\n";

which outputs:

DBI describes data as: UTF8 off, non-ASCII, 3 characters 3 bytes
Data Length: 3
hex ords: e2,82,ac,

We attempted to insert a Unicode Euro symbol and when we selected it back, we got 3 characters and 3 bytes instead of 1 character and 3 bytes. This is confirmed by the fact that the Perl data contains the UTF-8 encoding for a Euro.

An explanation of what happened:

  1. The column was created as an NVARCHAR, so Microsoft SQL Server should be happy to accept Unicode characters for the column data.
  2. DBD::ODBC prepared the SQL and asked the ODBC driver to describe the parameter. DBD::ODBC was told it was a UNICODE VARCHAR of length 20 characters. However, it bound the parameter with a value type of SQL_C_CHAR and a parameter type of SQL_WCHAR, so the driver interpreted each byte of the UTF-8 encoded scalar as a character.
  3. When we read the data back, we got the bytes back as Perl had encoded the Euro internally (UTF-8).

You might be asking yourself at this point why DBD::ODBC bound the data as a value type of SQL_C_CHAR and the answer is backwards compatibility: that's what it did for a long time before support for the Unicode API was added.

So what if we force DBD::ODBC to bind the data as a SQL_WCHAR?

Example 2: INSERT and SELECT statements with a non-Unicode DBD::ODBC that force SQL_WCHAR

The code for this is nearly identical to the above except we add a bind_param call and import :sql_types from DBI.

use 5.008001;
use strict;
use warnings;
use DBI qw{:utils :sql_types};

my $unicode = "\x{20ac}";       # Unicode euro symbol
my $h = DBI->connect or die $DBI::errstr;
$h->{RaiseError} = 1;

eval {$h->do(q/drop table unicode_test/)};
$h->do(q/create table unicode_test (a nvarchar(20))/);

my $s = $h->prepare(q/insert into unicode_test (a) values(?)/);
$s->bind_param(1, undef, {TYPE => SQL_WVARCHAR});
$s->execute($unicode);

my $r = $h->selectrow_arrayref(q/select a from unicode_test/);
my $data = $r->[0];
print "DBI describes data as: ", data_string_desc($data), "\n";
print "Data Length: ", length($data), "\n";
print "hex ords: ";
foreach my $c(split(//, $data)) {
    print sprintf("%x,", ord($c));
}
print "\n";

and the output is:

DBI describes data as: UTF8 off, non-ASCII, 3 characters 3 bytes
Data Length: 3
hex ords: e2,82,ac,

Exactly the same as before. Why? The TYPE argument passed to bind_param sets the SQL type (the parameter type) and not the value type in a SQLBindParameter call.
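
If you want to see what DBD::ODBC is really passing to SQLBindParameter, turn on DBI tracing and look for the bind information in the log. A sketch (the trace level needed to see the bind details may vary between DBI/DBD::ODBC versions):

use DBI;

# write verbose trace output to a file; the SQLBindParameter value
# and parameter types show up at the higher trace levels
DBI->trace(8, "dbitrace.log");
my $h = DBI->connect or die $DBI::errstr;
# ... prepare, bind_param and execute as in the examples above,
# then inspect dbitrace.log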

Reading properly written Unicode in a non-Unicode DBD::ODBC

Now what if the Unicode Euro was inserted correctly by something else and we want to read it using a non-Unicode DBD::ODBC?

Example 3: Unicode from a non-Unicode DBD::ODBC

We have got a valid Unicode Euro symbol in the database, stored in an NVARCHAR column. (Do not worry about how for now, this is just showing what happens when the data in the database is correct, but you use the wrong method to get it.)

use 5.008001;
use strict;
use warnings;
use DBI qw{:utils};

my $unicode = "\x{20ac}";       # Unicode euro symbol
my $h = DBI->connect or die $DBI::errstr;
$h->{RaiseError} = 1;

my $r = $h->selectrow_arrayref(q/select a from unicode_test/);
my $data = $r->[0];
print "DBI describes data as: ", data_string_desc($data), "\n";
print "Data Length: ", length($data), "\n";
print "hex ords: ";
foreach my $c(split(//, $data)) {
    print sprintf("%x,", ord($c));
}
print "\n";

which outputs:

DBI describes data as: UTF8 off, non-ASCII, 1 characters 1 bytes
Data Length: 1
hex ords: 80,

Note that what you get back in data here very much depends on the ODBC driver and platform. On Windows, you would probably get the above because 0x80 is the Windows-1252 character for a Euro. (If it had been something not in Windows-1252, it would probably have returned a question mark.) With some Microsoft SQL Server ODBC drivers on UNIX, you could get a number of different results, depending on how the driver converts wide characters to the client character set.

The point of this illustration is that you do not really want to do any of the above unless you have no choice.

You might be saying to yourself: yes, but you can set a type in the bind_col method, so you can control how the data is returned to you. That's mostly not true for just about every Perl DBD I know of, and although DBD::ODBC lets you override the default type in a bind_col call, you can only do it for decimals and timestamps.
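
For example, something like this sketch overrides the bind type when fetching a decimal column (the table name is hypothetical, and the set of types DBD::ODBC will accept here is limited, so check the DBD::ODBC documentation for exactly which are supported):

use strict;
use warnings;
use DBI qw(:sql_types);

my $h = DBI->connect or die $DBI::errstr;
my $s = $h->prepare(q/select price from products/);  # hypothetical table
$s->execute;
# ask for column 1 to be bound with an overridden type; DBD::ODBC
# only honours this for a few types such as decimals and timestamps
$s->bind_col(1, \my $price, { TYPE => SQL_DOUBLE });
while ($s->fetch) {
    print "$price\n";
}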

Using VARCHAR columns instead of NVARCHAR columns for Unicode data

If you're using DBD::ODBC before 1.46_1, do not do this. There's a bug in DBD::ODBC before 1.46_1 that means it does not look at the Perl scalars you're binding for input and always binds them using the type the driver describes the column as (which will always be SQL_CHAR for a VARCHAR column).

Generally speaking, you should use NCHAR and NVARCHAR columns when you need to support multiple languages in the same column, although even that is not always necessary. For example, you can support English, German, and Italian in one Windows codepage. A better recommendation would be to use N columns for user provided data that's unconstrained and VARCHAR columns for data that is constrained and you control. For example, a number plate, serial number, and so on.

However, in the spirit of describing why, let's look at some examples. These examples assume we're now using a Unicode DBD::ODBC with an ODBC driver that supports the Unicode API.

So we return to our first simple example, but now run it with a Unicode DBD::ODBC, use a VARCHAR column and try two different bind types (the default and an overridden one):

Example 4: Simple INSERT and SELECT with a Unicode DBD::ODBC using VARCHAR

use 5.008001;
use strict;
use warnings;
use DBI qw{:utils :sql_types};

my $unicode = "\x{20ac}";       # unicode euro symbol
my $h = DBI->connect or die $DBI::errstr;
$h->{RaiseError} = 1;

eval {$h->do(q/drop table unicode_test/)};
$h->do(q/create table unicode_test (a varchar(20))/);

my $s = $h->prepare(q/insert into unicode_test (a) values(?)/);
$s->execute($unicode);   # first row: default binding (SQL_CHAR pre 1.46_1 - the bug)
$s->bind_param(1, undef, {TYPE => SQL_WVARCHAR});
$s->execute($unicode);   # second row: explicitly bound as SQL_WVARCHAR

my $r = $h->selectall_arrayref(q/select a from unicode_test/);
foreach my $row (@$r) {
    my $data = $row->[0];
    print "DBI describes data as: ", data_string_desc($data), "\n";
    print "Data Length: ", length($data), "\n";
    print "hex ords: ";
    foreach my $c(split(//, $data)) {
        print sprintf("%x,", ord($c));
    }
    print "\n";
}

which outputs:

DBI describes data as: UTF8 on, non-ASCII, 3 characters 6 bytes
Data Length: 3
hex ords: e2,201a,ac,
DBI describes data as: UTF8 on, non-ASCII, 1 characters 3 bytes
Data Length: 1
hex ords: 20ac,

Here again, you'll get different results depending on the platform and ODBC driver.

I imagine this is really going to make you wonder what on earth has happened here. Bear in mind that, in Perl, the Euro is internally encoded in UTF-8 as the bytes 0xe2, 0x82, and 0xac.

In the first INSERT, DBD::ODBC does what it always did and asked the database what the column type was. The database returned SQL_CHAR and the Euro was bound as a SQL_CHAR (the bug). In the second case, we overrode DBD::ODBC and told it to bind the data as a SQL_WVARCHAR.

When we retrieved the data, DBD::ODBC bound the column as SQL_WCHAR (which it always does in a Unicode build).

As far as Microsoft SQL Server is concerned, this is a VARCHAR column and we asked it to insert three characters (codes 0xe2, 0x82, and 0xac). This is what ends up in the database, as we can confirm by reading the column back as binary data (see below). However, where did the character with the code 0x201a come from? When DBD::ODBC read the data back, it bound the column as SQL_C_WCHAR and therefore asked SQL Server to convert the characters in the VARCHAR column to wide (UCS-2 or UTF-16) characters, and guess what? Character 0x82 in the Windows-1252 character set (which I was using when running this code) is a curved quote with the Unicode value U+201A. 0xe2 and 0xac in Windows-1252 map to the Unicode code points with the same values.
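
To confirm what was actually stored, you can read the column back as binary from Perl. A sketch, assuming the unicode_test table from the example above:

my $r = $h->selectall_arrayref(
    q/select convert(varbinary(40), a) from unicode_test/);
foreach my $row (@$r) {
    # each value comes back as a raw binary string; show it in hex
    print unpack("H*", $row->[0]), "\n";
}
# the first row should print e282ac - the three single-byte
# characters (0xe2, 0x82, 0xac) that SQL Server stored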

In the second row, we bound the data as SQL_WCHAR for the INSERT (and the database mapped those Unicode characters to the codepage for the table) and as SQL_WCHAR again for the SELECT. The ODBC driver maps the codepage characters retrieved back to Unicode, and we get back what we inserted (as long as all the characters we inserted fit into the current codepage). However, had we tried to insert a character not in the Windows-1252 codepage, SQL Server would have substituted that character with a ?. We should not have had to override the bind type here; that was the bug in DBD::ODBC pre 1.46_1.

This last point is really important. On Windows, the ODBC driver knows your codepage and can do this conversion, but on UNIX it is not uncommon to have to tell the ODBC driver which codepage you want to use. For example, with the Easysoft SQL Server ODBC driver, you can set Client_CSet = WINDOWS-1252, or you can add use_lcid=Yes to your connection and the ODBC driver will work out the correct codepage to use.
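
For example, you might pass the attribute in the connection string like this (a sketch; the DSN name and credentials are hypothetical, and Client_CSet/use_lcid are Easysoft SQL Server ODBC driver attributes, so check your own driver's documentation):

use DBI;

# hypothetical DSN and credentials
my $h = DBI->connect("dbi:ODBC:DSN=mysqlserver;Client_CSet=WINDOWS-1252",
                     "username", "password") or die $DBI::errstr;
# alternatively, use_lcid=Yes asks the driver to derive the codepage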

Here's a Windows-specific version of the above test with a few more enhancements:

#
# A simple demonstration of why you cannot use char and varchar columns
# in Microsoft SQL Server to store Unicode. char and varchar columns use a codepage
# and Unicode characters inserted into them are converted to the codepage.
# If a conversion does not exist in the codepage the characters which don't
# convert will become '?'
#
# Show the difference between binding as SQL_CHAR and SQL_WCHAR.
#
# Refer to http://msdn.microsoft.com/en-us/library/bb330962.aspx#intlftrql2005_topic2
#
use 5.008001;
use strict;
use warnings;
use DBI qw(:sql_types);
use Data::Dumper;
use Win32::API;

# chcp on my machines normally gives 850
# http://www.microsoft.com/resources/documentation/windows/xp/all/proddocs/en-us/chcp.mspx?mfr=true
#
# we want windows-1252 so run: chcp 1252 first

#use open qw( :encoding(Windows-1252) :std );
binmode STDOUT, ":encoding(cp1252)";

# get active codepage and ensure it is cp1252
# http://stackoverflow.com/questions/1259084/what-encoding-code-page-is-cmd-exe-using
Win32::API::More->Import("kernel32", "UINT GetConsoleOutputCP()");
my $cp = GetConsoleOutputCP();
print "Current active console code page: $cp\n";
if ($cp != 1252) {
    print "Change to codepage 1252 - run chcp 1252\n";
    die "Incompatible active codepage - change to codepage 1252 by running chcp 1252\n";
}

my $h = DBI->connect() or die $DBI::errstr;
$h->{RaiseError} = 1;
$h->{PrintError} = 1;

eval {$h->do(q/drop table varchar_test/)};
$h->do(q/create table varchar_test(a varchar(20) collate Latin1_General_CI_AS)/);

# note codepage 1252 is a single-byte codepage which includes the euro at 0x80
# windows-1252 does not contain U+0187 but it does contain
# the euro symbol (U+20ac), the curved quotes (U+201A),
# Latin Small Letter F with hook (U+192), dagger (U+2020)
# mixing code pages in SQL Server is not recommended
my $insert = $h->prepare(q/insert into varchar_test (a) values(?)/);
my $data = "\x{20ac}\x{201A}\x{192}\x{2020}\x{187}";
{
    use bytes;
    print "encoded length of our data is:", length($data), "\n";
    print "encoded data in hex is:";
    foreach my $b(split(//, $data)) {
        print sprintf("%x,", ord($b));
    }
    print "\n";
}
# this execute will discover the column is varchar and bind the perl scalar
# as SQL_CHAR meaning the UTF-8 encoded data in the perl scalar
# will be inserted as separate characters not all of which will even
# be translatable to the current codepage.
$insert->execute($data);
# Now we force DBD::ODBC to insert the parameter as SQL_WVARCHAR
$insert->bind_param(1, undef, {TYPE => SQL_WVARCHAR});
$insert->execute($data);

print "\nNotice in the first row (which was inserted as SQL_CHAR), the UTF-8 stored\n"
print "\nin the perl scalar is mostly stored as individual characters but then you will be\n"
print "\nwondering why few of the characters seem to come back as Unicode. Windows sees individual\n"
print "\ncharacters in the UTF-8 sequence as characters in the windows-1252 codepage and the UTF-8 sequence\n"
print "\ncontains some characters in windows-1252 which map back to Unicode chrs. e.g., the UTF-8 sequence for the\n"
print "\neuro is e2, 82, ac and windows see the 82 as the curved quotes in windows-1252 but when you ask for it back\n"
print "\nas wide/Unicode characters it maps it to U+201a (curved quotes Unicode point)\n";
print "\nNotice how in the second row the last character is a ?. That is because U+0187 does not exist in windows-1252\n"
print "\ncodepage our column is using\n";
my $r = $h->selectall_arrayref(q/select a from varchar_test/);
print Dumper($r);
foreach my $row (@$r) {
    print $row->[0], "\n";
}
$h->disconnect;

The correct way to do Unicode with DBD::ODBC and SQL Server

When retrieving rows on Windows or UNIX (and using a Unicode DBD::ODBC), your CHAR, VARCHAR, NCHAR, and NVARCHAR columns should all be correct. Even when you use CHAR and VARCHAR columns that use a codepage, SQL Server will convert any character in the codepage to a Unicode codepoint and DBD::ODBC will encode them as UTF-8 and mark them Unicode to Perl. This is because DBD::ODBC asks for the column data as SQL_WCHAR.

When inserting Unicode, DBD::ODBC will normally just do the right thing: use SQL_WCHAR for NCHAR and NVARCHAR columns. If your column is a CHAR or VARCHAR, then prior to 1.46_1 it may do the wrong thing by default. Until 1.46_1, DBD::ODBC ignored your Perl data and bound it as the type the ODBC driver reported for the parameter. In 1.46_1 and later, DBD::ODBC first checks the scalar you are binding and, if it is marked as Unicode (its utf8 flag is set), binds it as a wide type.
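
Putting that together, here is Example 1 again as it should be run: a sketch assuming a Unicode build of DBD::ODBC (odbc_has_unicode is 1) version 1.46_1 or later, where an NVARCHAR column needs no bind overrides at all:

# the "right" way: Unicode build, NVARCHAR column, no overrides
use 5.008001;
use strict;
use warnings;
use DBI qw(:utils);

my $unicode = "\x{20ac}";       # Unicode euro symbol
my $h = DBI->connect or die $DBI::errstr;
$h->{RaiseError} = 1;
die "this example needs a Unicode build of DBD::ODBC\n"
    unless $h->{odbc_has_unicode};

eval {$h->do(q/drop table unicode_test/)};
$h->do(q/create table unicode_test (a nvarchar(20))/);

my $s = $h->prepare(q/insert into unicode_test (a) values(?)/);
$s->execute($unicode);

my $r = $h->selectrow_arrayref(q/select a from unicode_test/);
# with a Unicode build you should see:
#   UTF8 on, non-ASCII, 1 characters 3 bytes
print data_string_desc($r->[0]), "\n";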

Surrogate pairs (Unicode code points above U+FFFF)

ODBC supports Unicode in the form of SQL_WCHAR types and Unicode versions of the ODBC API. The encoding form that ODBC expects for data used with Unicode API functions is UCS-2 (or at least that is how it looked when Windows and Microsoft SQL Server were using UCS-2).

In anticipation of things changing, when Unicode support was added to DBD::ODBC, the internal code actually converts Perl UTF-8 encoded strings into UTF-16, which for characters in the Basic Multilingual Plane is identical to UCS-2. As there were no Unicode ODBC drivers supporting supplementary characters (above U+FFFF) at the time, there was no way to prove this was a good decision. However, at worst, it meant code using Unicode outside the Basic Multilingual Plane would just insert what looked to the database like more characters.
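
To see why UTF-16 matters here, look at how a supplementary character is encoded. A short sketch using the core Encode module:

use strict;
use warnings;
use Encode;

my $c = "\x{2070E}";   # a supplementary character, outside the BMP
# UTF-16 encodes it as a surrogate pair: 0xD841, 0xDF0E
my $utf16 = Encode::encode("UTF-16BE", $c);
printf "%d bytes: %s\n", length($utf16), unpack("H*", $utf16);
# prints: 4 bytes: d841df0e
# a UCS-2-only server sees those two 16-bit units as two characters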

Older versions of SQL Server and surrogate pairs

As it turned out, the decision in DBD::ODBC to use UTF-16 was exactly what Microsoft initially did, and versions of Microsoft SQL Server such as 2000, 2005, and 2008 are described as surrogate-neutral rather than surrogate-aware. Microsoft had this to say at the time:

Both SQL Server 2000 and SQL Server 2005 can store surrogate pairs, even though UCS-2 is not aware of surrogates. SQL Server treats the surrogate pairs as two undefined Unicode characters rather than as a single character. Such applications are usually referred to as surrogate-neutral or surrogate-safe, meaning that there is no intrinsic ability to interact with the data, but at least the data can be stored without loss.

However, there are a few things you should be aware of when using these older, surrogate-neutral SQL Server versions: string functions such as LEN count a surrogate pair as two characters, the pair sorts and compares as two undefined characters, and functions such as SUBSTRING can split a pair in the middle.

Newer versions of SQL Server and surrogate pairs

Newer versions of SQL Server (2012 and later, version >= 11) support surrogate pairs, but you must set the collation to one ending in _SC. For example, Latin1_General_100_CI_AS_SC. When you do this, string functions will recognise surrogate pairs and all of the problems listed above for older versions of SQL Server are fixed.

Is my SQL Server surrogate-neutral or surrogate-aware?

Here's a small script you can use to test whether your SQL Server is surrogate-neutral or surrogate-aware:

# Test to see if your SQL Server is surrogate-aware or just surrogate-neutral
use 5.008001;
use strict;
use warnings;
use DBI qw(:utils);

my $h = DBI->connect() or die $DBI::errstr;
$h->{RaiseError} = 1;
$h->{PrintError} = 0;

eval {$h->do(q/drop table surrogate_pairs/)};

# It is possible to set the collation at instance or database level.
# Set it on the column to make sure that, initially, we are using a
# non supplementary character collation.
$h->do(q/create table surrogate_pairs (a nvarchar(20) collate Latin1_General_100_CI_AI)/);

my $insert = $h->prepare(q/insert into surrogate_pairs values(?)/);

# Insert test supplementary character
print "Inserting Unicode character U+2070E into db\n";
$insert->execute("\x{2070E}");

# now read it back and see what we get
print "\nNote when we select this character back it is still 1 Unicode character and 4 bytes and the ord is correct at\n"
print "\n0x2070E. This is because DBD::ODBC received a buffer of SQL_WCHAR chrs back from SQL Server which it then\n"
print "\ndecoded as UTF-16 which recognises the surrogate pair. This is why SQL Server using this collation (or older\n"
print "\nSQL Servers)\n"
print "\nare known as surrogate-neutral.\n";
my $r = $h->selectrow_arrayref(q/select a from surrogate_pairs/);
print data_string_desc($r->[0]), "\n";
print "ord(chr): ", sprintf("0x%x", ord($r->[0])), "\n";

# This is a non _SC collation, so the length function returns "2".
print "\nNote in the following that len(a) returns 2 not 1 as SQL Server has not recognised this as a surrogate pair.\n";
$r = $h->selectrow_arrayref(q/select len(a) from surrogate_pairs/);
print "length in database is: ", $r->[0], "\n";

# now try to alter the table to change the collation to Latin1_General_100_CI_AS_SC
# which only later SQL Servers (>= version 11, i.e., 2012) can do.
# Unfortunately older SQL Servers don't error if you try to change the collation
# to one it does not support so we cannot test by just trying to change to a
# surrogate aware collation.
$h->do(q/alter table surrogate_pairs alter column a nvarchar(20) collate Latin1_General_100_CI_AS_SC/);

$r = $h->selectrow_arrayref(q/SELECT SERVERPROPERTY('ProductVersion')/);
my $version = $r->[0];
print "\nYou SQL Server is version: $version\n\n";

if ((split(/\./, $version))[0] >= 11) {
    print "Your SQL Server is surrogate-aware\n";
    $r = $h->selectrow_arrayref(q/select a from surrogate_pairs/);
    print data_string_desc($r->[0]), "\n";
    print "ord(chr): ", sprintf("0x%x", ord($r->[0])), "\n";

    print "\nNote in the following that len(a) returns 1 as SQL Server in this collation recognises surrogate pairs\n";
    $r = $h->selectrow_arrayref(q/select len(a) from surrogate_pairs/);
    print "length in database is: ", $r->[0], "\n";
} else {
    print "You SQL Server is surrogate-neutral but not surrogate-aware\n";
}
$h->disconnect;