
New Options for Managing Character Sets in the Microsoft C/C++ Compiler


The Microsoft C/C++ compiler has evolved along with DOS, 16-bit Windows, and 32/64-bit Windows.  Its support for different character sets, code pages, and Unicode has also changed during this time.  This post will explain how our compiler has worked in the past and also cover some new switches provided by the C/C++ compiler in Visual Studio 2015 Update 2 CTP, specifically support for BOM-less UTF-8 files and controlling execution character sets.  Please download this and try it out.  For information on other compiler changes in Update 2, check out this post.

There are some great resources online that describe Unicode, DBCS, MBCS, code pages, and other things in great detail. I won’t try to reproduce that here and will cover the basic concepts quickly. The Unicode Consortium site is a great place to learn more about Unicode.

There are two main aspects to understanding how our compiler deals with different character sets. The first is how it interprets bytes in a source file (source character set) and the second is what bytes it writes into the binary (execution character set).  It is important to understand how the source code itself is encoded and stored on disk.

Explicit indication of Unicode encoding

There is a standard way to indicate Unicode files by using a BOM (byte-order mark). This BOM can indicate UTF-32, UTF-16, and UTF-8, as well as whether it is big-endian or little-endian. These are indicated by the sequence of bytes that results from the encoding of the U+FEFF character into whatever encoding is being used. UTF-8 is encoded as a stream of bytes, so there isn’t an actual “order” of the bytes that needs to be indicated, but the indicator for UTF-8 is still usually called a “BOM”.
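As a rough illustration of the byte sequences involved, here is a minimal C++ sketch that classifies a buffer by its leading BOM. The names Bom and detect_bom are made up for this example, and this is not a description of how the compiler itself is implemented.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Encodings that a leading byte-order mark can identify.
enum class Bom { None, Utf8, Utf16LE, Utf16BE, Utf32LE, Utf32BE };

// True when `data` begins with the `n` bytes in `mark`.
static bool starts_with(const std::string& data, const unsigned char* mark, std::size_t n) {
    if (data.size() < n) return false;
    for (std::size_t i = 0; i < n; ++i) {
        if (static_cast<unsigned char>(data[i]) != mark[i]) return false;
    }
    return true;
}

// Hypothetical helper: match the start of a buffer against the encoded
// forms of U+FEFF.
Bom detect_bom(const std::string& data) {
    static const unsigned char utf32le[] = {0xFF, 0xFE, 0x00, 0x00};
    static const unsigned char utf32be[] = {0x00, 0x00, 0xFE, 0xFF};
    static const unsigned char utf8[]    = {0xEF, 0xBB, 0xBF};
    static const unsigned char utf16le[] = {0xFF, 0xFE};
    static const unsigned char utf16be[] = {0xFE, 0xFF};

    // UTF-32 is tested first because the UTF-32LE mark begins with the
    // UTF-16LE mark.
    if (starts_with(data, utf32le, 4)) return Bom::Utf32LE;
    if (starts_with(data, utf32be, 4)) return Bom::Utf32BE;
    if (starts_with(data, utf8, 3))    return Bom::Utf8;
    if (starts_with(data, utf16le, 2)) return Bom::Utf16LE;
    if (starts_with(data, utf16be, 2)) return Bom::Utf16BE;
    return Bom::None;
}
```

One caveat of this simple ordering: a UTF-16LE file whose first character after the BOM is U+0000 would be reported as UTF-32LE.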

Implicit indication of encoding

In the early days of Windows (and DOS) before Unicode was supported, text files were stored with no indication of what encoding the file was using. It was up to the app as to how to interpret this. In DOS, any character outside of the ASCII range would be output using what was built in to the video card. In Windows, this became known as the OEM (437) code page. This included some non-English characters as well as some line-drawing characters useful for drawing boxes around text.

Windows eventually added support for DBCS (double-byte character sets) and MBCS (multi-byte character sets). There was still no standard way of indicating what the encoding of a text file was, and the bytes would usually be interpreted using whatever the current code page of the system was set to. When 32-bit Windows arrived, it had one set of APIs for UTF-16 and another set of so-called “ANSI” APIs. The ANSI APIs took 8-bit characters that were interpreted using the current code page of the system.

Note: in Windows you cannot set the code page to a Unicode code page (either UTF-16 or UTF-8), so in many cases there is no easy way to make an older app understand a Unicode encoded file that does not have a BOM.

It is also common nowadays to encode files in UTF-8 without using a BOM. This is the default in most Linux environments. Although many Linux tools can handle a BOM, most tools won’t generate one. Not having a BOM actually makes many things simpler such as concatenating files or appending to a file without having to worry about who is going to write the BOM.

How the Microsoft C/C++ compiler reads text from a file

At some point in the past, the Microsoft compiler was changed to use UTF-8 internally. So, as files are read from disk, they are converted into UTF-8 on the fly. If a file has a BOM, we use that and read the file using whatever encoding is specified, converting it to UTF-8. If the file does not have a BOM, we try to detect both little-endian and big-endian forms of UTF-16 encoding by looking at the first 8 bytes. If the file looks like UTF-16 we will treat it as if there was a UTF-16 BOM on the file.
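The exact test applied to those first 8 bytes isn't described here. One plausible heuristic, shown purely as an assumption, relies on source files normally starting with ASCII characters: in UTF-16LE every second byte is then zero, and in UTF-16BE every first byte is zero. The names Utf16Guess and guess_bomless_utf16 are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

enum class Utf16Guess { NotUtf16, LittleEndian, BigEndian };

// Assumed heuristic (not the compiler's documented algorithm): treat the
// first 8 bytes as four 16-bit units and check whether each one looks
// like a non-null ASCII character in one byte order or the other.
Utf16Guess guess_bomless_utf16(const std::string& data) {
    if (data.size() < 8) return Utf16Guess::NotUtf16;

    bool little = true;
    bool big = true;
    for (std::size_t i = 0; i < 8; i += 2) {
        const unsigned char first = static_cast<unsigned char>(data[i]);
        const unsigned char second = static_cast<unsigned char>(data[i + 1]);
        const bool first_is_ascii = first != 0 && first < 0x80;
        const bool second_is_ascii = second != 0 && second < 0x80;
        little = little && first_is_ascii && second == 0;  // e.g. 23 00
        big = big && first == 0 && second_is_ascii;        // e.g. 00 23
    }
    if (little) return Utf16Guess::LittleEndian;
    if (big) return Utf16Guess::BigEndian;
    return Utf16Guess::NotUtf16;
}
```

A file that starts with the eight single-byte characters of "#include" fails both tests and falls through to the code page path described next.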

If there is no BOM and it doesn’t look like UTF-16, then we use the current code page (result of a call to GetACP) to convert the bytes on disk into UTF-8. This may or may not be correct depending on how the file was actually encoded and what characters it contains. If the file is actually encoded as UTF-8, this will never be correct as the system code page can’t be set to CP_UTF8.

Execution Character Set

It is also important to understand the “execution character set”. Based on the execution character set, the compiler will interpret strings differently. Let’s look at a simple example to start.

const char ch = 'h';
const char u8ch = u8'h';
const wchar_t wch = L'h';
const char b[] = "h";
const char u8b[] = u8"h";
const wchar_t wb[] = L"h";

The code above will be interpreted as though you had typed this.

const char ch = 0x68;
const char u8ch = 0x68;
const wchar_t wch = 0x68;
const char b[] = {0x68, 0};
const char u8b[] = {0x68, 0};
const wchar_t wb [] = {0x68, 0};

This should make perfect sense and will be true regardless of the file encoding or current code page. Now, let’s take a look at the following code.

const char ch = '屰';
const char u8ch = u8'屰';
const wchar_t wch = L'屰';
const char b[] = "屰";
const char u8b[] = u8"屰";
const wchar_t wbuffer[] = L"屰";

Note: I picked this character at random, but it appears to be the Han character meaning “disobedient”, which seems appropriate for my purpose. It is the Unicode U+5C70 character.

We have several factors to consider in this. How is the file encoded that contains this code? And what is the current code page of the system we are compiling on? In UTF-16 the encoding is 0x5C70, in UTF-8 it is the sequence 0xE5, 0xB1, 0xB0. In the 936 code page, it is 0x8C, 0xDB. It is not representable in code page 1252 (Latin-1), which is what I’m currently running on. The 1252 code page is normally used on Windows in English and many other Western languages. Table 1 shows results for various file encodings when run on a system using code page 1252.
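To see where the UTF-8 sequence 0xE5, 0xB1, 0xB0 comes from, here is a small sketch of the standard UTF-8 encoding rules applied to a single code point. The function name encode_utf8 is just for illustration.

```cpp
#include <cassert>
#include <string>

// Encode one Unicode scalar value as UTF-8.
std::string encode_utf8(char32_t cp) {
    // Surrogates and values above U+10FFFF are not valid scalar values.
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));

    std::string out;
    if (cp < 0x80) {
        out += static_cast<char>(cp);                          // 0xxxxxxx
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));            // 110xxxxx
        out += static_cast<char>(0x80 | (cp & 0x3F));          // 10xxxxxx
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));           // 1110xxxx
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));   // 10xxxxxx
        out += static_cast<char>(0x80 | (cp & 0x3F));          // 10xxxxxx
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));           // 11110xxx
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));  // 10xxxxxx
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));   // 10xxxxxx
        out += static_cast<char>(0x80 | (cp & 0x3F));          // 10xxxxxx
    }
    return out;
}
```

Calling encode_utf8(0x5C70) yields the three bytes 0xE5, 0xB1, 0xB0, while encode_utf8(0x68) yields the single byte for 'h'.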

Table 1 – Example of results today when compiling code with various encodings.

| File Encoding | UTF-8 w/ BOM | UTF-16LE w/ or w/o BOM | UTF-8 w/o BOM | DBCS (936) |
| --- | --- | --- | --- | --- |
| Bytes in source file representing 屰 | 0xE5, 0xB1, 0xB0 | 0x70, 0x5C | 0xE5, 0xB1, 0xB0 | 0x8C, 0xDB |
| Source conversion | UTF-8 -> UTF-8 | UTF-16LE -> UTF-8 | 1252 -> UTF-8 | 1252 -> UTF-8 |
| Internal (UTF-8) representation | 0xE5, 0xB1, 0xB0 | 0xE5, 0xB1, 0xB0 | 0xC3, 0xA5, 0xC2, 0xB1, 0xC2, 0xB0 | 0xC5, 0x92, 0xC3, 0x9B |
| Conversion to execution character set | | | | |
| char ch = '屰'; (UTF-8 -> CP1252) | 0x3F* | 0x3F* | 0xB0 | 0xDB |
| char u8ch = u8'屰'; (UTF-8 -> UTF-8) | error C2015 | error C2015 | error C2015 | error C2015 |
| wchar_t wch = L'屰'; (UTF-8 -> UTF-16LE) | 0x5C70 | 0x5C70 | 0x00E5 | 0x0152 |
| char b[] = "屰"; (UTF-8 -> CP1252) | 0x3F, 0* | 0x3F, 0* | 0xE5, 0xB1, 0xB0, 0 | 0x8C, 0xDB, 0 |
| char u8b[] = u8"屰"; (UTF-8 -> UTF-8) | 0xE5, 0xB1, 0xB0, 0 | 0xE5, 0xB1, 0xB0, 0 | 0xC3, 0xA5, 0xC2, 0xB1, 0xC2, 0xB0, 0 | 0xC5, 0x92, 0xC3, 0x9B, 0 |
| wchar_t wb[] = L"屰"; (UTF-8 -> UTF-16LE) | 0x5C70, 0 | 0x5C70, 0 | 0x00E5, 0x00B1, 0x00B0, 0 | 0x0152, 0x00DB, 0 |

The asterisk (*) indicates that warning C4566 was generated for these. In these cases the warning is “character represented by universal-character-name '\u5C70' cannot be represented in the current code page (1252)”.
The error C2015 is “too many characters in constant”.

These results probably don’t make nearly as much sense as the simple case of the letter ‘h’, but I’ll walk through what is going on in each case.

In columns one and two, we know what the encoding of the file is and so the conversion to the internal representation of UTF-8 is correctly 0xE5, 0xB1, 0xB0. The execution character set is Windows code page 1252, however, and when we try to convert the Unicode character U+5C70 to that code page, it fails and uses the default replacement character of 0x3F (which is the question mark). We emit warning C4566 but use the converted character of 0x3F. For the u8 character literal, we are already in UTF-8 form and don’t need conversion, but we can’t store three bytes in one byte and so emit error C2015. For wide literals, the “wide execution character set” is always UTF-16 and so the wide character and wide string are converted correctly. For the u8 string literal, we are already in UTF-8 form internally and no conversion is done.
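The wide-literal rows of Table 1 can be reproduced with an ordinary UTF-8 to UTF-16 conversion of the internal bytes. The sketch below assumes well-formed input and uses a made-up function name, utf8_to_utf16; it illustrates the conversion itself rather than the compiler's own code.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Decode well-formed UTF-8 into UTF-16 code units. Malformed input is
// outside the scope of this sketch and trips an assert.
std::u16string utf8_to_utf16(const std::string& in) {
    std::u16string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        char32_t cp = 0;
        std::size_t len = 0;
        if (lead < 0x80)                { cp = lead;        len = 1; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; len = 2; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; len = 3; }
        else { assert((lead & 0xF8) == 0xF0); cp = lead & 0x07; len = 4; }

        assert(i + len <= in.size());
        for (std::size_t k = 1; k < len; ++k) {
            // Each continuation byte contributes its low six bits.
            cp = (cp << 6) | (static_cast<unsigned char>(in[i + k]) & 0x3F);
        }
        i += len;

        if (cp < 0x10000) {
            out += static_cast<char16_t>(cp);
        } else {
            // Code points above the BMP become a surrogate pair.
            cp -= 0x10000;
            out += static_cast<char16_t>(0xD800 + (cp >> 10));
            out += static_cast<char16_t>(0xDC00 + (cp & 0x3FF));
        }
    }
    return out;
}
```

With the correctly read bytes 0xE5, 0xB1, 0xB0 this produces the single unit 0x5C70. With the misread internal bytes from the BOM-less columns it produces 0x00E5, 0x00B1, 0x00B0 and 0x0152, 0x00DB, matching the table.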

In the third column (UTF-8 with no BOM), the on-disk bytes are 0xE5, 0xB1, and 0xB0. Each byte is interpreted as a character using the current code page of 1252 and converted to UTF-8, resulting in the internal sequence of three two-byte UTF-8 characters: (0xC3, 0xA5), (0xC2, 0xB1), and (0xC2, 0xB0). For the simple character assignment, the characters are converted back to code page 1252, giving 0xE5, 0xB1, 0xB0. This results in a multicharacter literal and the results are the same as when the compiler encounters 'abcd'. The value of a multicharacter literal is implementation defined and in VC it is an int where each byte is from one character. When assigning to a char, you get conversion and just see the low byte. For u8 character literals we generate error C2015 when using more than one byte. Note: The compiler’s treatment of multicharacter literals is very different for narrow chars and wide chars. For wide chars, we just take the first character of the multicharacter literal, which in this case is 0x00E5. In the narrow string literal, the sequence is converted back using the current code page and results in four bytes: 0xE5, 0xB1, 0xB0, 0. The u8 string literal uses the same character set as the internal representation and is 0xC3, 0xA5, 0xC2, 0xB1, 0xC2, 0xB0, 0. For a wide string literal, we use UTF-16 as the execution character set, which results in 0x00E5, 0x00B1, 0x00B0, 0.
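The narrow-character result can be modeled with plain integer arithmetic. The sketch below packs one byte per character into an integer, with the last character in the low byte, and then keeps only that low byte. This ordering is an assumption chosen to be consistent with the values in Table 1, and the function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Model of a multicharacter literal value: each character contributes
// one byte, and later characters end up in lower bytes.
std::uint32_t pack_multichar(const std::string& chars) {
    assert(!chars.empty() && chars.size() <= 4);
    std::uint32_t value = 0;
    for (std::size_t i = 0; i < chars.size(); ++i) {
        value = (value << 8) | static_cast<unsigned char>(chars[i]);
    }
    return value;
}

// Model of assigning that int to a char: only the low byte survives.
unsigned char low_byte(std::uint32_t value) {
    return static_cast<unsigned char>(value & 0xFF);
}
```

For the bytes 0xE5, 0xB1, 0xB0 the packed value is 0xE5B1B0 and the low byte is 0xB0, which is the value shown for char ch in the third column.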

Finally, in the fourth column we have the file saved using code page 936, where the character is stored on disk as 0x8C, 0xDB. We convert this using the current code page of 1252 and get two two-byte UTF-8 characters: (0xC5, 0x92), (0xC3, 0x9B). For the narrow char literal, the characters are converted back to 0x8C, 0xDB and the char gets the value of 0xDB. For the u8 char literal, the characters are not converted, but it is an error. For the wide char literal, the characters are converted to UTF-16 resulting in 0x0152, 0x00DB. The first value is used and 0x0152 is the value. For string literals, similar conversions are done.
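The misreading in the third and fourth columns can be sketched by treating every source byte as a code page 1252 character and re-encoding it as UTF-8. The mapping table below covers bytes 0x80 to 0x9F, where 1252 differs from Latin-1. Mapping the five undefined bytes (0x81, 0x8D, 0x8F, 0x90, 0x9D) to the control characters of the same value is an assumption of this sketch, and the function names are made up.

```cpp
#include <cassert>
#include <string>

// Code points for code page 1252 bytes 0x80..0x9F.
static const char16_t kCp1252High[32] = {
    0x20AC, 0x0081, 0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
    0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0x008D, 0x017D, 0x008F,
    0x0090, 0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
    0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0x009D, 0x017E, 0x0178};

// Outside 0x80..0x9F, a 1252 byte has the same value as its code point.
char32_t cp1252_to_code_point(unsigned char b) {
    if (b >= 0x80 && b <= 0x9F) return kCp1252High[b - 0x80];
    return b;
}

// Read every byte as a 1252 character and emit it as UTF-8. All 1252
// characters are below U+10000, so three bytes are always enough.
std::string cp1252_bytes_to_utf8(const std::string& bytes) {
    std::string out;
    for (char c : bytes) {
        const char32_t cp = cp1252_to_code_point(static_cast<unsigned char>(c));
        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```

Feeding it the BOM-less UTF-8 bytes 0xE5, 0xB1, 0xB0 gives 0xC3, 0xA5, 0xC2, 0xB1, 0xC2, 0xB0, and feeding it the code page 936 bytes 0x8C, 0xDB gives 0xC5, 0x92, 0xC3, 0x9B, the internal representations listed in Table 1.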

Changing the system code page

The results for the third and fourth columns will also be different if a different code page than 1252 is being used. From the descriptions above, you should be able to predict what will happen in those cases. Because of these differences, many developers will only build on systems that are set to code page 1252. For other code pages, you can get different results with no warnings or errors.

Compiler Directives

There are also two compiler directives that can impact this process. These are “#pragma setlocale” and “#pragma execution_character_set”.

The setlocale pragma is partially documented here: https://msdn.microsoft.com/en-us/library/3e22ty2t.aspx. This pragma attempts to allow a user to change the source character set for a file as it is being parsed. It appears to have been added to allow wide literals to be specified using non-Unicode files. However, there are bugs in this that effectively only allow it to be used with single-byte character sets. Suppose you add a #pragma setlocale to the above example like this:

#pragma setlocale(".936")
const char buffer[] = "屰";
const wchar_t wbuffer[] = L"屰";
const char ch = '屰';
const wchar_t wch = L'屰';

The results are in Table 2, with the differences highlighted in red (the wide character and wide string results in the third and fourth columns). All it did was make more cases fail to convert and result in the 0x3F (?) character. The pragma doesn’t actually change how the source file is read; instead, it is used only when wide character or wide string literals are being used. When a wide literal is seen, the compiler converts individual internal UTF-8 characters back to 1252, trying to “undo” the conversion that happened when the file was read. It then converts them from the raw form to the code page set by the “setlocale” pragma. However, in this particular case, the initial conversion to UTF-8 in column 3 and column 4 results in 3 or 2 UTF-8 characters respectively. For example, in column 4, the internal UTF-8 character of (0xC5, 0x92) is converted back to CP1252, resulting in the character 0x8C. The compiler then tries to convert that to CP936. However, 0x8C is just a lead byte, not a complete character, so the conversion fails, yielding 0x3F, the default replacement character. The conversion of the second character also fails, resulting in another 0x3F. So, column 3 ends up with three 0x3F characters for the wide string literal and column 4 has two 0x3F characters in the literal.

For a Unicode file with a BOM, the result is the same as before, which makes sense as the encoding of the file was strongly specified through the BOM.

Table 2 – Example of results today when compiling code with various encodings. Differences from Table 1 in red.

| File Encoding | UTF-8 w/ BOM | UTF-16LE w/ or w/o BOM | UTF-8 w/o BOM | DBCS (936) |
| --- | --- | --- | --- | --- |
| Bytes in source file representing 屰 | 0xE5, 0xB1, 0xB0 | 0x70, 0x5C | 0xE5, 0xB1, 0xB0 | 0x8C, 0xDB |
| Source conversion | UTF-8 -> UTF-8 | UTF-16LE -> UTF-8 | 1252 -> UTF-8 | 1252 -> UTF-8 |
| Internal (UTF-8) representation | 0xE5, 0xB1, 0xB0 | 0xE5, 0xB1, 0xB0 | 0xC3, 0xA5, 0xC2, 0xB1, 0xC2, 0xB0 | 0xC5, 0x92, 0xC3, 0x9B |
| Conversion to execution character set | | | | |
| char ch = '屰'; (UTF-8 -> CP1252) | 0x3F* | 0x3F* | 0xB0 | 0xDB |
| char u8ch = u8'屰'; (UTF-8 -> UTF-8) | error C2015 | error C2015 | error C2015 | error C2015 |
| wchar_t wch = L'屰'; (UTF-8 -> UTF-16LE) | 0x5C70 | 0x5C70 | 0x003F | 0x003F |
| char b[] = "屰"; (UTF-8 -> CP1252) | 0x3F, 0* | 0x3F, 0* | 0xE5, 0xB1, 0xB0, 0 | 0x8C, 0xDB, 0 |
| char u8b[] = u8"屰"; (UTF-8 -> UTF-8) | 0xE5, 0xB1, 0xB0, 0 | 0xE5, 0xB1, 0xB0, 0 | 0xC3, 0xA5, 0xC2, 0xB1, 0xC2, 0xB0, 0 | 0xC5, 0x92, 0xC3, 0x9B, 0 |
| wchar_t wb[] = L"屰"; (UTF-8 -> UTF-16LE) | 0x5C70, 0 | 0x5C70, 0 | 0x003F, 0x003F, 0x003F, 0 | 0x003F, 0x003F, 0 |

The other pragma that affects all of this is #pragma execution_character_set. It takes a target execution character set, but only one value is supported and that is “utf-8”. It was introduced to allow a user to specify a utf-8 execution character set and was implemented after VS2008 and VS2010 had shipped. This was done before the u8 literal prefix was supported and is really not needed any longer. At this point, we really encourage users to use the new prefixes instead of #pragma execution_character_set.

Summary of Current Issues

There are many problems with #pragma setlocale.

  1. It can’t be set to UTF-8, which is a major limitation.
  2. It only affects string and character literals.
  3. It doesn’t actually work correctly with DBCS character sets.

The execution_character_set pragma lets you encode narrow strings as UTF-8, but it doesn’t support any other character set. Additionally, the only way to set this globally is to use /FI (force include) of a header that contains this pragma.

Trying to compile code that contains non ASCII strings in a cross platform way is very hard to get right.

New Options in VS2015 Update 2

In order to address these issues, there are several new compiler command-line options that allow you to specify the source character set and execution character set. The /source-charset: option can take either an IANA character set name or a Windows code page identifier (prefixed with a dot).

/source-charset:<iana-name>|.NNNN

If an IANA name is passed, that is mapped to a Windows code page using IMultiLanguage2::GetCharsetInfo. The code page is used to convert all BOM-less files that the compiler encounters to its internal UTF-8 format. If UTF-8 is specified as the source character set then no translation is performed at all since the compiler uses UTF-8 internally. If the specified name is unknown or some other error occurs retrieving information on the code page, then an error is emitted. One limitation is not being able to use UTF-7, UTF-16, or any DBCS character set that uses more than two bytes to encode a character. Also, a code page that isn’t a superset of ASCII may be accepted by the compiler, but will likely cause many errors about unexpected characters.

The /source-charset option affects all files in the translation unit that are not automatically identified. (Remember that we automatically identify files with a BOM and also BOM-less UTF-16 files.) Therefore, it is not possible to have a UTF-8 encoded file and a DBCS encoded file in the same translation unit.

The /execution-charset:<iana-name>|.NNNN option uses the same lookup mechanism as /source-charset to get a code page. It controls how narrow character and string literals are generated.

There is also a /utf-8 option that is a synonym for setting “/source-charset:utf-8” and “/execution-charset:utf-8”.

Note that if any of these new options are used it is now an error to use #pragma setlocale or #pragma execution_character_set. Between the new options and use of explicit u8 literals, it should no longer be necessary to use these old pragmas, especially given the bugs. However, the existing pragmas will continue to work as before if the new options are not used.

Finally, there is a new /validate-charset option, which gets turned on automatically with any of the above options. It is possible to turn this off with /validate-charset-, although that is not recommended. Previously, we would do some validation of some charsets when converting to internal UTF-8 form. However, we would do no checking of UTF-8 source files and just read them directly, which could cause subtle problems later. This switch enables validation of UTF-8 files as well, regardless of whether there is a BOM or not.
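To give a feel for what validating UTF-8 involves, here is a minimal validator based on the standard well-formedness rules. It is a sketch with a made-up name, is_valid_utf8, and makes no claim about which checks the compiler actually performs.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Return true when `s` is well-formed UTF-8: correct lead and
// continuation bytes, no truncated or overlong sequences, no
// surrogates, and nothing above U+10FFFF.
bool is_valid_utf8(const std::string& s) {
    std::size_t i = 0;
    const std::size_t n = s.size();
    while (i < n) {
        const unsigned char lead = static_cast<unsigned char>(s[i]);
        std::size_t len = 0;
        char32_t cp = 0;
        char32_t min = 0;  // smallest code point allowed for this length
        if (lead < 0x80)                { ++i; continue; }
        else if ((lead & 0xE0) == 0xC0) { len = 2; cp = lead & 0x1F; min = 0x80; }
        else if ((lead & 0xF0) == 0xE0) { len = 3; cp = lead & 0x0F; min = 0x800; }
        else if ((lead & 0xF8) == 0xF0) { len = 4; cp = lead & 0x07; min = 0x10000; }
        else return false;  // stray continuation byte or invalid lead byte

        if (i + len > n) return false;  // sequence is cut off
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char b = static_cast<unsigned char>(s[i + k]);
            if ((b & 0xC0) != 0x80) return false;  // not a continuation byte
            cp = (cp << 6) | (b & 0x3F);
        }
        if (cp < min) return false;                       // overlong encoding
        if (cp > 0x10FFFF) return false;                  // beyond Unicode
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;   // surrogate
        i += len;
    }
    return true;
}
```

The UTF-8 bytes 0xE5, 0xB1, 0xB0 pass this check, while the code page 936 bytes 0x8C, 0xDB fail at the first byte, which is how a mislabeled DBCS file can be caught instead of being silently misread.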

Example Revisited

By correctly specifying the source-charset where needed, the results are now identical regardless of the encoding of the source file. Also, we can specify a specific execution character set that is independent of the source character set and results should be identical for a specific execution character set. In Table 3, you can see that we now get the exact same results regardless of the encoding of the source file. The data in green indicates a change from the original example in Table 1.

Table 4 shows the results of using an execution character set of UTF-8 and Table 5 uses GB2312 as the execution character set.

Table 3 – Example using correct source-charset for each source file (current code page 1252). Green shows differences from Table 1.

| File Encoding | UTF-8 w/ BOM | UTF-16LE w/ or w/o BOM | UTF-8 w/o BOM | DBCS (936) |
| --- | --- | --- | --- | --- |
| Bytes in source file representing 屰 | 0xE5, 0xB1, 0xB0 | 0x70, 0x5C | 0xE5, 0xB1, 0xB0 | 0x8C, 0xDB |
| Source conversion | UTF-8 -> UTF-8 | UTF-16LE -> UTF-8 | UTF-8 -> UTF-8 | CP936 -> UTF-8 |
| Internal (UTF-8) representation | 0xE5, 0xB1, 0xB0 | 0xE5, 0xB1, 0xB0 | 0xE5, 0xB1, 0xB0 | 0xE5, 0xB1, 0xB0 |
| Conversion to execution character set | | | | |
| char ch = '屰'; (UTF-8 -> CP1252) | 0x3F* | 0x3F* | 0x3F* | 0x3F* |
| char u8ch = u8'屰'; (UTF-8 -> UTF-8) | error C2015 | error C2015 | error C2015 | error C2015 |
| wchar_t wch = L'屰'; (UTF-8 -> UTF-16LE) | 0x5C70 | 0x5C70 | 0x5C70 | 0x5C70 |
| char b[] = "屰"; (UTF-8 -> CP1252) | 0x3F, 0* | 0x3F, 0* | 0x3F, 0* | 0x3F, 0* |
| char u8b[] = u8"屰"; (UTF-8 -> UTF-8) | 0xE5, 0xB1, 0xB0, 0 | 0xE5, 0xB1, 0xB0, 0 | 0xE5, 0xB1, 0xB0, 0 | 0xE5, 0xB1, 0xB0, 0 |
| wchar_t wb[] = L"屰"; (UTF-8 -> UTF-16LE) | 0x5C70, 0 | 0x5C70, 0 | 0x5C70, 0 | 0x5C70, 0 |

 

Table 4 – Using an execution character set of utf-8 (code page 65001) and the correct /source-charset for each file encoding

| File Encoding | UTF-8 w/ BOM | UTF-16LE w/ or w/o BOM | UTF-8 w/o BOM | DBCS (936) |
| --- | --- | --- | --- | --- |
| Bytes in source file representing 屰 | 0xE5, 0xB1, 0xB0 | 0x70, 0x5C | 0xE5, 0xB1, 0xB0 | 0x8C, 0xDB |
| Source conversion | UTF-8 -> UTF-8 | UTF-16LE -> UTF-8 | UTF-8 -> UTF-8 | CP936 -> UTF-8 |
| Internal (UTF-8) representation | 0xE5, 0xB1, 0xB0 | 0xE5, 0xB1, 0xB0 | 0xE5, 0xB1, 0xB0 | 0xE5, 0xB1, 0xB0 |
| Conversion to execution character set | | | | |
| char ch = '屰'; (UTF-8 -> UTF-8) | 0xB0 | 0xB0 | 0xB0 | 0xB0 |
| char u8ch = u8'屰'; (UTF-8 -> UTF-8) | error C2015 | error C2015 | error C2015 | error C2015 |
| wchar_t wch = L'屰'; (UTF-8 -> UTF-16LE) | 0x5C70 | 0x5C70 | 0x5C70 | 0x5C70 |
| char b[] = "屰"; (UTF-8 -> UTF-8) | 0xE5, 0xB1, 0xB0, 0 | 0xE5, 0xB1, 0xB0, 0 | 0xE5, 0xB1, 0xB0, 0 | 0xE5, 0xB1, 0xB0, 0 |
| char u8b[] = u8"屰"; (UTF-8 -> UTF-8) | 0xE5, 0xB1, 0xB0, 0 | 0xE5, 0xB1, 0xB0, 0 | 0xE5, 0xB1, 0xB0, 0 | 0xE5, 0xB1, 0xB0, 0 |
| wchar_t wb[] = L"屰"; (UTF-8 -> UTF-16LE) | 0x5C70, 0 | 0x5C70, 0 | 0x5C70, 0 | 0x5C70, 0 |

 

Table 5 – Using an execution character set of GB2312 (code page 936)

| File Encoding | UTF-8 w/ BOM | UTF-16LE w/ or w/o BOM | UTF-8 w/o BOM | DBCS (936) |
| --- | --- | --- | --- | --- |
| Bytes in source file representing 屰 | 0xE5, 0xB1, 0xB0 | 0x70, 0x5C | 0xE5, 0xB1, 0xB0 | 0x8C, 0xDB |
| Source conversion | UTF-8 -> UTF-8 | UTF-16LE -> UTF-8 | UTF-8 -> UTF-8 | CP936 -> UTF-8 |
| Internal (UTF-8) representation | 0xE5, 0xB1, 0xB0 | 0xE5, 0xB1, 0xB0 | 0xE5, 0xB1, 0xB0 | 0xE5, 0xB1, 0xB0 |
| Conversion to execution character set | | | | |
| char ch = '屰'; (UTF-8 -> CP936) | 0xDB | 0xDB | 0xDB | 0xDB |
| char u8ch = u8'屰'; (UTF-8 -> UTF-8) | error C2015 | error C2015 | error C2015 | error C2015 |
| wchar_t wch = L'屰'; (UTF-8 -> UTF-16LE) | 0x5C70 | 0x5C70 | 0x5C70 | 0x5C70 |
| char b[] = "屰"; (UTF-8 -> CP936) | 0x8C, 0xDB, 0 | 0x8C, 0xDB, 0 | 0x8C, 0xDB, 0 | 0x8C, 0xDB, 0 |
| char u8b[] = u8"屰"; (UTF-8 -> UTF-8) | 0xE5, 0xB1, 0xB0, 0 | 0xE5, 0xB1, 0xB0, 0 | 0xE5, 0xB1, 0xB0, 0 | 0xE5, 0xB1, 0xB0, 0 |
| wchar_t wb[] = L"屰"; (UTF-8 -> UTF-16LE) | 0x5C70, 0 | 0x5C70, 0 | 0x5C70, 0 | 0x5C70, 0 |

Do’s, Don’ts, and the Future

On Windows, save files as Unicode with a BOM when possible. This will avoid problems in many cases and most tools support reading files with a BOM.

In those cases where BOM-less UTF-8 files already exist or where changing to a BOM is a problem, use the /source-charset:utf-8 option to correctly read these files.

Don’t use /source-charset with something other than utf-8 unless no other option exists. Saving files as Unicode (even BOM-less UTF8) is better than using a DBCS encoding.

Use of /execution-charset or /utf-8 can help when targeting code between Linux and Windows as Linux commonly uses BOM-less UTF-8 files and a UTF-8 execution character set.

Don’t use #pragma execution_character_set. Instead, use u8 literals where needed.

Don’t use #pragma setlocale. Instead, save the file as Unicode, use explicit byte sequences, or use universal character names rather than using multiple character sets in the same file.
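As a small illustration of the last two recommendations, the sketch below writes the example character as a universal character name inside u8 and wide literals, so the source file stays pure ASCII and the result does not depend on how the file is saved. The helper names are hypothetical. The template copes with u8 literals being arrays of char (C++11 through C++17) or char8_t (C++20).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Copy the code units of a string literal, including the terminating
// null, into a vector of bytes.
template <typename CharT, std::size_t N>
std::vector<unsigned char> literal_bytes(const CharT (&lit)[N]) {
    std::vector<unsigned char> out;
    for (std::size_t i = 0; i < N; ++i) {
        out.push_back(static_cast<unsigned char>(lit[i]));
    }
    return out;
}

// U+5C70 spelled as a universal character name in a u8 string literal.
std::vector<unsigned char> disobedient_utf8() {
    return literal_bytes(u8"\u5C70");
}

// The same character as a wide character literal.
wchar_t disobedient_wide() {
    return L'\u5C70';
}
```

Here disobedient_utf8() returns the bytes 0xE5, 0xB1, 0xB0, 0 and disobedient_wide() returns 0x5C70, matching the u8 and wide rows of the first two columns in the tables above.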

Note: Many Windows and CRT APIs currently do not support UTF-8 encoded strings and neither the Windows code page nor CRT locale can be set to UTF-8. We are currently investigating how to improve our UTF-8 support at runtime. However, even with this limitation many applications on the Windows platform use UTF-8 encoding internally and convert to UTF-16 where necessary on Windows.

In a future major release of the compiler, we would like to change default handling of BOM-less files to assume UTF-8, but changing that in an update has the potential to cause too many silent breaking changes. Validation of UTF-8 files should catch almost all cases where that is an incorrect assumption, so my hope is that it will happen.

