Skip to content
Deep Dive: Why Does C/C++ Frequently Encounter Chinese Garbled Characters Under Windows?

Deep Dive: Why Does C/C++ Frequently Encounter Chinese Garbled Characters Under Windows?

1. Core Concept: The Misalignment of Characters and Bytes

To understand the root cause of encoding corruption, one must first distinguish between “Characters” and “Bytes.”

  • Character: A human-readable symbol (such as ‘A’ or ‘你’).
  • Byte: A binary value stored by a computer.

The legacy design of the C/C++ language (originating in the 1970s) led to a core misunderstanding: the char type essentially stores bytes rather than characters.

1
2
3
char s[] = "你好";
// Actual memory storage (under UTF-8): {0xe4, 0xbd, 0xa0, ..., 0x00}
// Not the intuitive: {'你', '好', '\0'}

When using printf or cout, they do not automatically handle complex encoding conversions, which directly leads to performance differences across platforms.

2. Cross-Platform Disparity: Heaven and Hell

  • Linux / macOS: Almost seamless. Compilers, source files, and terminal environments use UTF-8 by default across the entire chain, just like the internet standard. This is why you rarely encounter garbled characters when writing C++ on these systems.
  • Windows: For historical reasons, Windows chose UTF-16 as its kernel encoding. To maintain compatibility with legacy software, it introduced the “Code Page” mechanism, resulting in the current chaotic state.

3. The Windows “Code Page” Trap

The default encoding of the Windows terminal depends on the System Locale, rather than Unicode. Unless you manually enable the Beta option “Use Unicode UTF-8…”, the system forces different code pages based on the language:

Language EnvironmentCode PageEncoding StandardRemarks
Simplified ChineseCP936GBKEasily confused with UTF-8, causing corruption
Traditional ChineseCP950Big5Commonly used in Hong Kong and Taiwan
English (Default)CP437/1252OEM/ANSIDoes not support non-Western European characters
JP/KR/RUCP932/949/1251Shift-JIS, etc.Each acting independently

The root of garbled text usually lies in the inconsistency of encoding across these four stages:

  1. Source File Encoding: VS might save as GBK by default.
  2. Compiler Execution Character Set: How the compiler parses string constants.
  3. IO Streams / Standard Library: The behavior of std::cout depends on the locale.
  4. Terminal Display Encoding: Whether the console is set to GBK or UTF-8.

Visual Studio defaults to the system encoding (GBK). If the code is opened in someone else’s UTF-8 environment, or forced to compile with UTF-8 but run in a GBK terminal, corruption is inevitable. Although this can be mitigated via plugins or the /utf-8 compiler option, the coordination between the standard library (std::cout) and system APIs (SetConsoleOutputCP) remains cumbersome.

4. Best Practice: UTF-8 over UTF-16

The most robust solution for outputting multi-language text on Windows is: Process data internally using UTF-8 uniformly, and convert it to UTF-16 when outputting to call Win32 APIs.

The following is a high-performance implementation example combining C++23 features with the simdutf library:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
#include <windows.h>
#include <string_view>
#include <string>
// Assume the simdutf library has been included for high-performance conversion

void print_utf8(const std::u8string_view input) {
  if (input.empty()) return;

  // 1. Calculate the UTF-16 length required for conversion
  const auto output_length = simdutf::utf16_length_from_utf8(
      reinterpret_cast<const char *>(input.data()), input.size());

  if (output_length <= 0) return;

  // 2. Use C++23 resize_and_overwrite for in-place writing, avoiding redundant copies
  std::u16string output{};
  output.resize_and_overwrite(output_length, [&](char16_t *buffer, size_t) {
    return simdutf::convert_utf8_to_utf16(
        reinterpret_cast<const char *>(input.data()), input.size(), buffer);
  });

  // 3. Directly call the Win32 API to output wide characters, bypassing the cout locale trap
  if (const auto handle = GetStdHandle(STD_OUTPUT_HANDLE);
      handle && handle != INVALID_HANDLE_VALUE) {
    WriteConsoleW(handle, output.c_str(), static_cast<DWORD>(output.size()), nullptr, nullptr);
  }
}

// Usage Example
int main() {
    print_utf8(u8"Hello, 世界 ☺️");
    return 0;
}

It doesn’t end there—the even more desperate part: the standard library has essentially “walked away.”

If you naively browse the C++ standard documentation looking for ready-made conversion functions like Python’s .encode() or Java’s .getBytes(), congratulations, you’ve fallen into a pit.

  1. The History of “Slacking Off” in the Standard Library The C/C++ standard library’s handling of encoding issues can be described as “long-term absence, occasional twitching.” After finally introducing std::codecvt in C++11, it was so poorly designed and bug-ridden that C++17 simply deprecated it. This has created an absurd reality: to this day, there is not a single easy-to-use, non-deprecated, cross-platform encoding conversion function in the C++ standard library.

  2. Why Can’t You Write Your Own? “Isn’t it just bit-shifting operations? Can’t I write a conversion function myself?” Absolutely not. Unicode is extremely complex, involving BOM headers, Surrogate Pairs (for instance, the 😁 emoji requires two storage units), illegal sequence detection, and endianness issues. If you reinvent the wheel, there is a 99% probability you will create vulnerabilities or cause a crash when encountering rare characters.

  3. The Saviors: ICU and simdutf Since the stepmother (the standard library) won’t help, we must find the biological father (third-party libraries).

If you need to perform complex layout and multi-language processing similar to a web browser, use ICU (libicu). It is the industry-recognized “big brother”—heavy as a brick, but solid as a rock. If you simply want to convert UTF-8 to UTF-16 to feed into Windows APIs as described in this article, then simdutf is the best choice. It utilizes CPU “black magic” (SIMD instructions) for acceleration, making it so fast you can’t even see its taillights, and it is lightweight. So, stop rummaging through the trash heap of the standard library; the right path is to simply vcpkg install simdutf.

Last updated on