UTF-8 with C++
UTF-8 is a variable-width character encoding that represents each code point with one to four 8-bit (1-byte) code units. UTF-16 is also variable-width, using one or two 16-bit (2-byte) code units per code point. UTF-32 is fixed-width, using exactly one 32-bit (4-byte) code unit for each code point.
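To make the "one to four code units" rule concrete, here is a minimal sketch of a UTF-8 encoder for a single code point. The function name encode_utf8 is a hypothetical helper introduced only for illustration; it is not part of the standard library.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: encodes one Unicode scalar value as UTF-8.
std::string encode_utf8(char32_t cp) {
    // Surrogates and values above U+10FFFF are not valid scalar values.
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
    std::string out;
    if (cp <= 0x7F) {                 // 1 byte:  0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp <= 0x7FF) {         // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0xFFFF) {        // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                          // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

For example, encode_utf8(0x0417) (the Cyrillic letter З) yields the two bytes D0 97, while encode_utf8(U'Z') yields the single byte 5A.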
String representation
A string literal has the type const char[N]. The class string has a constructor that accepts a string literal. The size of a string literal includes the terminating null character, while the size of a string excludes it.
    string str = "Zdravo, Svete!";
    const char s[] = "Zdravo, Svete!";
    cout << str.size() << endl;             // prints 14
    cout << sizeof(s) / sizeof(*s) << endl; // prints 15
UTF-encoded strings can be declared as follows:

    const char s[] = u8"Здраво, Свете!";         // up to and including C++17
    const string str = u8"Здраво, Свете!";       // up to and including C++17
    const char8_t u8s[] = u8"Здраво, Свете!";    // C++20 and later
    const u8string u8str = u8"Здраво, Свете!";   // C++20 and later
    const char16_t u16s[] = u"Здраво, Свете!";
    const u16string u16str = u"Здраво, Свете!";
    const char32_t u32s[] = U"Здраво, Свете!";
    const u32string u32str = U"Здраво, Свете!";
s and u8s use two bytes for every Cyrillic letter, one byte each for the comma, the space, and the exclamation mark, and one byte for the terminating null character, for a total size of 26 bytes. The size of u16s is 30 bytes, since each of its 15 code units (14 characters plus the null terminator) takes two bytes. The size of u32s is 60 bytes, since each of its 15 code units takes four bytes.
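The byte counts above can be checked at compile time. The following is a sketch that assumes the source file is saved as UTF-8 and that char16_t and char32_t are 2 and 4 bytes wide, as on mainstream platforms. The constant names are illustrative only.

```cpp
#include <cstddef>

// Assumption: this source file is saved as UTF-8, so the compiler reads the
// Cyrillic text correctly.
constexpr std::size_t utf8_size  = sizeof(u8"Здраво, Свете!"); // 1-byte units
constexpr std::size_t utf16_size = sizeof(u"Здраво, Свете!");  // 2-byte units
constexpr std::size_t utf32_size = sizeof(U"Здраво, Свете!");  // 4-byte units

static_assert(utf8_size == 26, "11 letters * 2 + 3 ASCII characters + 1 null");
static_assert(utf16_size == 30, "15 code units * 2 bytes");
static_assert(utf32_size == 60, "15 code units * 4 bytes");

// A code point outside the Basic Multilingual Plane needs two UTF-16 code
// units (a surrogate pair), which is why UTF-16 is variable-width as well.
static_assert(sizeof(u"\U0001F600") == 3 * sizeof(char16_t), "pair + null");
static_assert(sizeof(U"\U0001F600") == 2 * sizeof(char32_t), "one unit + null");
```

The last two assertions show the difference between UTF-16 and UTF-32 for a character that does not fit into a single 16-bit code unit.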
Change in C++20
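C++20 introduces a breaking change for UTF-8 strings. A UTF-8 string literal now has the type const char8_t[N] instead of const char[N], so it can no longer be assigned to a const char*; a const char8_t* is required instead. Likewise, such a literal initializes a u8string rather than a string. This can be seen, for instance, with an overloaded function that accepts both string and u8string: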
    #include <iostream>
    #include <string>
    using namespace std;

    int len(string s) {
        cout << "len(string):" << endl;
        return static_cast<int>(s.size());
    }

    #if __cplusplus > 201703L
    // u8string exists only in C++20 and later.
    int len(u8string s) {
        cout << "len(u8string):" << endl;
        return static_cast<int>(s.size());
    }
    #endif

    int main() {
        // Calls len(string) up to C++17 and len(u8string) from C++20 on.
        cout << len(u8"Здраво, Свете!") << endl;
        return 0;
    }
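Output is affected as well. Up to C++17, a u8 string literal and a string initialized from one are ordinary char data and can be written to a stream directly. In C++20, the operator<< overloads that would insert a char8_t character or a const char8_t* string into a char-based stream such as ofstream are deleted, so the attempt does not compile. Implementations that predate this rule selected the const void* overload instead and wrote the pointer value rather than the content. No operator<< is provided for u8string either, so it cannot be written by the fstream classes directly.

    #include <fstream>
    #include <string>
    using namespace std;

    int main() {
        char s1[] = "Здраво, Свете!";
        string ss1 = "Здраво, Свете!";
    #if __cplusplus > 201703L
        char8_t s2[] = u8"Здраво, Свете!";
        u8string ss2 = u8"Здраво, Свете!";
    #else
        char s2[] = u8"Здраво, Свете!";
        string ss2 = u8"Здраво, Свете!";
    #endif

        ofstream ofs("text.txt");
        ofs << s1 << endl;
        ofs << ss1 << endl;
    #if __cplusplus <= 201703L
        // Up to C++17 both are plain char data and are written as text.
        ofs << s2 << endl;
        ofs << ss2 << endl;
    #else
        // C++20: operator<< is not usable for char8_t data, so the bytes
        // are written through a const char* view instead.
        ofs << reinterpret_cast<const char*>(s2) << endl;
        ofs << reinterpret_cast<const char*>(ss2.c_str()) << endl;
    #endif
        return 0;
    }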
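One way to get the content of a u8string into a char-based stream is to pass its bytes to the stream's write() member. The sketch below assumes a hypothetical helper named write_bytes; it accepts any basic_string with 1-byte code units, so the same code works with string before C++20 and with u8string from C++20 on. The demonstration function uses an in-memory stream in place of a file and assumes a UTF-8 source file.

```cpp
#include <cassert>
#include <ios>
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical helper: writes the raw bytes of a string whose code units are
// one byte wide (std::string, or std::u8string when char8_t is available).
template <class CharT>
std::ostream& write_bytes(std::ostream& os, const std::basic_string<CharT>& s) {
    static_assert(sizeof(CharT) == 1, "write_bytes expects 1-byte code units");
    // Any object may be read through const char*, so this cast is safe.
    return os.write(reinterpret_cast<const char*>(s.data()),
                    static_cast<std::streamsize>(s.size()));
}

// Demonstration with an in-memory stream standing in for a file.
std::string write_bytes_demo() {
    std::ostringstream out;
#if defined(__cpp_char8_t)
    const std::u8string text = u8"Здраво, Свете!"; // char8_t data (C++20)
#else
    const std::string text = u8"Здраво, Свете!";   // char data (C++11 to C++17)
#endif
    write_bytes(out, text);
    assert(out.good());
    return out.str();
}
```

Unlike the c_str() approach, write() takes an explicit length, so embedded null characters are written too.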
Conversion
Nothing in the string class enforces UTF-8 or any other encoding; a string is just a sequence of bytes. Thus, to switch between const char* and const char8_t* (and the corresponding string literals), reinterpret_cast may be used (see "char8_t backward compatibility remediation"):

    const char* s = reinterpret_cast<const char*>(u8"Здраво, Свете!");
    const char8_t* u8s = reinterpret_cast<const char8_t*>("Здраво, Свете!");

The second line yields valid UTF-8 only if the compiler encodes ordinary string literals as UTF-8.
For the string classes, a conversion from u8string to string and vice versa can be made through C strings. Because the constructors that take a C string stop at the first null character, this approach suits strings without embedded nulls:

    u8string u8str1 = u8"Здраво, Свете!";
    string s(reinterpret_cast<const char*>(u8str1.c_str()));
    cout << s << endl;
    u8string u8str2(reinterpret_cast<const char8_t*>(s.c_str()));
    cout << boolalpha << (u8str1 == u8str2) << endl; // prints true
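An alternative sketch that avoids reinterpret_cast copies the code units through the iterator-range constructors, which also preserves embedded null characters. The helper names u8_to_string, string_to_u8, and round_trip_keeps_nulls are hypothetical, and the block compiles to nothing when char8_t is not available.

```cpp
#include <cassert>
#include <string>

#if defined(__cpp_char8_t)  // char8_t and std::u8string are available

// Hypothetical helpers: copy the code units one by one, so no
// reinterpret_cast is needed and embedded null characters survive.
inline std::string u8_to_string(const std::u8string& s) {
    return std::string(s.begin(), s.end());
}

inline std::u8string string_to_u8(const std::string& s) {
    return std::u8string(s.begin(), s.end());
}

// Round trip of a string that contains an embedded null character.
inline bool round_trip_keeps_nulls() {
    const std::u8string original(u8"a\0b", 3);
    const std::string bytes = u8_to_string(original);
    assert(bytes.size() == 3);
    return string_to_u8(bytes) == original;
}

#endif
```

The copy converts each code unit by value, so the byte pattern is unchanged; no check is made that the resulting string actually contains valid UTF-8.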