Decoding a UTF-8 Hex String Into Its Unicode Code Point

Author

Santiago Torres

Published

April 5, 2025

UTF-8 Decoding of Unicode

This program decodes a UTF-8 byte sequence and returns its corresponding Unicode code point (e.g. the bytes 0xCE, 0xA9 decode to U+03A9). The goal of this program is to explain the concept of UTF-8 and the process of decoding a UTF-8 sequence into a Unicode character at the bit level.

Background

UTF-8 is an encoding standard used for electronic communication. It is most commonly used for encoding characters, using a variable-length encoding ranging from one to four bytes. It is an extension of the original ASCII standard and is backwards compatible with it. Another encoding aimed at extending ASCII, ISO/IEC 8859, was quickly overtaken by UTF-8 due to its fragmented design: ISO 8859 had different versions for specific regions, which made it impossible to mix several languages in one document.

Decoding Process

To begin decoding, we must convert our hex string to binary and group the bits into 8-bit chunks. This is not done in the code, but it is shown here to demonstrate the concept. For example, CEA9:

0xCE        0xA9
1100 1110   1010 1001
11001110    10101001

The first byte is very important: its leading bits determine how many bytes are in the sequence, which in turn tells us how many continuation bytes to expect. Here is a chart showing the number of bytes in the sequence as determined by the prefix of the first byte.

First byte    Bytes
0 xxxxxxx     1
110 xxxxx     2
1110 xxxx     3
11110 xxx     4

Here are the conditions in the code that check for this at the bit level:

#define UTF8_BYTE_ONE_MIN   0x00
#define UTF8_BYTE_ONE_MAX   0x7F
#define UTF8_BYTE_TWO_MIN   0xC0
#define UTF8_BYTE_TWO_MAX   0xDF
#define UTF8_BYTE_THREE_MIN 0xE0
#define UTF8_BYTE_THREE_MAX 0xEF
#define UTF8_BYTE_FOUR_MIN  0xF0
#define UTF8_BYTE_FOUR_MAX  0xF7

// value holds the first byte of the sequence
uint8_t returnValue;
if      (UTF8_BYTE_ONE_MIN <= value && value <= UTF8_BYTE_ONE_MAX)        returnValue = 1;
else if (UTF8_BYTE_TWO_MIN <= value && value <= UTF8_BYTE_TWO_MAX)        returnValue = 2;
else if (UTF8_BYTE_THREE_MIN <= value && value <= UTF8_BYTE_THREE_MAX)    returnValue = 3;
else if (UTF8_BYTE_FOUR_MIN <= value && value <= UTF8_BYTE_FOUR_MAX)      returnValue = 4;
else                                                                      returnValue = 0;
return returnValue;

In this case, the first byte is prefixed with 110:

110 - 01110    10 - 101001

Notice that the last byte is prefixed with 10. This is a continuation byte; every continuation byte is prefixed with 10. In the program, we use a condition to check whether the continuation bytes are valid, but first we must eliminate the framing bits from the first byte, which in this case are 110x xxxx. The following conditions are used to do this.

#define UTF8_BYTE_ONE_MIN   0x00
#define UTF8_BYTE_ONE_MAX   0x7F
#define UTF8_BYTE_TWO_MIN   0xC0
#define UTF8_BYTE_TWO_MAX   0xDF
#define UTF8_BYTE_THREE_MIN 0xE0
#define UTF8_BYTE_THREE_MAX 0xEF
#define UTF8_BYTE_FOUR_MIN  0xF0
#define UTF8_BYTE_FOUR_MAX  0xF7

#define UTF8_BYTE_TWO_MASK   0x1F  // keeps the low 5 data bits of 110xxxxx
#define UTF8_BYTE_THREE_MASK 0x0F  // keeps the low 4 data bits of 1110xxxx
#define UTF8_BYTE_FOUR_MASK  0x07  // keeps the low 3 data bits of 11110xxx

// value holds the first byte of the sequence
uint8_t returnValue;
if      (UTF8_BYTE_ONE_MIN <= value && value <= UTF8_BYTE_ONE_MAX)        returnValue = value;
else if (UTF8_BYTE_TWO_MIN <= value && value <= UTF8_BYTE_TWO_MAX)        returnValue = value & UTF8_BYTE_TWO_MASK;
else if (UTF8_BYTE_THREE_MIN <= value && value <= UTF8_BYTE_THREE_MAX)    returnValue = value & UTF8_BYTE_THREE_MASK;
else if (UTF8_BYTE_FOUR_MIN <= value && value <= UTF8_BYTE_FOUR_MAX)      returnValue = value & UTF8_BYTE_FOUR_MASK;
else                                                                      returnValue = 0;
return returnValue;

sum += bytes[0];    // this is done outside of the function
                    // once the framing bits are eliminated.

and the following to check if the continuation bytes are valid:

// apply a bitwise AND to only keep the 10 in 10xx xxxx
// and then return 0 or 1
#define UTF8_DATA_BIT_MASK         0xC0
#define UTF8_CONTINUATION_BIT_MASK 0x80
return ((value & UTF8_DATA_BIT_MASK) == UTF8_CONTINUATION_BIT_MASK);

Finally, we remove the framing bits from each continuation byte and fold its data bits into the sum, shifting the sum left by six bits before adding each byte; the result is our Unicode code point.

#define UTF8_FRAMING_BIT_MASK 0x3F
#define UTF8_SHIFT_LENGTH     0x06

// removing framing bits
bytes[i] = bytes[i] & UTF8_FRAMING_BIT_MASK;

// add to sum
sum = sum << UTF8_SHIFT_LENGTH; // apply a left-shift operation to easily add the bits to the sum
sum += bytes[i];                // add the bits to the sum

Concatenating the data bits gives us our final value:

01110 101001 = 0x3A9
U+03A9