Utf - 8 Encoding

Try First, Check Solution later

1. You should first read the question and watch the question video.
2. Think of a solution approach, then try and submit the question on editor tab.
3. We strongly advise you to watch the solution video for prescribed approach.

1. You are given an array of integers.
2. You are required to test whether the array represents a valid sequence of UTF-8 characters or
not.
3. A character in UTF-8 can be from 1 to 4 bytes long and follows some rules -
(i) For 1-byte long character, first bit will be 0 and rest represents its unicode code.
(ii) For n-bytes long character, first n-bits will be 1's, the n+1th bit is 0, followed by n-1 bytes
with most significant 2 bits being 10.

Note -> Only the least significant 8 bits of each element in array is used for data.
Note -> Check out the question video for details.
Input Format
A number N
arr1
arr2..
N numbers
Output Format
Check the sample output and question video.
Question Video
Constraints
1 <= n <= 10^9
0 < a[i] <= 255
Sample Input
3
197
130
1
Sample Output
true


  • Editorial

    The problem here deals with checking whether the particular sequence of data can form a valid UTF-8 encoded data or not.

    As per UTF-8 encoding, a character can be one to four bytes long and according to the length of the character rules are governing UTF-8 encoding:

    1. For a character of length 1 byte, the MSB should always begin with 0.

    2. For a character of length 2 bytes, the starting bits of the upper byte should be 110 and the lower byte should start with 10.

    3. For a character of length 2 bytes, the starting bits of the upper byte should be 1110 and the rest lower bytes should start with 10.

    4. For a character of length 4 bytes, the starting bits of the upper byte should be 11110 and the rest lower bytes should start with 10.

    So to summarize:

    1 Byte -> 0 _ _ _ _ _ _ _

    2 Bytes -> [1 1 0 _ _ _ _ _ ] [1 0 _ _ _ _ _ _ ]

    3 Bytes -> [1 1 1 0 _ _ _ _ ] [1 0 _ _ _ _ _ _ ] [1 0 _ _ _ _ _ _ ]

    4 Bytes -> [1 1 1 1 0 _ _ _ ] [1 0 _ _ _ _ _ _ ] [1 0 _ _ _ _ _ _ ] [1 0 _ _ _ _ _ _ ]

    Let us take an example to understand the problem better,

    Sequence -> 11101111, 10111100, 10010111, 11011111, 10000000, 11110010, 10010101, 10001111, 10111111

    As this sequence is of type 3 bytes -> 2 bytes -> 4 bytes and abides all rules of UTF-8 encoding henceforth it is a valid sequence.

    Let us take another example,

    Sequence -> 11101111, 10111100, 10010111, 1011111, 01000000, 11110010, 10010101, 10001111, 10111111

    As this sequence is of type 3 bytes ->{redundant} -> 1 byte -> 4 bytes so it fails for byte at index 3rd where we have a redundant byte starting from 10... so this is an invalid sequence.

    The idea behind this problem is to traverse through the entire array and check for every value that whether it lies in 1 byte, 2 bytes, 3 byte, or 4-byte criteria by checking its MSB bits with 0, 110, 1110, and 11110 and as per the result check further (n - 1) bytes for the type where MSB bits starts from 10, here n represent the byte type, hence n = 1, 2, 3 or 4. If at any instant there occurs an ambiguity then the sequence is not valid.

    Suppose we need to check whether the given val is of type 3 byte:

    If((val>>4) == 0b1110) => true then it is of type 3 byte

    Here we have shifted the bits of val 4 sides to the right which makes us consider the last 4 bits of the byte and now if these bits are 1110 then the criteria are met. We can apply this method to check for every type.

    Time Complexity: O(n)

    The time complexity for the function is linear as we are traversing on the entire array once.

    Space Complexity: O(1)

    The space complexity for the function is constant.

  • Asked in Companies
  • Related Topics






Video Solution

Code Solution

Run
 
Run
Id Name