BOM (Byte Order Mark CharSet Encoding)

First, we should understand the byte order mark (BOM). The name comes from the Unicode character U+FEFF, which was originally named "zero width no-break space". When present, the BOM acts like a magic number at the very start of a text stream. Its use is optional, but if it is used it must appear at the start of the stream. Unicode text can be encoded in 8-bit, 16-bit, or 32-bit code units. A program that receives text from some source needs to know in which byte order those multi-byte units were written. The byte sequence that represents the BOM differs for every Unicode encoding, and ordinary text is very unlikely to begin with one of these sequences. Placing an encoded BOM at the start of the text therefore identifies both the encoding scheme and the byte order of the stream.

In some contexts, such as URIs, characters are first encoded into octets and the octets are then represented as URI characters. This method has many restrictions, because some characters can create security issues. Text can be written in any language, and IANA maintains a registry that assigns names and numbers to the different character sets.
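To make the idea concrete, the short Python sketch below encodes the single character U+FEFF under several encoding schemes and prints the resulting bytes. The variable names (`BOM`, `bom_bytes`) are chosen here only for illustration; the codec names are the ones Python's standard library uses.

```python
import codecs

BOM = "\ufeff"  # U+FEFF, the character used as the byte order mark

# Encode the same character under several encoding schemes.
# The "-be" and "-le" codecs write no BOM of their own, so the
# bytes below are exactly the encoded form of U+FEFF.
bom_bytes = {
    "utf-8": BOM.encode("utf-8"),
    "utf-16-be": BOM.encode("utf-16-be"),
    "utf-16-le": BOM.encode("utf-16-le"),
    "utf-32-be": BOM.encode("utf-32-be"),
    "utf-32-le": BOM.encode("utf-32-le"),
}

for name, raw in bom_bytes.items():
    # bytes.hex(sep) needs Python 3.8 or later
    print(f"{name:10} {raw.hex(' ').upper()}")
```

Each encoding produces a different byte sequence for the same character, which is what lets a reader of the stream tell the encodings apart.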

In other words, a character encoding scheme converts text into binary form, because a computer always stores and reads data as binary. Each character is stored as one or more bytes according to a code. The encoding scheme defines the mapping between the characters of a character set and byte sequences, and the same mapping is used to decode the bytes back into characters. The code units can be 8, 16, or 32 bits wide. The Unicode character set is kept in step with the ISO/IEC 10646 standard. Today UTF-8 and UTF-16 are the most widely used encodings.

UTF stands for Unicode Transformation Format; each UTF defines how a single Unicode character (code point) is encoded. An encoding identifies characters, not their appearance: to display text, a font must supply ‘glyph’ definitions, that is, the shapes used to draw the characters. In UTF-8 the BOM does not indicate byte order; it serves only as a signal at the start of the text stream that the data is encoded in UTF-8.

The Unicode encoding schemes are UTF-8, UTF-16, and UTF-32. Of these, only UTF-8 has no endianness, because its code unit is a single byte.

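Common BOMs

| Encoding   | Representation in hexadecimal | Representation in decimal |
|------------|-------------------------------|---------------------------|
| UTF-8      | EF BB BF                      | 239 187 191               |
| UTF-16(BE) | FE FF                         | 254 255                   |
| UTF-16(LE) | FF FE                         | 255 254                   |
| UTF-32(BE) | 00 00 FE FF                   | 0 0 254 255               |
| UTF-32(LE) | FF FE 00 00                   | 255 254 0 0               |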
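A reader of a byte stream can compare its first bytes against the known BOM sequences. The following is a minimal sketch, not a complete encoding detector; the function name `detect_bom` and the table `_BOMS` are hypothetical names introduced for this example, while the `codecs.BOM_*` constants come from Python's standard library.

```python
import codecs

# UTF-32 signatures are tested before UTF-16 ones: the UTF-32 LE BOM
# (FF FE 00 00) begins with the UTF-16 LE BOM (FF FE), so testing
# UTF-16 LE first would hide it.
_BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]


def detect_bom(data: bytes):
    """Return (encoding_name, bom_length), or (None, 0) if no BOM is found."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0


encoding, skip = detect_bom(b"\xef\xbb\xbfhello")
if encoding is not None:
    print(encoding, b"\xef\xbb\xbfhello"[skip:].decode(encoding))
```

One known ambiguity remains: UTF-16 LE text that starts with a BOM followed by the character U+0000 has the same first four bytes as the UTF-32 LE BOM, so this sketch would report it as UTF-32 LE.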

UTF: Unicode Transformation Format is an algorithmic mapping that maps every Unicode code point to a unique byte sequence.

UTF-7: UTF-7 serves a similar purpose to UTF-8, but it represents text using only 7-bit units. It was designed for email, where the connection between client and server may carry only 7-bit data.

UTF-8: A variable-length encoding that uses one to four 8-bit units per character. Its BOM, EF BB BF in hexadecimal, does not indicate byte order; it is used only as a signal at the start of the text stream that the data is encoded in UTF-8.

UTF-16: A standard method for encoding Unicode character data. Unicode is designed to represent text in any language. UTF-16 encodes each character as a binary sequence of one or two 16-bit units.
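The "one or two 16-bit units" rule can be observed directly. In the sketch below, the helper `utf16_units` is a hypothetical name introduced for this example; it encodes one character as big-endian UTF-16 without a BOM and splits the result into 16-bit units.

```python
def utf16_units(ch: str) -> list:
    """Return the 16-bit code units that UTF-16 uses for one character."""
    raw = ch.encode("utf-16-be")  # big-endian, no BOM written
    return [int.from_bytes(raw[i:i + 2], "big") for i in range(0, len(raw), 2)]


# U+0041 fits in a single 16-bit unit.
one_unit = utf16_units("A")

# U+1F600 lies above U+FFFF, so it needs two units (a surrogate pair).
two_units = utf16_units("\U0001F600")

print([hex(u) for u in one_unit])   # ['0x41']
print([hex(u) for u in two_units])  # ['0xd83d', '0xde00']
```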

UTF-32: The 32-bit Unicode format, which encodes every code point in 32 bits. UTF-32, UTF-32(BE) and UTF-32(LE) are all fixed-length 32-bit (4-byte) Unicode character encodings.

-> Output byte streams of UTF-32 encoding may have 3 valid formats: Big-Endian without BOM, Big-Endian with BOM, and Little-Endian with BOM.

-> UTF-32BE encoding is the Big-Endian without BOM format of UTF-32 encoding.

-> UTF-32LE encoding is the Little-Endian without BOM format of UTF-32 encoding.
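As an illustration of these variants, the sketch below uses the codec names from Python's standard library. It assumes Python's behaviour, in which the plain "utf-32" codec writes a BOM in the machine's native byte order when encoding and uses a leading BOM to choose the byte order when decoding; other libraries may behave differently.

```python
import codecs

text = "A"  # U+0041

be = text.encode("utf-32-be")     # 00 00 00 41, no BOM
le = text.encode("utf-32-le")     # 41 00 00 00, no BOM
with_bom = text.encode("utf-32")  # BOM first, then native byte order

print(be.hex(" "), "|", le.hex(" "), "|", with_bom.hex(" "))

# With a BOM in front, the plain "utf-32" decoder accepts either byte order.
from_be = (codecs.BOM_UTF32_BE + be).decode("utf-32")
from_le = (codecs.BOM_UTF32_LE + le).decode("utf-32")
```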

Endianness: Endianness is the order in which the bytes of a multi-byte value are stored in memory.

BE: BE means big-endian. A big-endian machine stores the most significant byte at the lowest memory address.

LE: LE means little-endian. A little-endian machine stores the least significant byte at the lowest memory address.
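The two byte orders can be sketched with Python's `int.to_bytes`, using the 16-bit value 0xFEFF (the BOM code point) as an example. The first byte of each result corresponds to the lowest memory address.

```python
import sys

value = 0xFEFF  # most significant byte 0xFE, least significant byte 0xFF

big = value.to_bytes(2, "big")        # most significant byte first
little = value.to_bytes(2, "little")  # least significant byte first

print(big.hex(" "))     # fe ff
print(little.hex(" "))  # ff fe

# Byte order of the machine running this code: "little" or "big".
print(sys.byteorder)
```

Read back with the wrong byte order, the same two bytes give 0xFFFE instead of 0xFEFF, which is how a reader of UTF-16 text can notice that it has guessed the wrong endianness.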

Important points about BOM:

  1. A particular protocol, such as the Microsoft conventions for .txt files, may require the BOM on certain Unicode data streams, such as files.
  2. Some protocols allow optional BOMs in untagged text. In those cases,
    • Where a text data stream is known to be plain text, but of unknown encoding, the BOM can be used as a signature. If there is no BOM, the encoding could be anything.
    • Where a text data stream is known to be plain Unicode text, but its byte order is not known, the BOM can be used as a signature. If there is no BOM, the text should be interpreted as big-endian.
  3. Some byte-oriented protocols expect ASCII characters at the beginning of a file. If UTF-8 is used with these protocols, use of the BOM as encoding form signature should be avoided.
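
The last point can be illustrated with a short Python sketch. The shell script text is a made-up example; "utf-8-sig" is the standard-library codec that writes the UTF-8 BOM when encoding and removes it, if present, when decoding.

```python
script = "#!/bin/sh\necho hi\n"  # hypothetical file content

plain = script.encode("utf-8")       # no BOM
signed = script.encode("utf-8-sig")  # EF BB BF placed in front

# A tool that expects the ASCII characters "#!" as the first bytes
# no longer finds them once the BOM is present.
print(plain.startswith(b"#!"))   # True
print(signed.startswith(b"#!"))  # False

# Decoding as plain UTF-8 keeps the BOM as a U+FEFF character;
# decoding as "utf-8-sig" strips it and also accepts input without one.
kept = signed.decode("utf-8")
stripped = signed.decode("utf-8-sig")
```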
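
Encoding schemes and their properties

| Encoding | Variable/Fixed | Minimum bytes | Maximum bytes |
|----------|----------------|---------------|---------------|
| UTF-8    | variable       | 1             | 4             |
| UTF-16   | variable       | 2             | 4             |
| UTF-32   | fixed          | 4             | 4             |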

UTF-8 is a variable-length encoding in which a character takes a minimum of one byte and a maximum of four bytes. In UTF-16 a character takes a minimum of two bytes and a maximum of four bytes, while in UTF-32 the minimum and the maximum are both four bytes.
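These byte counts can be checked with a short Python sketch. The helper `encoded_lengths` and the sample characters are choices made for this example; the big-endian codecs are used so that no BOM is added to the counts.

```python
def encoded_lengths(ch: str) -> tuple:
    """Bytes needed for one character in UTF-8, UTF-16 and UTF-32."""
    return tuple(
        len(ch.encode(enc)) for enc in ("utf-8", "utf-16-be", "utf-32-be")
    )


samples = {
    "U+0041": "A",
    "U+00E9": "\u00e9",
    "U+20AC": "\u20ac",
    "U+1F600": "\U0001F600",
}

for label, ch in samples.items():
    print(label, encoded_lengths(ch))
# U+0041 (1, 2, 4)
# U+00E9 (2, 2, 4)
# U+20AC (3, 2, 4)
# U+1F600 (4, 4, 4)
```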