Character Sets

How are characters represented?

  • Computers only understand binary and therefore we need to represent characters using binary codes
  • For example, the letter ‘A’ might be represented as 01000001 in binary

Character sets

  • A character set is a list of all of the characters and their associated binary code
  • Character sets standardize the binary codes for each character
  • Without a character set, one system might interpret 01000001 differently from another
  • Two common character sets are:
    • American Standard Code for Information Interchange (ASCII)
    • UNICODE

ASCII

  • ASCII uses 7-bits to encode each character, providing for 128 distinct characters
  • For example, ‘A’ is represented as 65 in decimal, which is 1000001 in binary
  • ASCII was created to provide a common standard for encoding characters, which was necessary for compatibility among various types of hardware and software
  • An extended version of ASCII exists which encodes each character using 8-bits creating 256 characters

ASCII table

  • The ASCII table shows the relationship between characters that humans recognise and the denary values that represent them in the system
  • The denary values can then be converted to binary, representing the original character as binary

ascii-table ASCII Table

Limitations of ASCII

1. It has a limited number of characters

ASCII is limited to 128 characters, which include English alphabets, numerals, and some special and control characters.

A, B, C, ..., Z
a, b, c, ..., z
0, 1, ..., 9
!, @, #, ...

2. It is not suitable for multilingual text

ASCII cannot represent characters from languages other than English, limiting its applicability globally.

No representation for: 'α', 'ö', 'ñ',

3. There is no provision for modern symbols

ASCII does not include modern symbols or emoji’s common in today’s digital communication.

Unicode

  • UNICODE was created to be a solution to the limitations of ASCII
  • UNICODE uses a much larger bit range, up to 32-bits (depending on the encoding method), allowing for a wide variety of characters from different languages and scripts
    • Example: The Greek character Lambda ‘λ’ is represented as U+03BB
    • U+03BB breaks down to:
      • U+, meaning this is a Unicode character
      • 03BB, meaning character 03BB in the UNICODE set

Impact on storage

  • ASCII is more storage-efficient, with characters requiring only 7-bits
  • UNICODE characters can require up to 32-bits, thus potentially using more storage space

Comparison

ASCIIUNICODE
Encoding system7-Bits16-bits or 32-bits
Number of characters128 characters 65,536 characters (16-bit)
UsesUsed to represent characters in the English language.Used to represent characters across the world.
BenefitsIt uses a lot less storage space than UNICODE. It can represent more characters than ASCII. It can support all common characters across the world. It can represent special characters such as emoji’s.
DrawbacksIt can only represent 128 characters. It cannot store special characters such as emoji’s.It uses a lot more storage space than ASCII.