Character Sets

Computers only understand binary and therefore we need to represent characters using binary codes
For example, the letter ‘A’ might be represented as 01000001 in binary
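
The idea above can be checked with a minimal Python sketch. It assumes Python's built-in `ord` and `format` functions; the variable names are only illustrative.

```python
# Look up the numeric code for a character, then show that code in binary.
letter = "A"
code = ord(letter)            # numeric code for the character (65)
bits = format(code, "08b")    # the same value in binary, zero-padded to 8 bits

print(letter, code, bits)     # A 65 01000001
```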

A character set is a defined list of characters, each paired with its own binary code
Character sets standardize the binary codes for each character
Without a character set, one system might interpret 01000001 differently from another
Two common character sets are:
- American Standard Code for Information Interchange (ASCII)
- UNICODE
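
To illustrate why systems must agree on a character set, the sketch below encodes the same letter under two different character sets. EBCDIC is not covered in these notes; it is used here only as a contrasting example, and `cp037` is the name Python gives to one EBCDIC code page.

```python
# The same letter has a different binary code in a different character set.
ascii_code = "A".encode("ascii")[0]    # ASCII code for 'A'
ebcdic_code = "A".encode("cp037")[0]   # EBCDIC (code page 037) code for 'A'

print(format(ascii_code, "08b"))       # 01000001
print(format(ebcdic_code, "08b"))      # 11000001

# Reading the ASCII byte with the wrong character set does not give back 'A'.
misread = bytes([ascii_code]).decode("cp037")
print(misread == "A")                  # False
```

If two systems assumed different character sets, text sent from one would be misread by the other.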

ASCII

ASCII uses 7 bits to encode each character, providing 128 distinct characters
For example, ‘A’ is represented as 65 in denary (decimal), which is 1000001 in binary
ASCII was created to provide a common standard for encoding characters, which was necessary for compatibility among various types of hardware and software
An extended version of ASCII exists which encodes each character using 8 bits, allowing 256 characters
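
The character counts follow from the number of bits: n bits give 2^n distinct codes. A short Python sketch of that arithmetic (variable names are illustrative):

```python
# Each extra bit doubles the number of distinct codes available.
codes_7_bit = 2 ** 7    # standard ASCII
codes_8_bit = 2 ** 8    # extended ASCII
print(codes_7_bit, codes_8_bit)    # 128 256

# 'A' is 65 in denary, which fits in 7 bits.
a_seven_bit = format(ord("A"), "07b")
print(a_seven_bit)                 # 1000001
```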

The ASCII table shows the relationship between characters that humans recognise and the denary values that represent them in the system
The denary values can then be converted to binary, representing the original character as binary
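
One way to carry out the denary-to-binary step is repeated division by 2, collecting the remainders. The function below is a sketch of that method; the name `denary_to_binary` and the default width of 7 bits are assumptions made for this example.

```python
def denary_to_binary(value: int, width: int = 7) -> str:
    """Convert a non-negative denary value to binary by repeated division by 2."""
    digits = []
    while value > 0:
        digits.append(str(value % 2))  # remainder is the next bit, least significant first
        value //= 2
    # Reverse so the most significant bit comes first, then pad with leading zeros.
    return "".join(reversed(digits)).rjust(width, "0")


print(denary_to_binary(65))         # 1000001  ('A')
print(denary_to_binary(ord("a")))   # 1100001  ('a' is 97 in denary)
```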

Figure: ASCII table

Limitations of ASCII

1. It has a limited number of characters

ASCII is limited to 128 characters, which include the letters of the English alphabet, numerals, and some special and control characters.

A, B, C, ..., Z
a, b, c, ..., z
0, 1, ..., 9
!, @, #, ...
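
These groups can be listed using Python's standard `string` module, as a rough check that they all sit inside the 128-character range. The variable name `visible` is illustrative.

```python
import string

print(string.ascii_uppercase)   # ABCDEFGHIJKLMNOPQRSTUVWXYZ
print(string.ascii_lowercase)   # abcdefghijklmnopqrstuvwxyz
print(string.digits)            # 0123456789
print(string.punctuation)       # the special characters such as ! @ #

# Every one of these characters has a code below 128.
visible = string.ascii_letters + string.digits + string.punctuation
print(all(ord(ch) < 128 for ch in visible))   # True
print(len(visible))                           # 94
```

The remaining codes are the space character and 33 control characters, giving 128 in total.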

2. It is not suitable for multilingual text

ASCII cannot represent characters from languages other than English, limiting its applicability globally.

No representation for: 'α', 'ö', 'ñ'
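
This can be demonstrated by trying to encode the characters as ASCII. The sketch below relies on Python raising `UnicodeEncodeError` when a character has no code in the chosen character set; `has_ascii_code` is a helper name made up for this example.

```python
def has_ascii_code(ch: str) -> bool:
    """Return True if the character can be encoded using ASCII."""
    try:
        ch.encode("ascii")
        return True
    except UnicodeEncodeError:
        return False


for ch in ["A", "α", "ö", "ñ"]:
    print(ch, has_ascii_code(ch))
# A True
# α False
# ö False
# ñ False
```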

3. There is no provision for modern symbols

ASCII does not include modern symbols or the emoji that are common in today’s digital communication.
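
As an illustration, the code for an emoji is far outside the 0 to 127 range that ASCII covers. The emoji chosen here is just an example.

```python
emoji = "😀"
code_point = ord(emoji)

print(f"U+{code_point:04X}")   # U+1F600
print(code_point > 127)        # True: outside the ASCII range
print(code_point > 65535)      # True: too large for 16 bits as well
print(emoji.isascii())         # False
```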

UNICODE

UNICODE was created to overcome the limitations of ASCII
UNICODE uses a much larger bit range, up to 32 bits per character (depending on the encoding method), allowing for a wide variety of characters from different languages and scripts
- Example: The Greek character Lambda ‘λ’ is represented as U+03BB
- U+03BB breaks down to:
  - U+, meaning this is a Unicode character
  - 03BB, the character's number in the UNICODE set, written in hexadecimal (955 in denary)
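
The U+ notation can be reproduced in Python with the built-in `ord` and `chr` functions. This is a small sketch; the variable names are illustrative.

```python
ch = "λ"
code_point = ord(ch)              # the character's number in denary
label = f"U+{code_point:04X}"     # the same number in hexadecimal, at least 4 digits

print(code_point)                 # 955
print(label)                      # U+03BB
print(chr(0x03BB))                # λ
```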

ASCII is more storage-efficient, with each character requiring only 7 bits (usually stored in one 8-bit byte)
UNICODE characters can require up to 32 bits each, so the same text can take up more storage space
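
The difference in storage can be measured by encoding the same text in several ways. The sketch below uses Python's `ascii`, `utf-8`, `utf-16-le` and `utf-32-le` codecs; the `-le` variants are chosen as an assumption for this example so that no extra byte-order marker is added to the count.

```python
text = "Hello"

# Number of bytes needed to store the same five characters in each encoding.
sizes = {
    encoding: len(text.encode(encoding))
    for encoding in ("ascii", "utf-8", "utf-16-le", "utf-32-le")
}
print(sizes)
# {'ascii': 5, 'utf-8': 5, 'utf-16-le': 10, 'utf-32-le': 20}
```

For plain English text, a fixed 32-bit encoding uses four times the space of ASCII, while a variable-length encoding such as UTF-8 uses the same amount.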

| | ASCII | UNICODE |
|---|---|---|
| Encoding system | 7 bits | Up to 32 bits, depending on the encoding method (e.g. 16 bits or 32 bits) |
| Number of characters | 128 characters | 65,536 characters (16-bit); over a million with larger encodings |
| Uses | Used to represent characters in the English language. | Used to represent characters from languages across the world. |
| Benefits | It uses less storage space than UNICODE. | It can represent more characters than ASCII. It can support all common characters across the world. It can represent special characters such as emoji. |
| Drawbacks | It can only represent 128 characters. It cannot store special characters such as emoji. | It can use more storage space than ASCII. |