Character Sets
How are characters represented?
- Computers only understand binary, so every character must be represented using a binary code
- For example, the letter ‘A’ might be represented as 01000001 in binary
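The mapping between a character and its binary code can be sketched in Python. This is a minimal illustration only: the helper names `char_to_binary` and `binary_to_char` are made up for this example, and the 8-bit width is an assumption chosen to match the example above.

```python
def char_to_binary(ch, width=8):
    """Return the character's numeric code as a zero-padded binary string."""
    return format(ord(ch), f"0{width}b")


def binary_to_char(bits):
    """Return the character whose numeric code is the given binary string."""
    return chr(int(bits, 2))


print(char_to_binary("A"))         # 01000001
print(binary_to_char("01000001"))  # A
```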
Character sets
- A character set is a defined list of characters, each paired with its own unique binary code
- Character sets standardize the binary codes for each character
- Without a character set, one system might interpret 01000001 differently from another
- Two common character sets are:
- American Standard Code for Information Interchange (ASCII)
- UNICODE
ASCII
- ASCII uses 7 bits to encode each character, which allows for 128 distinct characters
- For example, ‘A’ is represented as 65 in denary, which is 1000001 in 7-bit binary (01000001 when padded to 8 bits)
- ASCII was created to provide a common standard for encoding characters, which was necessary for compatibility among various types of hardware and software
- An extended version of ASCII exists which encodes each character using 8 bits, allowing for 256 characters
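The character counts above follow from the number of bits: n bits give 2^n distinct codes. A short sketch of that arithmetic, where `distinct_codes` is a hypothetical helper name used only for illustration:

```python
def distinct_codes(bits):
    """Number of different binary patterns available with the given bit count."""
    return 2 ** bits


print(distinct_codes(7))        # 128 (standard ASCII)
print(distinct_codes(8))        # 256 (extended ASCII)
print(format(ord("A"), "07b"))  # 1000001 (65 written as 7-bit binary)
```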
ASCII table
- The ASCII table shows the relationship between characters that humans recognise and the denary values that represent them in the system
- The denary values can then be converted to binary, representing the original character as binary
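As a rough illustration, a small slice of the ASCII table can be generated rather than looked up. The function name `ascii_rows` and the choice of the range 65 to 70 are assumptions made for this sketch.

```python
def ascii_rows(start, end):
    """Return (character, denary, 7-bit binary) rows for codes start..end inclusive."""
    return [(chr(code), code, format(code, "07b")) for code in range(start, end + 1)]


# Print the rows for 'A' to 'F' as a simple three-column table
for char, denary, binary in ascii_rows(65, 70):
    print(f"{char}  {denary:>3}  {binary}")
```

Each row shows the same conversion described above: the character, its denary value, then that value written in binary.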
Limitations of ASCII
1. It has a limited number of characters
ASCII is limited to 128 characters, which include the upper-case and lower-case English alphabet, the digits 0 to 9, and some punctuation, special and control characters.
A, B, C, ..., Z
a, b, c, ..., z
0, 1, ..., 9
!, @, #, ...
2. It is not suitable for multilingual text
ASCII cannot represent characters from languages other than English, limiting its applicability globally.
No representation for: 'α', 'ö', 'ñ'
3. There is no provision for modern symbols
ASCII does not include modern symbols or the emoji that are common in today’s digital communication.
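These limitations can be seen by trying to encode text as ASCII. The sketch below uses Python's built-in `ascii` codec; `fits_in_ascii` is a hypothetical helper name and the sample strings are chosen only for illustration.

```python
def fits_in_ascii(text):
    """Return True if every character in text has an ASCII code."""
    try:
        text.encode("ascii")
        return True
    except UnicodeEncodeError:
        # Raised when a character has no code in the 128-character ASCII set
        return False


for sample in ["Hello", "α", "ö", "ñ", "😀"]:
    print(repr(sample), fits_in_ascii(sample))
```

Only "Hello" can be encoded; the accented letters, the Greek letter and the emoji all fall outside the ASCII range.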
Unicode
- UNICODE was created to overcome the limitations of ASCII
- UNICODE uses a much larger bit range, up to 32 bits per character (depending on the encoding method), allowing for a wide variety of characters from different languages and scripts
- Example: The Greek lower-case letter lambda ‘λ’ is represented as U+03BB
- U+03BB breaks down to:
- U+, meaning this is a Unicode character
- 03BB, meaning character 03BB (a hexadecimal value, 955 in denary) in the UNICODE set
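The U+ notation can be reproduced in Python, where `ord` returns a character's Unicode number and `chr` goes the other way. The helper name `code_point_label` is an assumption for this sketch.

```python
def code_point_label(ch):
    """Format a character's Unicode number in U+XXXX style (hexadecimal, at least 4 digits)."""
    return f"U+{ord(ch):04X}"


print(code_point_label("λ"))  # U+03BB
print(code_point_label("A"))  # U+0041
print(chr(0x03BB))            # λ
```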
Impact on storage
- ASCII is more storage-efficient, with each character requiring only 7 bits
- UNICODE characters can require up to 32 bits each, so the same text can potentially use more storage space
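The storage cost can be measured by encoding the same text in different ways and counting the bytes. This sketch assumes three common Unicode encodings available in Python (UTF-8, UTF-16 and UTF-32, little-endian variants so that no byte-order mark is added); `encoded_sizes` is a hypothetical helper name.

```python
def encoded_sizes(text):
    """Return the number of bytes the text occupies in each Unicode encoding."""
    return {enc: len(text.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le")}


for sample in ["A", "λ", "😀"]:
    print(repr(sample), encoded_sizes(sample))
```

In this sketch 'A' takes 1, 2 or 4 bytes depending on the encoding, which shows why the choice of encoding method matters for storage. Note that in practice an ASCII character is usually stored in a full 8-bit byte, even though only 7 bits are needed.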
Comparison
| | ASCII | UNICODE |
|---|---|---|
| Encoding system | 7 bits | 16 bits or 32 bits |
| Number of characters | 128 characters | 65,536 characters (16-bit); over 1 million possible characters with larger encodings |
| Uses | Used to represent characters in the English language. | Used to represent characters across the world. |
| Benefits | It uses a lot less storage space than UNICODE. | It can represent more characters than ASCII. It can support all common characters across the world. It can represent special characters such as emoji. |
| Drawbacks | It can only represent 128 characters. It cannot store special characters such as emoji. | It uses a lot more storage space than ASCII. |