This is about ASCII vs. Unicode vs. UTF-7 vs. UTF-8 vs. UTF-32 vs. ANSI.
Have you ever wondered about this HTML tag: <meta charset="UTF-8">?
With the help of this article, you will not only make sense of it, but also learn:
- What ASCII, Unicode, UTF-7, UTF-8, UTF-32, and ANSI are
- What’s the difference between them
- Lots more
Let’s dive right in!
What Is the Difference Between ASCII, Unicode, UTF-7, UTF-8, UTF-32, and ANSI?
In your travels, you’ve likely seen at least one of the terms ASCII, Unicode, UTF-7, UTF-8, UTF-16, UTF-32, and ANSI. However, what you may be less familiar with is the actual difference between these seven terms.
Navigating the battle of ASCII vs. Unicode vs. UTF-7 vs. UTF-8 vs. UTF-16 vs. UTF-32 vs. ANSI and what they’re best for or used for can be a complicated journey. But, fear not, it’s not impossible to understand.
To fully understand these terms, it’s good to start with a “picture this.”
Why Are We Even Here?
Picture a typewriter. You probably know how one looks. Now, picture a typewriter hammer.
If you’ve never seen a typewriter hammer, it’s a tiny rectangle with a character on the top and the bottom. The hammer often has lowercase and uppercase letters, but sometimes digits and symbols.
In the days of typewriters, the characters available to you were what you saw on the keyboard. This number of characters was the number of keys on the keyboard, plus the extra character on top of the type hammer.
You acquired extra characters by literally “shifting” the typebars so that a different part of the hammer struck the page, which gave you a new character. With computer keyboards, character sets can go beyond what a keyboard makes mechanically possible.
You know that modern computer keyboards accommodate far more than that. But, on early computers, your choice of characters wasn’t much better than a typewriter. The original American standard of characters only had 128 characters, and it was called ASCII.
From Morse Code to Typewriters: Meet ASCII
ASCII stands for American Standard Code for Information Interchange. It became available as an electronic communication standard in 1963. It’s foundational to nearly all other modern character sets and encoding standards, which expand upon ASCII’s original 128 characters.
Despite being more than half a century old, it remains a pillar of the encoding landscape. Web pages with no character set designated generally fall back to a locale-dependent legacy encoding, often Windows-1252, whose first 128 characters are ASCII.
ASCII’s development came out of telegraph code. Like Morse code, ASCII depends on different combinations of negative or positive indicators which, together, equal a given character.
Morse code uses dots and dashes and ASCII uses 1’s and 0’s, but the premise is the same.
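To make the 1’s and 0’s concrete, here is a minimal Python sketch that prints the 7-bit pattern behind a few ASCII characters. The helper name ascii_bits is made up for this illustration.

```python
def ascii_bits(ch: str) -> str:
    """Return the 7-bit pattern for a single ASCII character."""
    code = ord(ch)
    if code > 127:
        raise ValueError(f"{ch!r} is not an ASCII character")
    # "07b" means: binary digits, padded with zeros to a width of seven.
    return format(code, "07b")


for ch in "Hi!":
    print(ch, ord(ch), ascii_bits(ch))
```

Each character comes out as its own fixed pattern of seven bits, much like each Morse letter has its own pattern of dots and dashes.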
ASCII was, in a sense, a breakout invention. It made the list of IEEE (Institute of Electrical and Electronics Engineers) Milestones, along with feats such as high-definition television, Compact Disc (CD) players, and the birth of the internet. ASCII predates all of these milestones.
There also exists an extension to the basic ASCII table that adds another 128 characters to the original 128 characters. It’s called extended ASCII or EASCII.
Therefore, the basic and extended ASCII tables combined total 256 characters.
It’s a Big World After All: Unicode Means Universal
Unicode is an encoding standard. It expanded and built upon the original ASCII character set, whose 128 characters comprise the first characters of Unicode.
As of March 2020, Unicode covers a whopping 143,859 characters, including the original ASCII set and thousands more characters and glyphs belonging to English and other languages.
Even in its initial version, Unicode provided 7,163 total characters, roughly fifty-six times the number of characters in ASCII.
Computers advanced drastically from the early ‘60s to 1987, the year that saw the origin of Unicode. Developed by Xerox’s Joe Becker and Lee Collins and Apple’s Mark Davis, Unicode was intended to give computers multilingual and international capabilities.
Each subsequent edition added (or occasionally removed) various scripts. Scripts are collections of characters included in a character set, usually concerning different languages and alphabets, such as Greek or Han.
Unicode standards are implemented by either UTF-8, UTF-16, or UTF-32.
The Forgotten Unicode Format: UTF-7
UTF-7 stands for 7-bit Unicode Transformation Format. You likely don’t see it often as a standardized encoding format. It may also jump out at you that the other UTF encodings listed here carry numbers that are multiples of eight, while this one carries a seven.
It never gained traction and is now considered obsolete.
In the computer world, seven is a really weird number. Computers like powers of two and multiples of eight. The only odd number they truly like is one, but only as an indicator of positive (vs. negative) or true (vs. false). That comparison ends up being a set of two, anyway.
ASCII also ran on 7 bits, but this was formative and early in the days of encoding.
UTF-7 was intended as a less demanding alternative to UTF-8 for email, but its security weaknesses made it a poor choice. UTF-7 didn’t last very long and is often overlooked.
Striking Gold: UTF-8 Is Born
UTF-8 is as close to the holy grail of character encoding as you get, providing a large library of scripts while not overloading computers with unnecessary rendering. It’s the standard format on the World Wide Web.
In 1989, the International Organization for Standardization sought to craft a universal multi-byte character set. The early UTF-1 encoding temporarily accomplished this goal. However, due to issues such as the reuse of ASCII printing characters, it was quickly replaced by UTF-8.
UTF-8 was established in 1992 and has remained the standard encoding format since then. It works in 8-bit units (bytes) rather than ASCII’s 7 bits, and it can spend anywhere from one to four bytes on a character, which greatly increases the number of characters it can handle.
However, a 1-byte code in UTF-8 is the same as the corresponding code in the ASCII character set. This is because ASCII still forms the foundation of UTF-8 and is therefore included in its set.
Don’t Look Back: UTF-16 Leaves ASCII Behind It
In the late ‘80s, both the Unicode Consortium and the standardization subcommittee ISO/IEC JTC 1/SC 2 began work on a universal character set. Strangely (and confusingly) enough, this original “Universal Character Set” garnered the name “Unicode”, but this later changed to UCS-2. This may cause some confusion, but Unicode almost always means the standard.
The ISO/IEC JTC 1/SC 2 first draft of guidelines called ISO 10646 eventually became UTF-16 and UCS-2.
UTF-16 came out of the earlier UCS-2 encoding when it became evident that more than the 65,536 code points of a 16-bit design would be needed, a range that UTF-8 could already cover. However, UTF-16’s byte representation does not match ASCII’s, so it is not backward-compatible with it.
Although usable, this lack of compatibility with ASCII makes UTF-16 occasionally troublesome.
UTF-16 was designed as a variable-width encoding, which is sometimes a point of controversy. Variable width can slow down processing. Even though the encoding was intended to cover more characters, this came at the cost of efficiency.
Larger Than Life: UTF-32’s Enormous Library
If you recall UTF-16’s history from a few moments ago, you should remember something called ISO 10646. It was a draft of guidelines that eventually became a standard. Sometime after UTF-16, ISO 10646 defined an encoding form that ran on 32 bits.
They called this encoding form UCS-4.
UCS-4 had a massive range of code points. These code points started at 0 and went all the way to 0x7FFFFFFF. Eventually, RFC 3629 imposed restrictions on Unicode.
RFC 3629 reined UCS-4 in to match the constraints of UTF-16, which cannot address code points above U+10FFFF. This limited version of UCS-4 became what’s now known as UTF-32.
Following this, the ISO/IEC JTC 1/SC 2 stated in a procedural document that all future code points for both UCS-4 and UTF-32 would be constrained to the Unicode standard. Since UTF-32 can now represent all of UCS-4’s code points, they are now considered virtually identical.
ANSI: The Misnomer That Keeps on Giving
The American National Standards Institute, or ANSI, started as the American Standards Association. This association was responsible for the development of ASCII. In 1969, it changed its name to ANSI.
If you dig a little, you’ll find that ANSI is an organization and not a standard or a character encoding at all.
Or is it?
ANSI is, in fact, a character set. But, it’s also a misnomer. ANSI code’s true name is Windows-1252 (also written CP-1252), and it is not a standard that’s recognized by the American National Standards Institute.
Windows historically used the term “ANSI code page” for any non-DOS encoding. Windows-1252 keeps the original 128 ASCII characters and assigns further characters to most of the remaining 128 byte values. Like ASCII, it is a single-byte character encoding and is the most popular encoding of that type in the world.
As said before, ANSI is a misnomer but persists as a term to this day to refer to Windows-1252.
What Are Standards?
By now, you’ve noticed that this set of terms is divided into two categories: encodings and standards. While there is some crossover between the two, each term is primarily one or the other.
A standard, in the technical world, is an “established norm or requirement for a repeatable task.” Emphasis on “repeatable task.” In any technical setting, no event or activity is any good unless it’s repeatable. This is especially true in the computer world.
Of the listed terms, three are officially considered standards: ASCII, Unicode, and ANSI. Although the Unicode Standard governs the UTF encodings, those encodings are not standards themselves.
ASCII is a bit of a strange one as it is both an encoding and a standard. This is understandable, though, given how its origin predates many of the norms that institutions recognize today.
The Forerunner Standard: ASCII
As an initial standard, ASCII was first published in 1963. But computers in 1963 weren’t what computers are today. They weren’t even what they were in the ‘80s.
ASCII was designed to put control codes and graphic characters in separate groups. This means that control codes, such as carriage return or line feed, sit together at the start of the table, and printable characters such as letters or digits come after them.
The ASCII pattern is also called ASCIIbetical order, and its main point is that all uppercase letters precede all lowercase letters. That does not mean it goes “A, a, B, b”, but rather “A, B, C” all the way through “Z” and only then “a, b, c”.
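You can watch this ordering happen in Python, whose default string sort compares code point values. The word list below is made up for illustration.

```python
words = ["banana", "Apple", "cherry", "Zebra"]

# Default sort compares code points, so every uppercase letter (65-90)
# lands before every lowercase letter (97-122).
asciibetical = sorted(words)

# Folding case first gives the order a dictionary would use.
alphabetical = sorted(words, key=str.lower)

print(asciibetical)
print(alphabetical)
```

In the first list “Zebra” sorts ahead of “banana”, which is exactly the uppercase-before-lowercase behavior described above.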
Since its development, ASCII has gone through twelve revisions, the last one being in 1986.
The committee that originally developed ASCII belonged to the American Standards Association, which is now known as the American National Standards Institute, or ANSI.
The Universal Standard: Unicode
The largest difference between Unicode and ASCII is just that: its largeness.
A draft of guidelines by ISO/IEC JTC 1/SC 2 created in 1990 eventually became the Unicode Standard with additions and alterations over time, aiming to include as many characters as possible.
Unicode itself is not an encoding; it leaves that business up to UTF-8 and its friends. The standard itself provides code charts, as well as guidelines for normalization, rendering, etc.
UTF-7, UTF-8, UTF-16, and UTF-32 are all implementations of the Unicode standard. Although listed here, UTF-7 is not considered an official Unicode Standard encoding.
The Unicode standard possesses a codespace divided into seventeen planes. This codespace is the range of integers from 0 through 10FFFF (hexadecimal), and each integer in it is called a code point. Each plane contains a contiguous range of 65,536 of these values.
Unicode denotes its code with a U at the front (“U+”) followed by a code point in hexadecimal value. U+0000 and U+10FFFF are both examples of recognized denotation for Unicode.
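As a rough sketch of how the notation and the planes fit together, the Python below formats a character’s code point in U+ style and works out which plane it falls in. The helper names u_notation and plane_of are invented for this example.

```python
PLANE_SIZE = 0x10000        # 65,536 code points per plane
MAX_CODE_POINT = 0x10FFFF   # last code point in the codespace


def u_notation(ch: str) -> str:
    """Format a character's code point as U+XXXX (at least four hex digits)."""
    return f"U+{ord(ch):04X}"


def plane_of(ch: str) -> int:
    """Return the number of the plane (0 to 16) that holds the character."""
    return ord(ch) // PLANE_SIZE


plane_count = (MAX_CODE_POINT + 1) // PLANE_SIZE

# Letter A, euro sign, and an emoji, written as escapes to stay ASCII-safe.
for ch in ("A", "\u20ac", "\U0001F600"):
    print(u_notation(ch), "is in plane", plane_of(ch))

print("planes in the codespace:", plane_count)
```

Python’s own chr() refuses anything above 0x10FFFF with a ValueError, which matches the upper end of the codespace.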
Why Not UTF?
UTF-7, UTF-8, UTF-16, and UTF-32 are not standards and therefore cannot be explained as such. Doing so would be like describing a cookbook as the ingredients and tools for the meal.
A cookbook tells you what can be cooked and how to cook it, and in that sense, it is like a standard. Ingredients and tools allow you to implement that and create a meal. As necessary as directions can be, a meal is nothing without the actual food and objects needed to create it!
These four UTF formats are all referred to as encodings. This means they are the tools that turn a requested character into bytes the computer can store or send, and then turn those bytes back into viewable text on the screen.
The Unicode standard is implemented by encodings, of which UTF-8, UTF-16, and UTF-32 are the most popular.
What Is an Encoding?
First, try to get past the awkward phrasing of the word. Yes, it sounds like a verb, but it’s a noun, like a drawing or a painting. An encoding is a concrete way of representing a collection of characters as bytes.
Each character is assigned an integer, and an encoding such as UTF-8 turns that integer into a sequence of bytes so that the character can be stored, transmitted, and displayed again.
Within the encoding sphere, several terms define aspects of encoding.
- Character: The smallest possible unit of text. Could be the letter “G” or space or a return
- Character Set: Collection of characters. A set is not limited to one language; English and French, for example, are separate languages but use the same Latin characters
- Coded Character Set: Unlike character sets which have no numeric alignment, coded character sets assign an integer to each character
- Code Point: The integer assigned to a character within a coded character set, by which the character can be referred to
- Code Space: Range of integers comprised of code points
These terms outline the basis of encoding terms.
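To see how these terms relate, here is a small Python sketch that takes one character, looks up its code point, and then encodes that same code point three different ways. The choice of “é” as the sample character is arbitrary.

```python
ch = "\u00e9"  # the character e with an acute accent

# The code point is the integer the coded character set assigns.
code_point = ord(ch)

# An encoding turns that one integer into bytes; each does it differently.
encoded = {name: ch.encode(name) for name in ("utf-8", "utf-16-be", "utf-32-be")}

print("code point:", hex(code_point))
for name, data in encoded.items():
    print(name, data.hex(" "))
```

The code point never changes; only the bytes that carry it do.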
It’s Great to Be UTF-8
When data is processed, its size is counted in bits. A bit is the single unit of information you’ll see represented as either a “1” (true) or a “0” (false) in a binary number such as 0011 (the 4-bit binary number for the number three).
UTF-8 is an 8-bit encoding, unlike ASCII, which is 7-bit. For comparison, the number three above was written with 4 bits. Eight bits always make up a byte.
The reason ASCII is called 7-bit is that, when an ASCII character is stored in a byte, the leading bit is always zero, so only the other seven bits carry information.
UTF-8 works in 8-bit units, but both UTF-8 and ASCII can represent a character in one byte. UTF-8 can go up to four bytes per character, but it encodes every ASCII character as the same single byte that ASCII uses and is therefore backward-compatible.
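A quick way to check this backward compatibility is to encode the same text both ways and compare the bytes. The sample strings in this Python sketch are arbitrary.

```python
text = "Hello"

ascii_bytes = text.encode("ascii")
utf8_bytes = text.encode("utf-8")

# For pure ASCII text the two encodings produce identical bytes.
same_bytes = ascii_bytes == utf8_bytes

# Every ASCII byte has its leading (highest) bit set to zero.
leading_bit_clear = all(b < 0x80 for b in ascii_bytes)

# Beyond ASCII, UTF-8 spends between two and four bytes per character.
samples = ("A", "\u00e9", "\u20ac", "\U0001F600")
utf8_lengths = [len(ch.encode("utf-8")) for ch in samples]

print(same_bytes, leading_bit_clear, utf8_lengths)
```

The four sample characters (a Latin letter, an accented letter, the euro sign, and an emoji) take one, two, three, and four bytes respectively.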
And that’s what makes it gold.
UTF-8’s flexibility lets it handle characters that need several bytes while staying quick with single-byte characters, without weighing text down with unnecessary bytes.
UTF-16 Jumps the Gun
The thing that makes UTF-16 a bit of a stick in the mud is that it requires at least two bytes per character. ASCII can only do one byte. While UTF-8 can do up to four bytes, its minimum matches ASCII on that score: they can both do one byte.
Because UTF-16 reads data in units of two bytes, handing it ASCII text goes wrong.
Since ASCII stores one character per byte, UTF-16 consumes two ASCII characters at once and treats them as a single unit. You’re left with a code point that matches neither of the original characters, which gives you an entirely different character.
Other than size, this is the major difference between UTF-16 and UTF-8. UTF-8 reads ASCII bytes exactly as ASCII does, while UTF-16 jumps the gun, dismisses ASCII’s smallness, and cannot properly process its encoding.
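Here is a minimal Python sketch of that mismatch: two ASCII bytes are handed to a UTF-16 decoder (little-endian byte order is assumed for the example) and come back as one unrelated character.

```python
ascii_bytes = "Hi".encode("ascii")      # two bytes: 0x48 0x69

# UTF-16 consumes the two bytes together as a single 16-bit unit.
misread = ascii_bytes.decode("utf-16-le")

# Encoding the same text properly as UTF-16 pads each letter to two bytes.
proper_utf16 = "Hi".encode("utf-16-le")

print(len(misread), f"U+{ord(misread):04X}")
print(proper_utf16)
```

The two letters collapse into the single code point U+6948, a CJK ideograph, which is nothing like “Hi”.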
UTF-32: Great Big Bytes in the Sky
So, there’s UTF-8 which can have one to four bytes, there’s UTF-16 which needs at least two bytes, and then there’s UTF-32. UTF-32 uses exactly four bytes for every character. That’s big.
Picture this. You have a baseball you’re going to send to your cousin Kevin. Postage is based on the weight and size of the package. You decide to send the baseball in a box that could hold a basketball.
You pay twice as much as you should have because you allowed for a much larger item in the box which wasn’t necessary.
UTF-32 is like that. It takes more time to transport 32 bits and it takes more space to store it.
The benefit is that less calculation is needed to determine which character needs to be rendered.
Both UTF-8 and UTF-16 must first determine how many bytes a character occupies before they can read it, which takes time. UTF-32 only knows four bytes. More time is spent elsewhere, but less time on this calculation.
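The trade-off can be sketched in a few lines of Python. The sample text and the helper nth_char_utf32 are made up for illustration, and little-endian byte order is assumed.

```python
text = "a\u00e9\u20ac\U0001F600"  # four characters of increasing "size"

# Total bytes needed for the same four characters in each encoding.
sizes = {enc: len(text.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le")}


def nth_char_utf32(data: bytes, n: int) -> str:
    """Fetch character n from UTF-32 data by simple arithmetic on the offset."""
    return data[4 * n : 4 * n + 4].decode("utf-32-le")


utf32_data = text.encode("utf-32-le")
third = nth_char_utf32(utf32_data, 2)

print(sizes)
print(f"U+{ord(third):04X}")
```

UTF-32 uses the most bytes, but finding the third character is a single multiplication, with no need to walk through the earlier characters first.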
What Are Character Sets?
In general, a character set pairs with a character encoding in which each character is assigned a number. Since people no longer need to work directly with binary numbers, most character sets present that number in a form that’s easier to identify.
These numbers translate to binary numbers, which tell the computer what character you want. The numbers you see are generally hexadecimal, and often have a special denotation depending on which standard they adhere to.
As you’ve probably guessed by now, character sets are not just letters or numbers. They’re also more than punctuation and symbols. Essentially, anything you could tell your keyboard to do is a character. Space? Character. Return? Character. Emoji? Character. Delete? Character.
As said before, ASCII, Unicode, and ANSI are treated here as standards rather than as character sets alone. A standard determines how a character set is defined, but how the characters are actually represented as bytes is determined by an encoding.
ASCII as a Character Set
ASCII is the basic, foundational character set. It’s also the only term among the seven listed that serves as both a character set and a standard. But, now it’s time to focus on it as a character set.
ASCII codes 128 characters into 7-bit integers. The first 32 (0-31) characters are called “control codes” which existed to control physical hardware. Characters numbered from 32 to 127 comprise the printable characters of ASCII, except for the last character, which is Delete.
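The split between control codes and printable characters is easy to tally. The Python below is a small sketch, and the function name ascii_class is invented for it.

```python
from collections import Counter


def ascii_class(code: int) -> str:
    """Classify an ASCII code (0-127) as a control code or a printable character."""
    if not 0 <= code <= 127:
        raise ValueError("ASCII codes run from 0 to 127")
    # Codes 0-31 are control codes, and so is 127 (Delete).
    if code < 32 or code == 127:
        return "control"
    return "printable"


counts = Counter(ascii_class(code) for code in range(128))
print(counts)
```

That gives 33 control codes (0 through 31 plus Delete) and 95 printable characters, with the space at code 32 counted among the printable ones.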
The printable characters of ASCII consist of the space, the Latin uppercase and lowercase letters, the digits 0-9, and 32 punctuation marks and symbols. The characters were chosen in the United States by an American standards committee.
At that time, ASCII intended its characters solely for American or English use.
It does not include special script or phonetic characters such as À or œ, despite their presence in the Latin characters. This development would not occur until later.
Control Characters: Function Behind the Scenes
Control characters began as a way to manipulate the computer’s hardware. This occurred back in the time of ASCII’s invention in the early ‘60s when computers functioned more mechanically.
Some control characters you’re likely to recognize still, though perhaps by different names. “Carriage return” is a term that’s left over from typewriter days; it’s equivalent to pressing the “enter key”.
Other control characters have become more obscure or deeply concealed within the functions of a computer.
It’s important to understand that in early computers, control codes drove anything the hardware could do, such as signaling “end of medium” when the machine reached the end of a piece of paper. This isn’t something we input today, but it is still something the computer reads.
Something that ASCII, ANSI, and Unicode all have in common is this set of control codes. The first 32 characters, plus 127 for Delete, are control codes for each of these character sets. Or, in the case of Unicode, for the UTF character sets that Unicode oversees.
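One way to check this shared ground is to decode the same control bytes with several codecs and compare the results. In this Python sketch, the codec name cp1252 stands in for “ANSI” and utf-8 stands in for the Unicode encodings.

```python
# Bytes 0 through 31, plus 127 for Delete.
control_bytes = bytes(range(32)) + b"\x7f"

decoded = {enc: control_bytes.decode(enc) for enc in ("ascii", "cp1252", "utf-8")}

# If all three codecs agree, the set of distinct results has one member.
all_agree = len(set(decoded.values())) == 1

# Carriage return sits at code 13 in every one of them.
carriage_return_code = decoded["cp1252"].index("\r")

print(all_agree, carriage_return_code)
```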
Unicode and UTF Character Sets
As you recall, Unicode intends to encapsulate as many characters from as many languages as possible. Its first version from 1991 added to the original ASCII character set a library of 24 scripts. These included alphabets such as Hebrew, Arabic, and Hiragana.
Every one to two years following this first edition, Unicode adds a varying number of scripts to its repertoire. While these are usually languages or linguistic alphabets, sometimes a version adds specialty symbols, such as playing card symbols or emojis.
UTF-8, UTF-16, and UTF-32 handle the same character sets and libraries. They encode differently which alters their usage, but otherwise, they are identical in the characters they provide.
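A short round-trip test illustrates the point. The Python below pushes one mixed-script string through all three encodings; the sample characters are arbitrary and little-endian variants are assumed for UTF-16 and UTF-32.

```python
# Latin A, Greek capital omega, Hiragana a, and an emoji.
text = "A\u03a9\u3042\U0001F600"

encodings = ("utf-8", "utf-16-le", "utf-32-le")

# Encode and decode again with each encoding; the text should survive intact.
round_trips = {enc: text.encode(enc).decode(enc) == text for enc in encodings}

# The bytes differ between encodings even though the characters do not.
distinct_byte_forms = len({text.encode(enc) for enc in encodings})

print(round_trips, distinct_byte_forms)
```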
Although Unicode keeps ASCII’s original 128 characters, it also supplements (but does not replace) a handful of them with variant forms.
For example, Unicode provides ASCII’s original dollar sign ($) but also a full-width dollar sign (＄), which occupies a wider cell within a line of text.
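Full-width variants like these can be inspected with Python’s standard unicodedata module. The sketch below looks at the full-width dollar and cent signs; note that the ordinary cent sign itself sits at U+00A2, just outside ASCII’s 0 to 127 range.

```python
import unicodedata

# (ordinary form, full-width form), written as escapes for clarity.
pairs = [("$", "\uff04"), ("\u00a2", "\uffe0")]

results = []
for narrow, wide in pairs:
    name = unicodedata.name(wide)
    # NFKC normalization folds a full-width form back to its ordinary form.
    folds_back = unicodedata.normalize("NFKC", wide) == narrow
    results.append((name, f"U+{ord(wide):04X}", folds_back))

for row in results:
    print(row)
```

The full-width forms are separate code points in their own right, yet Unicode records their relationship to the ordinary characters.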
ANSI (or Windows-1252) as a Character Set
To reiterate: ANSI is a misnomer and the character set that can be referred to as “ANSI” often means Windows-1252 instead. But for now, it’ll be called ANSI to avoid confusion.
Like ASCII, ANSI is a character set built on the basic Latin letters. This includes the classic ASCII characters, such as the control codes, uppercase letters, lowercase letters, digits, and punctuation, but also accented letters such as é and ü.
ANSI also features additional currency symbols, such as the cent sign (¢), yen (¥), and the English pound (£). These are all, however, still considered a part of the Latin characters, like all of the ASCII characters.
Even though ANSI included twice the number of characters as ASCII, all of the characters are still Latin ones. An expansion into other characters, such as Japanese or Greek, wouldn’t occur until the release of Unicode.
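As a rough check of what fits, the Python sketch below tries to encode a few characters with the cp1252 codec (Python’s name for Windows-1252). The helper fits_cp1252 and the sample characters are made up for illustration.

```python
def fits_cp1252(ch: str) -> bool:
    """Return True if the character has a slot in Windows-1252."""
    try:
        ch.encode("cp1252")
    except UnicodeEncodeError:
        return False
    return True


samples = {
    "e with acute": "\u00e9",
    "pound sign": "\u00a3",
    "Greek omega": "\u03a9",
    "Hiragana a": "\u3042",
}

coverage = {label: fits_cp1252(ch) for label, ch in samples.items()}
print(coverage)

# A character that fits takes exactly one byte.
print(len("\u00e9".encode("cp1252")))
```

The Latin letter and the currency symbol fit in a single byte each, while the Greek and Japanese characters have no slot at all.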
Where Are They Now?
Other than UTF-7, all of the other encodings and standards listed here are still used to some extent. ASCII is considered the founding father of all modern encoding, serving both as a character set and a standard. None of the other terms hold this distinction.
Unicode serves exclusively as a standard. Standards themselves are not character sets or encodings, but rather they monitor and provide guidelines. Unicode oversees UTF-8, UTF-16, and UTF-32 as implementations of the standards it upholds.
The acronym ANSI stands for American National Standards Institute but is often a misnomer for the code page Windows-1252. It does not seem likely that this will be clarified any time soon.
UTF-8 remains the most widely used implementation of Unicode, differing from UTF-16 and UTF-32 only in the way it represents code points as bytes.
ASCII the Classic
Despite its age, ASCII is not considered obsolete and maintains its foundational status for Unicode and other encodings.
ASCII is still significant in the modern era. HTML incorporates shortcuts that allow you to input ASCII characters, such as a tilde (~) or other characters that are occasionally lost in web rendering.
This is accomplished by signaling the browser in HTML code with an ampersand (&) followed by a pound sign (#), the corresponding decimal number of the character you want, and a closing semicolon. A tilde, for example, is written as &#126;.
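Python’s standard html module can resolve these numeric references, which makes it handy for a quick sketch. The helper numeric_reference is invented here to build a reference from a character.

```python
import html


def numeric_reference(ch: str) -> str:
    """Build the decimal HTML reference for a single character."""
    return f"&#{ord(ch)};"


tilde_reference = numeric_reference("~")

# html.unescape performs the lookup a browser would do.
resolved = html.unescape(tilde_reference)

print(tilde_reference, resolved)
```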
The ASCII control codes do not bear any significance for HTML and cannot be accessed the same way the printable characters can. Although the control codes’ names seem antiquated and unused, many of them still perform actions within the computer.
The main difference is that users no longer need to activate these commands themselves; they are generally an automatic function of modern computers.
The State of ANSI (or Windows-1252)
For one thing, ANSI is still called ANSI far more often than Windows-1252, which is what it should be called.
It’s also considered Extended ASCII, which makes sense given that half of it is the same as ASCII. Unlike ASCII, however, it’s not standardized by the American National Standards Institute from which it gets its misnomer.
As of October 2020, 0.4% of websites worldwide stated they used Windows-1252. Web browsers treat the encoding ISO/IEC 8859-1, which is declared by 1.9% of all websites, as Windows-1252. This brings Windows-1252’s effective usage to 2.3%, more than UTF-16 or UTF-32.
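Strictly speaking, the two encodings agree on most bytes but part ways in the range 0x80 to 0x9F. A small Python sketch, using the codec names cp1252 and latin-1 (Python’s name for ISO 8859-1), shows both the overlap and the difference.

```python
byte = b"\x80"

# Windows-1252 puts the euro sign here; ISO 8859-1 has a control code.
as_cp1252 = byte.decode("cp1252")
as_latin1 = byte.decode("latin-1")

# From 0xA0 upward the two encodings decode every byte identically.
upper_range = bytes(range(0xA0, 0x100))
upper_range_matches = upper_range.decode("cp1252") == upper_range.decode("latin-1")

print(f"U+{ord(as_cp1252):04X}", f"U+{ord(as_latin1):04X}", upper_range_matches)
```

Treating ISO 8859-1 pages as Windows-1252 is therefore mostly harmless: printable characters appear where the older standard only had rarely used control codes.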
Despite being used by more than 30,000 websites, Windows-1252 has not received a version update since its fourth and final version debuted in Microsoft Windows 98.
Unlike ASCII or Unicode, ANSI is not universal across all operating systems. Microsoft created it for its Windows products, and as a result, that is primarily where it’s used.
The Unicode Standard Today and Little UTF-7
Unicode is no longer considered an encoding and is exclusively recognized as a standard or referenced as a shorthand for the Unicode Consortium. The Unicode Standard currently holds 143,859 characters in its repertoire.
The consortium released Unicode 13.0 in March of 2020 which added 5,930 characters and four scripts to its library. No other standard matches it in expansiveness and variety.
However, it has dealt with some issues, especially concerning Han unification, the large undertaking of properly including the full variety of Han characters.
A portion of this controversy lies with international issues among the East Asian countries that use Han characters.
UTF-7 remains obscure. Never officially acknowledged by the Unicode Consortium, it’s the term on this list that could be considered a failure. Unicode has no plans to resurrect or adapt it, as it never fulfilled its original purpose well in the first place.
UTF-8: King of the Web
UTF-8 has remained a mainstay since its development in 1992. As of 2020, around 96% of all web pages use UTF-8. It’s backward-compatible with ASCII, despite ASCII being 7-bit and UTF-8 being 8-bit.
If you right-click and select “view page source” on any given web page, you’re likely to find a designation for “UTF-8” as the character set. Despite UTF-7’s intention to handle emails, UTF-8 also serves as the encoding for almost all email services.
The versatility of UTF-8 also allows it to represent the same code points as UTF-16 and UTF-32. Their difference mostly lies in how each encoding lays out its data. UTF-8 remains the most efficient for the web, and even in some cases is preferred elsewhere.
Certain programming languages lean on UTF-8, such as PHP, a scripting language generally suited for web use that connects web pages to databases.
UTF-16: Selective Usage
Since UTF-16 is not backward-compatible with ASCII, it is inadvisable for web use. As a result, it hasn’t picked up anywhere near as much popularity on the web as UTF-8.
Because UTF-16 is a variable-width encoding, a decoder must work out whether each character occupies one 16-bit unit or two. This extra work can slow down processing. The web uses UTF-16 less than 0.01% of the time, compared to UTF-8’s web usage of 96%.
Despite these issues, especially when used on the web, UTF-16 is commonly used in Java and Windows. It’s rarely used on Unix-like systems such as Linux, and by extension on Apple and Android platforms.
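The variable width comes from what Unicode calls surrogate pairs: characters beyond the first 65,536 code points are split across two 16-bit units. The Python below is a minimal sketch with invented helper names, using big-endian byte order so the units read naturally.

```python
def utf16_units(ch: str) -> list[int]:
    """Return the 16-bit code units UTF-16 uses for one character."""
    data = ch.encode("utf-16-be")
    return [int.from_bytes(data[i : i + 2], "big") for i in range(0, len(data), 2)]


def is_high_surrogate(unit: int) -> bool:
    """A unit in this range announces that a second unit follows."""
    return 0xD800 <= unit <= 0xDBFF


letter_units = utf16_units("A")
emoji_units = utf16_units("\U0001F600")

print([hex(u) for u in letter_units], is_high_surrogate(letter_units[0]))
print([hex(u) for u in emoji_units], is_high_surrogate(emoji_units[0]))
```

A decoder has to make this check on every unit it reads, which is the extra bookkeeping that fixed-width encodings avoid.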
However, as of May 2019, even Microsoft Windows seems to be veering away from UTF-16 and preferring UTF-8, which it now supports and recommends.
As for UTF-32, its current use focuses primarily on internal APIs (Application Programming Interfaces). Because UTF-32 is fixed-width and needs no variable-width preprocessing, it’s faster for APIs.
Unix and the programming language Python possess the ability to utilize UTF-32. However, UTF-8 still often wins as a top choice for encoding. Even though many programs can handle UTF-32, UTF-8 is more efficient and compact.
APIs that function internally only have to make calls to the operating system. UTF-32’s usage here demands less of an operating system than of a web browser.
However, both the software and the operating system must be communicating using UTF-32. Otherwise, the attempt is fruitless.
Despite being usable, UTF-32 still is not as viable as UTF-8 and not used nearly as often because it consumes so much more memory.