Introduction to Huffman coding
In computer science and information
theory, a Huffman code is
a particular type of optimal prefix code that
is commonly used for lossless data compression. The process of finding
and/or using such a code proceeds by means of Huffman coding,
an algorithm developed by David A. Huffman while
he was a Ph.D. student at MIT, and published in the 1952 paper "A
Method for the Construction of Minimum-Redundancy Codes".[1]
The output from Huffman's
algorithm can be viewed as a variable-length code table
for encoding a source symbol (such as a character in a file). The algorithm
derives this table from the estimated probability or frequency of occurrence (weight)
for each possible value of the source symbol. As in other entropy encoding methods,
more common symbols are generally represented using fewer bits than less common
symbols. Huffman's method can be implemented efficiently, finding a code in time linear in the number of input weights if these weights are sorted.[2] However, although optimal among methods
encoding symbols separately, Huffman coding is not always optimal among all
compression methods.
History of David A. Huffman
In 1951, David A. Huffman and
his MIT information
theory classmates were given the choice of a
term paper or a final exam. The professor, Robert M. Fano, assigned a term paper on
the problem of finding the most efficient binary code. Huffman, unable to prove
any codes were the most efficient, was about to give up and start studying for
the final when he hit upon the idea of using a frequency-sorted binary tree and
quickly proved this method the most efficient.[3]
In doing so, Huffman outdid
Fano, who had worked with information
theory inventor Claude Shannon to
develop a similar code. By building the tree from the bottom up instead of the
top down, Huffman avoided the major flaw of the suboptimal Shannon-Fano
coding.
Terminology
Huffman coding uses a specific
method for choosing the representation for each symbol, resulting in a prefix
code (sometimes called a "prefix-free code"; that is, the bit string
representing some particular symbol is never a prefix of the bit string
representing any other symbol). Huffman coding is such a widespread method for
creating prefix codes that the term "Huffman code" is widely used as
a synonym for "prefix code" even when such a code is not produced by
Huffman's algorithm.
Basic techniques
Compression
A source generates 4 different symbols {a1, a2, a3, a4} with probabilities {0.4, 0.35, 0.2, 0.05}. A binary
tree is generated from left to right taking the two least probable symbols and
putting them together to form another equivalent symbol having a probability
that equals the sum of the two symbols. The process is repeated until there is
just one symbol. The tree can then be read backwards, from right to left,
assigning different bits to different branches. The final Huffman code is:
Symbol | Code
a1     | 0
a2     | 10
a3     | 110
a4     | 111
The
standard way to represent a signal made of 4 symbols is by using 2 bits/symbol,
but the entropy of
the source is 1.74 bits/symbol. If this Huffman code is used to represent the
signal, then the average length is lowered to 1.85 bits/symbol; it is still far
from the theoretical limit because the probabilities of the symbols are
different from negative powers of two.
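To make the arithmetic concrete, here is a short Python sketch (mine, not part of the original article) that recomputes both figures from the probabilities above:

import math

# Symbol probabilities and Huffman code lengths from the table above.
probs = {"a1": 0.4, "a2": 0.35, "a3": 0.2, "a4": 0.05}
code_len = {"a1": 1, "a2": 2, "a3": 3, "a4": 3}

entropy = -sum(p * math.log2(p) for p in probs.values())
avg_len = sum(probs[s] * code_len[s] for s in probs)

print(f"entropy        = {entropy:.2f} bits/symbol")  # about 1.74
print(f"average length = {avg_len:.2f} bits/symbol")  # 1.85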
The technique works by creating
a binary tree of
nodes. These can be stored in a regular array, the size of which
depends on the number of symbols, n. A node can be either a leaf node or
an internal node. Initially,
all nodes are leaf nodes, which contain the symbol itself,
the weight (frequency
of appearance) of the symbol and optionally, a link to a parent node
which makes it easy to read the code (in reverse) starting from a leaf node.
Internal nodes contain symbol weight,
links to two
child nodes and
the optional link to a parent node.
As a common convention, bit '0' represents following the left child and bit '1'
represents following the right child. A finished tree has up to n leaf nodes and n−1 internal nodes. A Huffman tree that omits unused symbols produces the optimal code lengths.
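As an illustration only (a minimal sketch with hypothetical names, not code from the article), the node layout described above could look roughly like this in Python, with the optional parent link used to read a symbol's code in reverse from its leaf:

from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    weight: float                    # frequency of appearance
    symbol: Optional[str] = None     # set only on leaf nodes
    left: Optional["Node"] = None    # internal nodes link to two children
    right: Optional["Node"] = None
    parent: Optional["Node"] = None  # optional link back to the parent

def code_for(leaf: Node) -> str:
    """Walk parent links from a leaf up to the root, collecting bits in reverse."""
    bits = []
    node = leaf
    while node.parent is not None:
        # By convention, '0' means the left branch and '1' the right branch.
        bits.append("0" if node.parent.left is node else "1")
        node = node.parent
    return "".join(reversed(bits))

Here code_for is a hypothetical helper; the point is simply that parent links let a code be read leaf-to-root and then reversed.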
The process essentially begins
with the leaf nodes containing the probabilities of the symbol they represent,
then a new node whose children are the 2 nodes with smallest probability is
created, such that the new node's probability is equal to the sum of the
children's probabilities. With the previous 2 nodes merged into one node (thus
not considering them anymore), and with the new node being now considered, the
procedure is repeated until only one node remains, the Huffman tree.
The simplest construction algorithm uses a priority queue where the node with lowest probability is given highest priority (a short Python sketch follows these steps):
1. Create a leaf node for each symbol and add it to the priority queue.
2. While there is more than one node in the queue:
   1. Remove the two nodes of highest priority (lowest probability) from the queue.
   2. Create a new internal node with these two nodes as children and with probability equal to the sum of the two nodes' probabilities.
   3. Add the new node to the queue.
3. The remaining node is the root node and the tree is complete.
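A minimal Python sketch of this priority-queue construction, using the standard heapq module (the function name and tree representation are my own, not from the article):

import heapq
from itertools import count

def huffman_code(weights):
    """Build a {symbol: bitstring} code table from a {symbol: weight} mapping."""
    # Heap entries are (weight, tie_breaker, tree); a tree is either a symbol
    # (leaf) or a (left, right) pair (internal node).
    tick = count()
    heap = [(w, next(tick), sym) for sym, w in weights.items()]
    heapq.heapify(heap)

    if len(heap) == 1:  # degenerate case: a single symbol still needs one bit
        return {heap[0][2]: "0"}

    while len(heap) > 1:
        w1, _, t1 = heapq.heappop(heap)  # the two lowest-weight nodes
        w2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (w1 + w2, next(tick), (t1, t2)))

    codes = {}
    def assign(tree, prefix):
        if isinstance(tree, tuple):        # internal node: recurse into children
            assign(tree[0], prefix + "0")  # left branch -> '0'
            assign(tree[1], prefix + "1")  # right branch -> '1'
        else:
            codes[tree] = prefix           # leaf: record the accumulated bits
    assign(heap[0][2], "")
    return codes

# The four-symbol source from the Compression section; the exact bit patterns
# may differ from the table above, but the code lengths are the same.
print(huffman_code({"a1": 0.4, "a2": 0.35, "a3": 0.2, "a4": 0.05}))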
Since efficient priority queue
data structures require O(log n) time per
insertion, and a tree with n leaves
has 2n−1 nodes, this algorithm operates in O(n log n)
time, where n is
the number of symbols.
If the symbols are sorted by
probability, there is a linear-time (O(n))
method to create a Huffman tree using two queues,
the first one containing the initial weights (along with pointers to the
associated leaves), and combined weights (along with pointers to the trees)
being put in the back of the second queue. This assures that the lowest weight
is always kept at the front of one of the two queues:
1. Enqueue all leaf nodes into the first queue (by probability in increasing order so that the least likely item is at the head of the queue).
2. While there is more than one node in the queues:
   1. Dequeue the two nodes with the lowest weight by examining the fronts of both queues.
   2. Create a new internal node, with the two just-removed nodes as children (either node can be either child) and the sum of their weights as the new weight.
   3. Enqueue the new node into the rear of the second queue.
3. The remaining node is the root node; the tree has now been generated. (A Python sketch of this two-queue method follows below.)
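A sketch of the two-queue method in Python (assuming the input is already sorted by weight; the names are mine, not from the article):

from collections import deque

def huffman_from_sorted(symbol_weights):
    """Two-queue construction; symbol_weights is a list of (symbol, weight)
    pairs sorted by ascending weight. Returns a {symbol: bitstring} table."""
    # Queue 1 holds the initial leaves; queue 2 receives merged subtrees.
    # A tree is either a symbol (leaf) or a (left, right) pair.
    q1 = deque((w, sym) for sym, w in symbol_weights)
    q2 = deque()

    def pop_lightest():
        # The lightest remaining node is always at the front of one of the queues.
        if not q2 or (q1 and q1[0][0] <= q2[0][0]):
            return q1.popleft()
        return q2.popleft()

    if len(q1) == 1:
        return {q1[0][1]: "0"}  # degenerate single-symbol case

    while len(q1) + len(q2) > 1:
        w1, t1 = pop_lightest()
        w2, t2 = pop_lightest()
        q2.append((w1 + w2, (t1, t2)))  # merged node goes to the rear of queue 2

    codes = {}
    def walk(tree, prefix=""):
        if isinstance(tree, tuple):
            walk(tree[0], prefix + "0")
            walk(tree[1], prefix + "1")
        else:
            codes[tree] = prefix
    walk(q2[0][1])
    return codes

# Example: weights already sorted in increasing order.
weights = [("C", 1), ("D", 1), ("E", 1), ("F", 1),
           ("G", 1), ("H", 1), ("B", 3), ("A", 8)]
print(huffman_from_sorted(weights))

Because each node is enqueued and dequeued at most once, the whole construction is linear once the weights are sorted.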
How to generate Huffman Encoding Trees
This section provides practice in
the use of list structure and data abstraction to manipulate sets and trees.
The application is to methods for representing data as sequences of ones and
zeros (bits). For example, the ASCII standard code used to represent text in
computers encodes each character as a sequence of seven bits. Using seven bits
allows us to distinguish 2^7, or 128, possible different characters. In general, if we want to distinguish n different symbols, we will need to use log2(n) bits per symbol. If all our
messages are made up of the eight symbols A, B, C, D, E, F, G, and H, we can
choose a code with three bits per character, for example
A 000 | C 010 | E 100 | G 110
B 001 | D 011 | F 101 | H 111
With this code, the message BACADAEAFABBAAAGAH is encoded as the string of 54 bits
001000010000011000100000101000001001000000000110000111
Codes
such as ASCII and the A-through-H code above are known as fixed-length codes, because they represent each symbol in the message with
the same number of bits. It is sometimes advantageous to use variable-length codes, in which different symbols may be represented by
different numbers of bits. For example, Morse code does not use the same number
of dots and dashes for each letter of the alphabet. In particular, E, the most
frequent letter, is represented by a single dot. In general, if our messages
are such that some symbols appear very frequently and some very rarely, we can
encode data more efficiently (i.e., using fewer bits per message) if we assign
shorter codes to the frequent symbols. Consider the following alternative code
for the letters A through H:
A 0   | C 1010 | E 1100 | G 1110
B 100 | D 1011 | F 1101 | H 1111
With
this code, the same message as above is encoded as the string
100010100101101100011010100100000111001111
This
string contains 42 bits, so it saves more than 20% in space in comparison with
the fixed-length code shown above.
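As a quick check (my own sketch, not part of the text), the two code tables above can be compared directly:

# Fixed-length and variable-length code tables from above.
fixed = {"A": "000", "B": "001", "C": "010", "D": "011",
         "E": "100", "F": "101", "G": "110", "H": "111"}
variable = {"A": "0",    "B": "100",  "C": "1010", "D": "1011",
            "E": "1100", "F": "1101", "G": "1110", "H": "1111"}

message = "BACADAEAFABBAAAGAH"

def encode(msg, table):
    """Concatenate the code word for each symbol in the message."""
    return "".join(table[ch] for ch in msg)

print(len(encode(message, fixed)))     # 54 bits
print(len(encode(message, variable)))  # 42 bits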
One of the difficulties of using
a variable-length code is knowing when you have reached the end of a symbol in
reading a sequence of zeros and ones. Morse code solves this problem by using a
special separator code (in this case, a pause) after the
sequence of dots and dashes for each letter. Another solution is to design the
code in such a way that no complete code for any symbol is the beginning (or prefix) of the code for
another symbol. Such a code is called a prefix
code. In the example above, A is encoded by 0 and B is encoded by 100, so
no other symbol can have a code that begins with 0 or with 100.

In general, we can attain significant savings if we use variable-length prefix codes that take advantage of the relative frequencies of the symbols in the messages to be encoded. One particular scheme for doing this is called the Huffman encoding method, after its discoverer, David Huffman. A Huffman code can be represented as a binary tree whose leaves are the symbols that are encoded. At each non-leaf node of the tree there is a set containing all the symbols in the leaves that lie below the node. In addition, each symbol at a leaf is assigned a weight (which is its relative frequency), and each non-leaf node contains a weight that is the sum of all the weights of the leaves lying below it. The weights are not used in the encoding or the decoding process. We will see below how they are used to help construct the tree.
Given a Huffman tree, we can find the encoding of any symbol by starting at the root and moving down until we reach the leaf that holds the symbol. Each time we move down a left branch we add a 0 to the code, and each time we move down a right branch we add a 1. (We decide which branch to follow by testing to see which branch either is the leaf node for the symbol or contains the symbol in its set.) For example, starting from the root of the tree for the A-through-H code above, we arrive at the leaf for D by following a right branch, then a left branch, then a right branch, then a right branch; hence, the code for D is 1011.
To decode a bit sequence using a Huffman tree, we begin at the root and use the successive zeros and ones of the bit sequence to determine whether to move down the left or the right branch. Each time we come to a leaf, we have generated a new symbol in the message, at which point we start over from the root of the tree to find the next symbol. For example, suppose we are given the tree above and the sequence 10001010. Starting at the root, we move down the right branch, (since the first bit of the string is 1), then down the left branch (since the second bit is 0), then down the left branch (since the third bit is also 0). This brings us to the leaf for B, so the first symbol of the decoded message is B. Now we start again at the root, and we make a left move because the next bit in the string is 0. This brings us to the leaf for A. Then we start again at the root with the rest of the string 1010, so we move right, left, right, left and reach C. Thus, the entire message is BAC.
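A compact Python sketch of this tree representation and of the two walks just described (the tree is hand-built to match the A-through-H code above; all names are my own, not from the text):

# A leaf is ("leaf", symbol, weight); an internal node is
# ("node", left, right, symbols, weight), where `symbols` is the set of
# symbols at the leaves below it (used only while encoding).

def leaf(symbol, weight):
    return ("leaf", symbol, weight)

def node(left, right):
    return ("node", left, right, symbols(left) | symbols(right),
            weight(left) + weight(right))

def symbols(tree):
    return {tree[1]} if tree[0] == "leaf" else tree[3]

def weight(tree):
    return tree[2] if tree[0] == "leaf" else tree[4]

def encode_symbol(sym, tree):
    """Walk down from the root, emitting 0 for left and 1 for right."""
    bits = ""
    while tree[0] != "leaf":
        _, left, right, _, _ = tree
        if sym in symbols(left):
            bits, tree = bits + "0", left
        else:
            bits, tree = bits + "1", right
    return bits

def decode(bits, tree):
    """Use successive bits to walk down; restart at the root at each leaf."""
    out, current = [], tree
    for bit in bits:
        current = current[1] if bit == "0" else current[2]
        if current[0] == "leaf":
            out.append(current[1])
            current = tree
    return "".join(out)

# Hand-built tree matching the variable-length A-through-H code above.
tree = node(leaf("A", 8),
            node(node(leaf("B", 3),
                      node(leaf("C", 1), leaf("D", 1))),
                 node(node(leaf("E", 1), leaf("F", 1)),
                      node(leaf("G", 1), leaf("H", 1)))))

print(encode_symbol("D", tree))   # 1011
print(decode("10001010", tree))   # BAC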
Generating Huffman trees
Given an ``alphabet'' of symbols and their relative frequencies, how do we construct the ``best'' code? (In other words, which tree will encode messages with the fewest bits?) Huffman gave an algorithm for doing this and showed that the resulting code is indeed the best variable-length code for messages where the relative frequency of the symbols matches the frequencies with which the code was constructed. We will not prove this optimality of Huffman codes here, but we will show how Huffman trees are constructed.
The algorithm for generating a Huffman tree is very simple. The idea is to arrange the tree so that the symbols with the lowest frequency appear farthest away from the root. Begin with the set of leaf nodes, containing symbols and their frequencies, as determined by the initial data from which the code is to be constructed. Now find two leaves with the lowest weights and merge them to produce a node that has these two nodes as its left and right branches. The weight of the new node is the sum of the two weights. Remove the two leaves from the original set and replace them by this new node. Now continue this process. At each step, merge two nodes with the smallest weights, removing them from the set and replacing them with a node that has these two as its left and right branches. The process stops when there is only one node left, which is the root of the entire tree. Here is how the Huffman tree for the A-through-H alphabet was generated (a short Python sketch that repeats these steps follows the trace):
Initial leaves | {(A 8) (B 3) (C 1) (D 1) (E 1) (F 1) (G 1) (H 1)}
Merge          | {(A 8) (B 3) ({C D} 2) (E 1) (F 1) (G 1) (H 1)}
Merge          | {(A 8) (B 3) ({C D} 2) ({E F} 2) (G 1) (H 1)}
Merge          | {(A 8) (B 3) ({C D} 2) ({E F} 2) ({G H} 2)}
Merge          | {(A 8) (B 3) ({C D} 2) ({E F G H} 4)}
Merge          | {(A 8) ({B C D} 5) ({E F G H} 4)}
Merge          | {(A 8) ({B C D E F G H} 9)}
Final merge    | {({A B C D E F G H} 17)}
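The same sequence of merges can be traced mechanically; this small Python sketch (mine, not from the text) repeats the steps and prints the working set after each merge (ties among equal weights may merge in a different order than shown above):

import heapq
from itertools import count

weights = {"A": 8, "B": 3, "C": 1, "D": 1, "E": 1, "F": 1, "G": 1, "H": 1}

tick = count()  # tie-breaker so equal weights never compare the symbol tuples
heap = [(w, next(tick), (s,)) for s, w in weights.items()]
heapq.heapify(heap)

def show(entries):
    # Display the working set, heaviest group first, like the trace above.
    return " ".join(f"({{{' '.join(syms)}}} {w})" if len(syms) > 1 else f"({syms[0]} {w})"
                    for w, _, syms in sorted(entries, reverse=True))

print("Initial leaves:", show(heap))
while len(heap) > 1:
    w1, _, s1 = heapq.heappop(heap)  # the two lightest groups
    w2, _, s2 = heapq.heappop(heap)
    heapq.heappush(heap, (w1 + w2, next(tick), s1 + s2))
    print("Merge:", show(heap))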
Gathered
by Mirwise Bhatti Mcs 3rd