This website uses Huffman Coding to compress data!
Huffman Coding is a popular technique used for Lossless Data Compression.
Lossless Compression:
Lossless compression is a class of data compression algorithms that allows the original data to be perfectly reconstructed from the compressed data. Lossless compression methods are reversible. Examples of lossless compression include GZIP, Brotli, WebP, and PNG. Lossless compression is preferred for text file compression, while lossy compression is generally preferred for audio, video and image files. File formats like ZIP use lossless compression, while formats like MP3 and JPEG use lossy compression.
Data compression ratio, also known as compression power, is a measurement of the relative reduction in size of data representation produced by a data compression algorithm. It is typically expressed as the division of uncompressed size by compressed size. Thus, a representation that compresses a file's storage size from 10 MB to 2 MB has a compression ratio of 10/2 = 5.
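As a minimal illustration (a Python sketch; the sizes are just the 10 MB to 2 MB example above):

    def compression_ratio(uncompressed_size: float, compressed_size: float) -> float:
        """Compression ratio = uncompressed size divided by compressed size."""
        return uncompressed_size / compressed_size

    # The 10 MB -> 2 MB example from above gives a ratio of 5.0
    print(compression_ratio(10, 2))  # 5.0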
Greedy Algorithm:
Greedy is an algorithmic paradigm that builds up a solution piece by piece, always choosing the next piece that offers the most obvious and immediate benefit. Greedy algorithms are used for optimization problems.
Following are some standard algorithms that are Greedy algorithms:
1) Kruskal’s Minimum Spanning Tree (MST):
In Kruskal’s algorithm, we create an MST by picking edges one by one. The Greedy Choice is to pick the smallest-weight edge that doesn’t cause a cycle in the MST constructed so far.
2) Prim’s Minimum Spanning Tree:
In Prim’s algorithm, too, we create an MST by picking edges one by one. We maintain two sets: the vertices already included in the MST and the vertices not yet included. The Greedy Choice is to pick the smallest-weight edge that connects the two sets.
3) Dijkstra’s Shortest Path:
Dijkstra’s algorithm is very similar to Prim’s algorithm. The shortest-path tree is built up edge by edge. We maintain two sets: the vertices already included in the tree and the vertices not yet included. The Greedy Choice is to pick the edge that connects the two sets and lies on the smallest-weight path from the source to a not-yet-included vertex.
4) Huffman Coding:
Huffman Coding is a lossless compression technique. It assigns variable-length bit codes to different characters. The Greedy Choice is to assign the shortest code to the most frequent character.
Greedy algorithms are sometimes also used to get approximations for hard optimization problems. For example, the Traveling Salesman Problem is an NP-Hard problem. A Greedy choice for this problem is to pick the nearest unvisited city from the current city at every step. These choices don’t always produce an optimal solution, but they can be used to get an approximately optimal one.
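As a sketch of that greedy choice (a Python sketch; the 4-city distance matrix below is invented for illustration, not taken from the text):

    # Greedy nearest-neighbor heuristic for TSP: from the current city,
    # always visit the closest unvisited city next. This is an
    # approximation, not an optimal solver.
    def nearest_neighbor_tour(dist: list[list[float]], start: int = 0) -> list[int]:
        n = len(dist)
        tour = [start]
        unvisited = set(range(n)) - {start}
        while unvisited:
            current = tour[-1]
            # Greedy choice: the nearest unvisited city
            next_city = min(unvisited, key=lambda city: dist[current][city])
            tour.append(next_city)
            unvisited.remove(next_city)
        return tour

    # Made-up symmetric distance matrix for 4 cities.
    distances = [
        [0, 2, 9, 10],
        [2, 0, 6, 4],
        [9, 6, 0, 8],
        [10, 4, 8, 0],
    ]
    print(nearest_neighbor_tour(distances))  # [0, 1, 3, 2]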
Here, let us look at one such problem that can be solved using a Greedy algorithm. An optimization problem can be solved using Greedy if it has the following property: at every step, we can make the choice that looks best at the moment and still obtain the optimal solution to the complete problem.
Huffman Code:
A Huffman code is a particular type of optimal prefix code that is commonly used for lossless data
compression. The process of finding or using such a code proceeds by means of Huffman coding, an algorithm
developed by David A. Huffman.
The output from Huffman's algorithm can be viewed as a variable-length code table
for encoding a source symbol (such as a character in a file). The algorithm derives this table from the
estimated probability or frequency of occurrence (weight) for each possible value of the source symbol.
Prefix Codes means that the codes (bit sequences) are assigned in such a way that the code assigned to one character is never a prefix of the code assigned to any other character. This is how Huffman Coding makes sure that there is no ambiguity when decoding the generated bitstream.
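As a small sketch (a Python sketch; the code tables below are made-up examples, not the ones used by this site), the prefix property is easy to check:

    # A prefix-free code: no codeword is a prefix of another,
    # so a bitstream can be decoded without ambiguity.
    def is_prefix_free(codes: dict[str, str]) -> bool:
        words = list(codes.values())
        return not any(
            a != b and b.startswith(a) for a in words for b in words
        )

    print(is_prefix_free({"A": "0", "B": "10", "C": "110", "D": "111"}))  # True
    print(is_prefix_free({"A": "0", "B": "01"}))  # False: "0" is a prefix of "01"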
[Figure: a Huffman coding tree built from the character frequency table A = 0.6, B = 0.2, C = 0.07, D = 0.06, E = 0.05, F = 0.02.]
Compression:
Huffman Encoding is an algorithm that uses the frequency (or probability) of symbols together with a binary tree structure. It consists of the following 3 steps:
1. Probability Calculation & Ordering the Symbols
2. Binary Tree Transformation
3. Assigning Codes to the Symbols
We count the number of occurrences of each symbol in the whole data, then we calculate the “probability” of each symbol by dividing that count by the total number of characters in the data. Since the algorithm works with probabilities, more common symbols (those with higher probability) are generally represented using fewer bits than less common symbols. This is one of the advantages of Huffman Encoding.
As an example, take the data AAAAAAABCCCCCCDDEEEEE, which contains 21 characters drawn from 5 different symbols A B C D E (the same example is used in the de-compression section below). The probabilities are: A = 7/21, B = 1/21, C = 6/21, D = 2/21, E = 5/21.
Then we order the symbols according to their probabilities, representing each symbol as a node, and call that our “collection”. Now we are ready to move on to the next step.
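A minimal Python sketch of this step, using the example data above:

    from collections import Counter

    data = "AAAAAAABCCCCCCDDEEEEE"

    # Count each symbol, then turn counts into probabilities.
    counts = Counter(data)
    total = len(data)
    probabilities = {symbol: count / total for symbol, count in counts.items()}

    # Order the symbols by probability, lowest first; this ordered list
    # of (symbol, probability) nodes is our "collection".
    collection = sorted(probabilities.items(), key=lambda item: item[1])
    print(collection)
    # [('B', 0.047...), ('D', 0.095...), ('E', 0.238...), ('C', 0.285...), ('A', 0.333...)]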
Binary Tree Transformation:
1. From the collection, we pick out the two nodes with the smallest probabilities and combine them into a new tree whose root has a probability equal to their sum.
2. We add the new tree back into the collection.
3. We repeat this process until one tree encompassing all the input probabilities has been constructed.
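A minimal Python sketch of the whole build, using the example data above (the node representation and the left = 0 / right = 1 convention are implementation choices, not fixed by the algorithm):

    import heapq
    from collections import Counter

    def build_huffman_codes(data: str) -> dict[str, str]:
        # Each heap entry is (probability, tie_breaker, tree); a tree is
        # either a symbol string (leaf) or a (left, right) pair of subtrees.
        counts = Counter(data)
        total = len(data)
        heap = [(count / total, i, symbol)
                for i, (symbol, count) in enumerate(counts.items())]
        heapq.heapify(heap)
        tie = len(heap)
        while len(heap) > 1:
            # Greedy choice: merge the two lowest-probability trees.
            p1, _, left = heapq.heappop(heap)
            p2, _, right = heapq.heappop(heap)
            heapq.heappush(heap, (p1 + p2, tie, (left, right)))
            tie += 1
        _, _, root = heap[0]

        # Walk the tree, appending "0" for a left edge and "1" for a right edge.
        codes = {}
        def assign(node, prefix):
            if isinstance(node, str):          # leaf: a symbol
                codes[node] = prefix or "0"    # handles a one-symbol alphabet
            else:
                assign(node[0], prefix + "0")
                assign(node[1], prefix + "1")
        assign(root, "")
        return codes

    data = "AAAAAAABCCCCCCDDEEEEE"
    codes = build_huffman_codes(data)
    encoded = "".join(codes[ch] for ch in data)
    print(codes)
    print(encoded, "-", len(encoded), "bits")

The exact bit patterns depend on how ties are broken and on the left/right convention, so they may differ from the bitstream shown in the de-compression section below, but the total encoded length (45 bits for this example) is the same for any Huffman tree built from these frequencies.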
In conclusion, the compression ratio depends on the symbol distribution rather than on the amount of data, so it stays roughly constant as the data grows; for typical text it comes out close to 2:1. We can say that Huffman Encoding compresses such data to about half its size. Although it is old, it is still an effective compression algorithm!
De-compression:
With the binary Huffman Tree obtained during the encoding phase, decoding is a very simple process.
Let’s consider the same example as in the compression section above: we have AAAAAAABCCCCCCDDEEEEE as our initial data and 000000000000001110101010101011101101010101010 as the encoded output, together with the Huffman Tree built during encoding.
Now the only thing we need to do is start from the root of the Huffman Tree and from the beginning of the encoded data: each time we encounter a 1 we go right, and each time we encounter a 0 we go left. When we reach a leaf node, we output its symbol. Then we start again from the root of the Huffman Tree while moving forward through the encoded data.
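A minimal decoding sketch in Python, assuming the same nested (left, right) tree representation as the encoding sketch above; the tree literal below is one valid Huffman tree for this example, chosen so that its codes (A=00, C=01, E=10, D=110, B=111) reproduce the bitstream shown:

    # Decode by walking the tree: 0 = go left, 1 = go right;
    # a string node is a leaf.
    tree = (("A", "C"), ("E", ("D", "B")))

    def huffman_decode(bits: str, tree) -> str:
        decoded = []
        node = tree
        for bit in bits:
            node = node[0] if bit == "0" else node[1]
            if isinstance(node, str):      # reached a leaf: emit its symbol
                decoded.append(node)
                node = tree                # restart from the root
        return "".join(decoded)

    encoded = "000000000000001110101010101011101101010101010"
    print(huffman_decode(encoded, tree))   # AAAAAAABCCCCCCDDEEEEE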
Data compression is a topic with many applications, and there are various other types of algorithms besides the “frequency-based” Huffman Algorithm. You can check out “dictionary-based” methods like LZ77, LZ78 and LZW, which underpin widely used formats such as ZIP, GZIP, PNG (via DEFLATE) and GIF.