Differential compression and backup

Differential compression Most compression schemes are dealing with isolated data, while differential compression is about compressing data which relates to some other known data. This can be used to compress a newer version of a document, by basing the compression of an older version of the same document. This is also known as incremental compression, and depending on its use, as incremental backup. I was hired by a company providing backups, to come up with a usable method, which would allow a user to execute incremental backups, even though that user no longer had the older version of the data. That data was stored on a remote server, and it should not have to be transmitted to the user in order for him to calculate the difference. Instead a hashing method was used.

The document describing the method is located here, but it is written in Danish.

The demo project, which lets you play around with different coding blocks on your own data, is located here. I experimented with designing a cleaner user interface as well, and it turned out alright I think.

The general idea is that a user wants to take a backup of version1 of some data. He stores the entire file on a remote server and calculates a weak and a strong hash value for each block of the data. A block is a custom number of bytes of the data, starting at the beginning and counting through the whole file. The hash values are stored on the remote server as well. This takes up slightly more space than the data itself would do.

Some time later, the user has changed the data in some way, and wants to make a new backup. This time he looks at the first block of the new version and calculates a weak hash value. This value is compared to the hash values of the original version. If there is a match, he compares the strong hash values as well. If they also matches, then he trust that the blocks in the old and the new data are identical, and he makes a note of this in the differential backup file. If there is no match, then he stores the first byte of the current block, and slides the block one byte and then repeat the process. After a complete scan, he uploads the new hash values for the blocks along with the differential backup file. Unless the data has changes very much, this differential file and the hash values will take up far less space than the data itself. Now only the differences are stored.

The above description is very short, and may seem confusing, but the method is described in details in the document. Unfortunately the document is encrypted using the protocol known as Danish :-/ but if you are curious, feel free to drop me a mail.

Thomas Grønneløv,
Apr 12, 2011, 11:22 AM
Thomas Grønneløv,
Apr 12, 2011, 11:22 AM