Data Reduction for Cloud Storage
One of the most important factors to consider when using cloud storage is the cost of transport. Most cloud storage providers charge you not only for storing data, but also for transporting it (upload/download). These transport charges often make up the vast majority of cloud storage cost.
To minimize these costs, it is vital to remember the three techniques of data reduction: compression, incrementalization, and deduplication. Let's take a brief look at all three of these valuable techniques.
Compression is most certainly not new, but it is coming back 'into vogue' as a method of data reduction. Most filesystems have had compression built in for years, but the CPU horsepower required was daunting, and performance suffered. Today, however, CPUs are powerful enough that it is advisable to use compression wherever possible. Even a 2:1 reduction in data provides significant savings, and most compression algorithms in use should be able to achieve that ratio on compressible data such as text, logs, or documents. Data that is already compressed or encrypted will shrink very little, if at all.
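To make the ratio concrete, here is a minimal sketch that measures how much a buffer shrinks before upload. It uses Python's standard zlib module purely as an illustration; the helper name `compression_ratio` and the sample log line are assumptions made up for this example, not part of any particular cloud provider's tooling.

```python
import os
import zlib


def compression_ratio(data: bytes, level: int = 6) -> float:
    """Return original size divided by compressed size (2.0 means 2:1)."""
    compressed = zlib.compress(data, level)
    return len(data) / len(compressed)


# Hypothetical sample: a highly repetitive log file, which compresses well.
log_data = b"2024-01-01 INFO request served in 12 ms\n" * 1000

# Random bytes stand in for already-compressed or encrypted content.
random_data = os.urandom(40_000)

if __name__ == "__main__":
    print(f"log data:    {compression_ratio(log_data):.1f}:1")
    print(f"random data: {compression_ratio(random_data):.2f}:1")
    # Compression is lossless: decompressing returns the original bytes.
    assert zlib.decompress(zlib.compress(log_data)) == log_data
```

Running a check like this on a sample of your own files before uploading is a cheap way to see whether compression is worth the CPU time for that dataset.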
Incrementalization (one of my favorite seven-syllable words!) is another old technique which is finding new favor. It is also known as 'versioning' - i.e. the technique of storing only the ongoing changes to a file instead of the entire file. Many of us have adopted a manual form of versioning via subtle changes in filenames - e.g. foo1.txt, foo2.txt, and so on. While this may be useful, it's certainly wasteful of storage - each copy is stored in full, even though the files may differ by only a few bytes out of millions. It's very good practice to use incrementalization when possible, such as in a file archive or shared repository. Incrementalization is especially useful for long-lived files, i.e. those updated and stored for months or years.
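The idea of storing only the changes can be sketched in a few lines. The example below is an illustration under stated assumptions, not how any specific versioning product works: it uses Python's standard difflib to compare two versions line by line, and the names `make_delta`, `apply_delta`, and `literal_bytes` are hypothetical helpers invented for this sketch.

```python
from difflib import SequenceMatcher


def make_delta(old_lines, new_lines):
    """Describe new_lines as copies from old_lines plus literal new lines."""
    ops = []
    matcher = SequenceMatcher(None, old_lines, new_lines, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            # Unchanged run: record only the range, not the data.
            ops.append(("copy", i1, i2))
        elif j2 > j1:
            # Replaced or inserted lines: these must be stored in full.
            ops.append(("data", new_lines[j1:j2]))
        # Pure deletions need no entry; the lines are simply not copied.
    return ops


def apply_delta(old_lines, ops):
    """Rebuild the new version from the old version and the delta."""
    out = []
    for op in ops:
        if op[0] == "copy":
            out.extend(old_lines[op[1]:op[2]])
        else:
            out.extend(op[1])
    return out


def literal_bytes(ops):
    """Count the bytes of new data the delta actually has to carry."""
    return sum(len(line) for op in ops if op[0] == "data" for line in op[1])


# Hypothetical file: 1000 records, of which exactly one is edited.
v1 = [f"record {i}\n".encode() for i in range(1000)]
v2 = list(v1)
v2[500] = b"record 500 (edited)\n"

if __name__ == "__main__":
    delta = make_delta(v1, v2)
    assert apply_delta(v1, delta) == v2
    print(f"full file: {sum(map(len, v2))} bytes, "
          f"new data in delta: {literal_bytes(delta)} bytes")
```

Only the delta needs to be uploaded and stored for the second version; the first version plus the delta is enough to rebuild it.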
Finally, deduplication has arisen over the last few years as a powerful and popular technique. While this blog will not go into excruciating detail on the topic - there are some excellent works already, including SNIA Tutorials (www.snia.org/education/tutorials) - it is useful to discuss dedupe in the context of cloud storage. The most common technique today is to perform dedupe on data that is already cloud-resident, thus saving space and therefore money in cloud storage charges. Later, when the content is downloaded, you save even more by downloading only the deduped data, then performing the "un-dedupe" (a.k.a. rehydration) locally. Dedupe ratios of 10:1 or greater are not unheard of, especially for datasets such as backups or email stores.
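For readers who have not seen dedupe up close, here is a minimal sketch of the basic mechanism: split the data into chunks, store each unique chunk once under its hash, and keep an ordered list of hashes to rebuild the original. The fixed 4096-byte chunk size, the use of SHA-256, and the names `dedupe` and `rehydrate` are assumptions chosen for this illustration; real products often use variable-size chunking and differ in many details.

```python
import hashlib


def dedupe(data: bytes, chunk_size: int = 4096):
    """Split data into fixed-size chunks and keep one copy of each unique chunk."""
    store = {}   # digest -> chunk bytes, each unique chunk stored once
    recipe = []  # ordered digests needed to rebuild the original data
    for offset in range(0, len(data), chunk_size):
        chunk = data[offset:offset + chunk_size]
        digest = hashlib.sha256(chunk).hexdigest()
        store.setdefault(digest, chunk)
        recipe.append(digest)
    return store, recipe


def rehydrate(store, recipe) -> bytes:
    """Rebuild ("un-dedupe") the original data from the chunk store and recipe."""
    return b"".join(store[digest] for digest in recipe)


# Hypothetical dataset: ten identical "backups" of the same 16 KiB block.
block = b"".join(bytes([i]) * 4096 for i in range(4))
backups = block * 10

if __name__ == "__main__":
    store, recipe = dedupe(backups)
    stored = sum(len(chunk) for chunk in store.values())
    assert rehydrate(store, recipe) == backups
    print(f"dedupe ratio: {len(backups) / stored:.1f}:1")  # prints "dedupe ratio: 10.0:1"
```

In this contrived case only the chunk store and the recipe would need to be stored or transferred, and rehydration happens locally after download.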
There are certainly many variations on this theme, but one thing is clear - as long as there are charges for data transport in cloud storage, data reduction is in order. It's pretty obvious how data reduction can help you save money and time for your cloud-stored files - after all, the less data you have to store and transport, the less it costs!
- Robert Peglar's blog