Choosing The Right Data Deduplication Solution

Data deduplication is the only way to dramatically reduce data volumes, slash storage requirements and minimize data protection costs and risks. Find out how to pick the right deduplication solution for your company.

By Fadi Albatal, Director of Marketing, FalconStor Software

Dec 9, 2008

The investment banking system in the United States experienced a financial crisis late in the third quarter of 2008. Companies in all industries are deploying strict expense controls as a result of experiencing lower revenues. Every IT department in the manufacturing world is feeling the pressure. The directive now (and for the foreseeable future) is to reduce capital expenditures, lower operating costs and save energy.

Now is a time for manufacturing IT professionals to think outside the box and investigate technologies that can result in greater efficiency and return on investment. What may have simply been a good idea before is now a necessity, which is why the adoption of deduplication technology has increased.

Deduplication is widely regarded as the next evolutionary step in backup technology. The benefits are tangible and extremely practical: eliminating duplicate data in secondary storage archives can slash costs, streamline management tasks and minimize the bandwidth required to replicate data. In short, deduplication improves efficiency and saves money.

There is a wide variety of deduplication providers, so how does a manufacturing company choose and deploy the right solution?

Each vendor lays claim to having the best approach to data deduplication, leaving customers to determine which factors are most important to their business. Manufacturing companies must consider a number of key factors in order to select a data deduplication solution that actually delivers cost-effective, high-performance and scalable long-term data storage.

There are eight important criteria to consider when evaluating data deduplication solutions:

1. Address The Principal Issue

The first consideration is whether the solution attacks the area where the largest problem exists: backup data in secondary storage. Duplication can make backup data consume many times the storage it would need if the duplicate data were eliminated. When considered across multiple servers at multiple sites, the opportunity for storage reduction by implementing a data deduplication solution becomes huge.
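To make the scale of the problem concrete, the arithmetic can be sketched as follows. The figures (a 10 TB weekly full backup, a 5 percent weekly change rate and 12 weeks of retention) are hypothetical and chosen only for illustration, and the model assumes ideal deduplication in which only changed data is stored after the first backup.

```python
def backup_storage_tb(full_tb, weekly_change, weeks_retained):
    """Return (raw_tb, deduplicated_tb) for retained weekly full backups."""
    # Without deduplication, every retained full backup is stored in its entirety.
    raw = full_tb * weeks_retained
    # With ideal deduplication, the first full backup is stored once and each
    # later backup adds only the fraction of data that changed since the last one.
    deduplicated = full_tb + full_tb * weekly_change * (weeks_retained - 1)
    return raw, deduplicated


# Hypothetical inputs: 10 TB full backup, 5% weekly change, 12 weeks retained.
raw, dedup = backup_storage_tb(full_tb=10, weekly_change=0.05, weeks_retained=12)
print(f"raw: {raw:.1f} TB, deduplicated: {dedup:.1f} TB, ratio: {raw / dedup:.1f}:1")
# prints: raw: 120.0 TB, deduplicated: 15.5 TB, ratio: 7.7:1
```

Real results depend on the actual change rate and retention policy, so these numbers should be read as a way to reason about the problem rather than as a prediction.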

2. Assimilation With Current Environment

An effective data deduplication solution should be as non-disruptive as possible. Many manufacturing companies are turning to virtual tape libraries (VTLs) to improve the quality of their backup without disruptive changes to policies, procedures or software. Others are deploying a disk-to-disk backup paradigm, which requires a deduplication solution to present a network interface to the backup application. Introducing deduplication into this process simplifies and enhances disk-to-disk backups, performing deduplication without disruption to ongoing operations.

3. Virtual Tape Library Capability

If data deduplication technology is implemented around a virtual tape library (VTL), the capabilities of the VTL itself must be considered as part of the evaluation process. It is unlikely that the savings from data deduplication will outweigh the difficulties caused by using a sub-standard VTL. Consider the functionality, performance, stability and support of the VTL, as well as its deduplication extension.

4. Impact On Backup Performance

It is important to consider where and when data deduplication takes place in relation to the backup process. Some solutions attempt deduplication while data is being backed up. This inline method processes the backup stream as it comes into the deduplication appliance. Such an approach can slow down backups, jeopardize backup windows and degrade VTL performance over time.

By comparison, data deduplication solutions that run after backup jobs complete, or concurrently with backup processes, avoid this problem and have no adverse impact on backup performance. This post-processing method processes the backup data by reading it from the backup repository after backups have been cached to disk.
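The difference between the two methods can be illustrated with a minimal sketch. It is not any vendor's implementation: the function names, the in-memory dictionary standing in for the repository and the list standing in for the disk cache are all assumptions made for the example.

```python
import hashlib


def dedupe_blocks(blocks, store):
    """Add blocks to a hash-keyed store; return the list of hashes (the recipe)."""
    recipe = []
    for block in blocks:
        digest = hashlib.sha256(block).hexdigest()
        store.setdefault(digest, block)  # keep only the first copy of each block
        recipe.append(digest)
    return recipe


def backup_inline(blocks, store):
    # Inline: hashing and index lookups happen inside the backup window.
    return dedupe_blocks(blocks, store)


def backup_post_process(blocks, staging):
    # Post-processing: the backup window only writes raw blocks to a disk cache.
    staging.extend(blocks)


def post_process_pass(staging, store):
    # Later, outside the backup window: read the cache, deduplicate, free the cache.
    recipe = dedupe_blocks(staging, store)
    staging.clear()
    return recipe


blocks = [b"A" * 4096, b"B" * 4096, b"A" * 4096]  # third block repeats the first
inline_store, post_store, staging = {}, {}, []

inline_recipe = backup_inline(blocks, inline_store)
backup_post_process(blocks, staging)
staged_during_backup = len(staging)  # raw blocks held before the pass runs
post_recipe = post_process_pass(staging, post_store)

print(len(inline_store), len(post_store), staged_during_backup)  # prints: 2 2 3
```

Both paths end with the same two unique blocks. The sketch also shows the trade-off: post-processing keeps hashing out of the backup window, but it needs enough cache capacity to hold the raw backup until the pass runs.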

5. Scalability

Because the solution is being chosen for longer-term data storage, scalability, in terms of both capacity and performance, is an important consideration. Consider growth expectations over five years or more. How much data will you want to keep on disk for fast access? How will the data index system scale to your requirements? A deduplication solution should provide an architecture that allows economic “right-sizing” for both the initial implementation and the long-term growth of the system.
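One way to approach the index question is a back-of-the-envelope estimate. The sketch below assumes an average chunk size of 8 KiB and 40 bytes per index entry. Both values are illustrative assumptions, not figures from any particular product, and should be replaced with the numbers a vendor actually provides.

```python
def index_size_gib(unique_data_tib, avg_chunk_kib=8, bytes_per_entry=40):
    """Estimate deduplication index size for a given amount of unique data.

    avg_chunk_kib and bytes_per_entry are hypothetical defaults.
    """
    # 1 TiB = 1024**3 KiB, so this is the number of chunks in the repository.
    chunks = unique_data_tib * 1024**3 // avg_chunk_kib
    # One index entry per unique chunk, converted from bytes to GiB.
    return chunks * bytes_per_entry / 1024**3


for tib in (10, 50, 250):
    print(f"{tib:>4} TiB unique data -> ~{index_size_gib(tib):.0f} GiB index")
```

Under these assumptions the index grows linearly with the amount of unique data, which is why it is worth asking how a solution stores and searches its index as the repository grows.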

6. Distributed Topology Support

Data deduplication is a technology that can deliver benefits throughout a distributed enterprise, not just in a single data center. A solution that includes replication and multiple levels of deduplication can achieve maximum benefits from the technology.

For example, a manufacturing company with multiple sites and a secure disaster recovery (DR) facility should be able to implement deduplication in the regional offices to facilitate efficient local storage and replication to the central site. Only unique data across all sites should be replicated to the central site and subsequently to the DR site, to avoid excessive bandwidth requirements.
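The idea of replicating only unique data can be sketched in a few lines. This is a simplified illustration, assuming fixed-size chunks identified by SHA-256 hashes and a dictionary standing in for the central repository; the function names are hypothetical.

```python
import hashlib


def chunk_hashes(data, chunk_size=4096):
    """Map SHA-256 digest -> chunk for fixed-size chunks of data."""
    return {
        hashlib.sha256(data[i:i + chunk_size]).hexdigest(): data[i:i + chunk_size]
        for i in range(0, len(data), chunk_size)
    }


def replicate(site_chunks, central_store):
    """Send only chunks the central store lacks; return bytes transferred."""
    sent = 0
    for digest, chunk in site_chunks.items():
        if digest not in central_store:
            central_store[digest] = chunk
            sent += len(chunk)
    return sent


central = {}
site_a = chunk_hashes(b"X" * 8192 + b"Y" * 4096)  # unique chunks: X, Y
site_b = chunk_hashes(b"Y" * 4096 + b"Z" * 4096)  # unique chunks: Y, Z

sent_a = replicate(site_a, central)
sent_b = replicate(site_b, central)
print(sent_a, sent_b, len(central))  # prints: 8192 4096 3
```

Site A holds 12 KB but sends only 8 KB because its two X chunks are identical, and site B sends only its Z chunk because the central site already received Y from site A. The same check can be repeated between the central site and the DR site.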

7. Availability Of A Deduplication Repository

It is extremely important to create a highly available deduplication repository. Since a very large amount of data has been consolidated in one location, risk tolerance for data loss is very low. Access to the deduplicated data repository is critical and should not be vulnerable to a single point of failure. A robust data deduplication solution will include mirroring to protect against local storage failure, as well as replication to protect against disaster.

8. Efficiency And Effectiveness

File-based deduplication approaches do not reduce storage capacity requirements as much as those that analyze data at a sub-file or block level. For example, consider changing and saving a single line in a four-megabyte presentation. In a file-based solution, the entire modified file must be stored alongside the original, doubling the storage required. If the presentation is sent to multiple people, as presentations often are, the negative effects multiply.

If the solution can segment the data and look for duplication in chunks within the data files themselves, it will detect much more duplicate data. Some solutions even adjust chunk size based on information gleaned from the data formats. This technique can lead to a 30 to 40 percent increase in the amount of duplicate data detected and can have a major impact on the cost-effectiveness of the solution.
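The presentation example can be worked through with a short sketch that compares the two approaches. It assumes fixed 4 KB chunks and SHA-256 hashes, and it generates a stand-in four-megabyte file instead of a real presentation; none of these choices reflect a specific product.

```python
import hashlib

CHUNK = 4096  # hypothetical fixed chunk size


def make_data(n_bytes):
    """Deterministic pseudo-random bytes so that no two chunks are alike."""
    return b"".join(hashlib.sha256(i.to_bytes(4, "big")).digest()
                    for i in range(n_bytes // 32))


def stored_file_level(files):
    """Bytes stored when only whole identical files are deduplicated."""
    unique = {hashlib.sha256(f).hexdigest(): len(f) for f in files}
    return sum(unique.values())


def stored_block_level(files):
    """Bytes stored when identical fixed-size chunks are deduplicated."""
    unique = {}
    for f in files:
        for i in range(0, len(f), CHUNK):
            chunk = f[i:i + CHUNK]
            unique[hashlib.sha256(chunk).hexdigest()] = len(chunk)
    return sum(unique.values())


original = make_data(4 * 1024 * 1024)  # stand-in for the 4 MB presentation
edited = bytearray(original)
edited[1000:1010] = b"new title!"      # overwrite 10 bytes in place
edited = bytes(edited)

print(stored_file_level([original, edited]))   # prints: 8388608 (two full copies)
print(stored_block_level([original, edited]))  # prints: 4198400 (one extra chunk)
```

The edit here overwrites bytes in place on purpose. With fixed-size chunks, inserting bytes would shift every later chunk boundary and hide the duplication, which is one reason some solutions vary chunk boundaries based on the data itself.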

Focus On The Total Solution

In today’s environment, as stored data volumes continually increase while IT spending at manufacturing companies decreases, data deduplication is fast becoming a vital technology. Data deduplication is the only way to dramatically reduce data volumes, slash storage requirements and minimize data protection costs and risks.

Although the benefits of data deduplication are dramatic, manufacturing organizations should not be seduced by the hype sometimes attributed to the technology. No matter the approach, the amount of data deduplication that can occur is driven by the nature of the data and the policies used to protect it.

In order to achieve the maximum benefit of deduplication, manufacturing organizations should choose data deduplication solutions based on a comprehensive set of quantitative and qualitative factors rather than relying solely on statistics such as theoretical data reduction ratios.

FalconStor Software provides open, centralized, unified data protection and storage infrastructure across multi-vendor and multi-platform environments. For more information, visit https://www.falconstor.com/