Overview:
Around 3 terabytes of data is added every month, and the volume is trending upward. Managing the storage and processing required to meet reporting needs was a challenge. The earlier setup used Teradata as the database. While Teradata itself can manage the ever-increasing data volume, scaling it to match future data growth would be prohibitively expensive, with possible performance issues in massive data processing on top of that.
As such, the client opted for a cost-effective replacement system.
In addition, it was imperative to re-engineer the existing data model to optimize the overall processing steps for reporting.
To address these bottlenecks, it was proposed to replace Teradata with Hadoop, which offers the following advantages:
- Built to hold massive data in a multi-node cluster environment
- Reduced cost, since it can be set up on commodity hardware
- Distributed parallel processing using Hive and MapReduce
- Highly scalable and fault tolerant
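To make the "distributed parallel processing" point concrete, below is a minimal, self-contained sketch of the map/reduce model that Hadoop applies across the cluster, simulated locally in plain Python. The record layout, field names, and the account-total aggregation are illustrative assumptions, not details of the actual system.

```python
from collections import defaultdict

def mapper(record):
    """Emit (account_id, amount) pairs from one pipe-delimited transaction record.

    Assumed record layout: account_id|date|amount (illustrative only).
    """
    account_id, _, amount = record.split("|")
    yield account_id, float(amount)

def reducer(key, values):
    """Aggregate all amounts emitted for one key (account)."""
    return key, sum(values)

def run_job(records):
    """Simulate Hadoop's shuffle phase locally: group mapper output by key,
    then apply the reducer to each group. In a real cluster, the mapper and
    reducer would run in parallel on many nodes over many input splits."""
    grouped = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            grouped[key].append(value)
    return dict(reducer(k, v) for k, v in grouped.items())
```

In a real deployment the same mapper/reducer logic would run via Hadoop Streaming or be expressed declaratively in Hive, with the framework handling partitioning, shuffling, and fault tolerance.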
Analysis also showed that introducing a common reporting platform, in the form of a new data layer shared by multiple applications, would help optimize the overall processing steps as well as the storage needs.
Solution:
Ana-Data was engaged throughout the SDLC (requirements, design, development, testing, deployment and support) of this system.
The key components are:
1. Source Databases: Contain transactional data distributed across multiple Oracle databases, referred to as Nodes.
2. Stage Layer: All the data in the Nodes are consolidated in the Stage Layer. No aggregation is done in this layer.
3. Hadoop Base Layer: Data is moved as-is from the Stage Layer to the Hadoop Base Layer using a customized replication framework.
4. Consolidated Layer: A newly introduced layer where aggregations are performed, providing a common platform that serves as the source of prepared ("cooked") data for the reporting needs of multiple applications.
5. Downstream Applications: These are the various applications that cater to the reporting needs of their corresponding business functions. They pull data from the Consolidated Layer and forward it to report consumers in the form of delimited files, using an extraction framework. The report consumers, in turn, may use their own front-end tools to render reports in their required format.
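The final extraction step above can be sketched as follows. This is a hypothetical illustration, not the actual extraction framework: the column names, the pipe delimiter, and the hard-coded sample rows (standing in for a Consolidated Layer query result) are all assumptions.

```python
import csv
import io

def extract_to_delimited(rows, header, delimiter="|"):
    """Render rows pulled from the Consolidated Layer as delimited text,
    ready to be handed off to report consumers as a flat file."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter=delimiter, lineterminator="\n")
    writer.writerow(header)   # column header line for the consumers
    writer.writerows(rows)    # one delimited line per aggregated record
    return buf.getvalue()

# Illustrative usage with made-up aggregated data:
sample = extract_to_delimited(
    [("A1", 17.5), ("A2", 5.0)],
    ["account_id", "total_amount"],
)
```

The report consumers would then load such files into their own front-end tools for rendering.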
Benefits:
The key business benefits can be summarized as follows:
Reduced IT costs
Hadoop being an open-source system, there is no license cost. As it can be set up mostly on commodity hardware, both the initial investment and the cost of scaling the system to handle future high data volumes are kept low.
Highly Scalable and Fault Tolerant
The system is built to handle massive data, and scalability and fault tolerance are inherent. It is set up as a cluster, and new nodes can be added to the cluster to handle the ever-growing volume of data.
Common Reporting Platform
The newly introduced data layer, named "Consolidated", provides a common platform from which all the applications access data for their reporting needs. Prior to its implementation, each application had its own data layer to pull data from. The change has resulted in less maintenance and lower complexity in processing data for the downstream applications.