Table of Contents
Big Data is not a single technology, but a combination of old and new technologies that help companies gain actionable insight. Big Data is therefore the ability to manage a large volume of disparate data, at the right velocity, and within the right time frame to enable real-time analysis and reaction.
1 Introduction to Big Data
Data management and analytics have always offered the greatest benefits and the greatest challenges for organizations of all sizes and across all industries. Companies have long struggled in the search for a pragmatic approach to capturing information about their customers, products, and services.
Big Data: Introduction
Businesses have long sought to optimize the management of their data, both for operational tasks and to monetize the information they could extract from it. Nowadays, the amount of data available is immense, of very diverse origins and types, and generated at great speed. This is what has led to the coining of the term Big Data, not because it is a new technology, but because of the volume, speed and variety of data to be managed.
This concept has very specific requirements in terms of performance and architecture. These requirements have been met by taking advantage of the evolution of technology, which has caused the cost of hardware and software to decrease at the same time as their capacity has increased. Among other technologies that have contributed to making Big Data a reality are virtualization, cloud computing and, above all, distributed computing, which is the basis for all the others.
2 Technological Foundations of Big Data
As we have seen in the previous chapter, Big Data deals with streams of large volumes of data, often at high velocity and with very different formats. Many experienced software architects and developers know how to deal with one or even two of these situations quite easily.
For example, if faced with large volumes of transactional data and fault-tolerance requirements, it is possible to deploy redundant relational database clusters in a data center with a very fast network infrastructure. Similarly, if the requirement is to integrate different types of data from many known and anonymous sources, the choice might be to build an extensible meta-model in a custom data warehouse.
However, we may not be able to afford such specific implementations in the much more dynamic world of Big Data. When we move outside of the world in which we own and fully control our data, it becomes necessary to create an architectural model that addresses this kind of hybrid environment. This new environment requires an architecture that understands both the dynamic nature of Big Data and the need to apply its insights to a business solution. In this unit, we examine the architectural considerations associated with Big Data.
Big Data: Technology Fundamentals
The infrastructure to implement a Big Data solution is very complex, so to simplify the concepts we resort to a model divided into layers where each layer performs specific functions. These layers are:
- Physical Infrastructure
- Security Infrastructure
- Interfaces
- Operational Databases
- Data Services and Tools
- Analytical Data Warehouses
- Big Data Applications
This structure has very demanding requirements in terms of scalability, elasticity, availability and performance. These requirements can be met with technologies such as Cloud Computing. In turn, Cloud Computing is based on virtualization technologies that provide it with its most important characteristics.
3 Big Data Management with Apache Hadoop
Big Data is becoming an important element in the way organizations leverage large volumes of data at the right speed to solve specific problems. However, Big Data systems do not live in isolation. To be effective, companies often need to be able to combine the results of Big Data analytics with data that exists within the enterprise.
We will now take a closer look at today’s most widely used Big Data environment, Apache Hadoop.
Big Data & Apache Hadoop
As noted above, Big Data systems do not live in isolation: to be effective, companies often need to combine the results of Big Data analytics with data that already exists within the enterprise. In other words, you cannot think of Big Data volumes in isolation from operational data sources. There are a variety of important operational data services. In this unit, we explain what these sources are and how they can be used in conjunction with Big Data solutions.
A fundamental component of any data processing and management solution is the database. The same is true when the solution has to process huge amounts of data: databases are still necessary. However, due to the specific characteristics of this type of solution, it is very likely that the most suitable database engine will not be the same as in traditional solutions.
The database engines used in Big Data environments must provide the capabilities these solutions require: speed, flexibility, fault tolerance, and so on. The emergence of Big Data has led to the development of engines different from the traditional ones, which were based on the relational model. Among these engines are key-value (KVP) stores, document databases such as MongoDB and CouchDB, column-oriented stores such as HBase, and graph databases.
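To make the contrast with the relational model concrete, the following minimal sketch (plain Python standing in for a real engine such as a key-value store or MongoDB; the record fields and the `find` helper are invented for illustration) shows the difference in how data is accessed: a key-value store can only look values up by their key, while a document store understands the structure of each document and can filter on any of its fields.

```python
import json

# --- Key-value (KVP) style: opaque values addressed only by key ---
kv_store = {}
kv_store["customer:42"] = json.dumps({"name": "Ana", "city": "Madrid", "orders": 3})

# Retrieval is only possible if you already know the key.
record = json.loads(kv_store["customer:42"])
print(record["name"])  # -> Ana

# --- Document style: the engine understands the structure of each document ---
documents = [
    {"_id": 1, "name": "Ana",  "city": "Madrid",    "orders": 3},
    {"_id": 2, "name": "Luis", "city": "Barcelona", "orders": 1},
]

def find(collection, **criteria):
    """Tiny stand-in for a document-store query (hypothetical helper)."""
    return [doc for doc in collection
            if all(doc.get(k) == v for k, v in criteria.items())]

# Queries can filter on any field of the document, not just on a key.
print(find(documents, city="Madrid"))  # -> [{'_id': 1, 'name': 'Ana', ...}]
```

Which of these models fits best depends on the access patterns of the solution, which is precisely why Big Data environments rarely settle on a single engine.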
In addition to the data storage system, the other important component in a Big Data solution is the processing engine. In this unit we have studied in more detail the processing engine of the most widespread Big Data solution today: Hadoop MapReduce. The combination of the "map" and "reduce" functions is what gives MapReduce its power and flexibility. Put very simply, it works by dividing large amounts of data into smaller portions and distributing them to be processed in parallel on multiple nodes of a cluster.
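As an illustration of the model (not the actual Hadoop Java API), the following Python sketch simulates the three phases of a MapReduce word count: a map function that emits (word, 1) pairs from each input split, a shuffle step that groups the pairs by key, and a reduce function that sums the counts for each word. The sample splits are invented for the example.

```python
from collections import defaultdict

def map_phase(split):
    """Map: emit a (key, value) pair for every word in one input split."""
    for word in split.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)

# The input is divided into smaller portions ("splits"); on a real cluster
# each split would be processed by a mapper running on a different node.
splits = ["big data is not a single technology",
          "big data deals with large volumes of data"]

pairs = [pair for split in splits for pair in map_phase(split)]
grouped = shuffle(pairs)
counts = dict(reduce_phase(k, v) for k, v in grouped.items())
print(counts["data"])  # -> 3
```

In Hadoop itself, the mappers and reducers run as distributed tasks and the shuffle is handled by the framework, but the division of work into these phases is exactly what allows the computation to scale out across the cluster.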