Dimensions for Characterizing Analytics Data-Processing Solutions
Jari Koister
Version 1, August 2015
Introduction
Data analytics and business intelligence have been of interest for a long time, and their needs have been addressed by various technologies built on relational databases and cubes. Columnar databases and in-memory computing were developed to deal with increasing demand for quick response and interactive analytics. In recent years Hadoop was introduced to deal with very large datasets, which were previously difficult to handle with scale-up solutions. Even more recently, streaming and in-memory processing solutions have been introduced on inherently scale-out architectures. These solutions have been developed to deal with both speed and scale.
For a software architect or data analyst, it is notoriously difficult to determine which solution makes the most sense for a particular data and computation problem. It is made even more difficult because problems often evolve as we start working on them.
As an example, we often initially want some ad hoc analytics to explore and understand the data. Once we determine some interesting models, we often want to execute them on a regular basis against data arriving in the system. These two use cases are quite different. Another example is that the data are sometimes slow-moving, while in other instances new data arrive at high velocity.
There is clearly no one technology that provides an optimal solution for all problems. Traditionally, organizations may have multiple technologies and systems dealing with different business problems. They may have a cube solution for business analytics while having a columnar-store solution for ad hoc queries on a subset of data. I call this the parallel stack strategy. It allows an organization to meet different needs by providing different solutions, each with its own special advantages and disadvantages. The downside is that each solution has its own system of record; hence, there is no organization-wide system of record. This usually also means that no solution can host all or even a large part of all the data used within the organization.
There are multiple emerging architectures for big data analytics. A unifying pattern is that they have one data store that acts as the system of record and multiple stores that provide indexing and serving layers for different use cases. The idea is that the system of record is based on one of the modern scale-out architectures (typically HDFS) that allows an organization to host all data in one place in an economical way. Each serving component hosts a subset of the data or processed results that solve a specific use case, again with its specific advantages and disadvantages. I call this strategy the lambda strategy after the reasonably well-known lambda architecture. I depict the principles below.
[pic 1]
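As a rough illustration of the lambda strategy, the following Python sketch keeps every raw record in a single system of record and derives two separate serving views from it, one per use case. The class `SystemOfRecord`, the two view-building functions, and the record fields (`customer`, `country`, `amount`) are all hypothetical names chosen for this example. An in-memory list stands in for a scale-out store such as HDFS, so this is a sketch of the principle and not of any particular product.

```python
from collections import defaultdict


class SystemOfRecord:
    """Append-only store holding every raw record.

    Assumption: an in-memory list stands in for a scale-out store.
    """

    def __init__(self):
        self._records = []

    def append(self, record):
        # Copy the record so later changes by the caller do not alter history.
        self._records.append(dict(record))

    def scan(self):
        # Serving layers read the full history through a scan.
        return iter(self._records)


def build_revenue_by_country(store):
    """Serving view for one use case: total revenue per country."""
    view = defaultdict(float)
    for rec in store.scan():
        view[rec["country"]] += rec["amount"]
    return dict(view)


def build_orders_by_customer(store):
    """Serving view for another use case: number of orders per customer."""
    view = defaultdict(int)
    for rec in store.scan():
        view[rec["customer"]] += 1
    return dict(view)


store = SystemOfRecord()
store.append({"customer": "c1", "country": "SE", "amount": 10.0})
store.append({"customer": "c2", "country": "US", "amount": 5.0})
store.append({"customer": "c1", "country": "SE", "amount": 2.5})

# Each view holds only the processed subset that its use case needs.
revenue_view = build_revenue_by_country(store)
orders_view = build_orders_by_customer(store)
```

Both views can be discarded and rebuilt from the system of record at any time, which is what lets each serving component be chosen for its own strengths without becoming a second source of truth.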
In the parallel stack strategy, as well as the lambda strategy, one needs to understand and pick the correct storage and serving technologies. In this paper we will discuss a framework for understanding the data at hand and the different technologies available, as well as the trade-offs involved in selecting one of them. Our goal is to characterize the kinds of problems that form the sweet spot for each solution and to identify problems that specific solutions do not deal with particularly well. We do not expect the framework to be 100% complete, but we believe it will help in understanding what aspects are important in comparing and selecting technologies for analytics processing.
Data
It all starts with the data. A lot has been said about various ways of characterizing data, so we are not necessarily breaking new ground here.
The first basic characterization is the structure of the data. It can be derived from a relational model with a well-defined number of attributes for each record. Each attribute has a carefully selected type such as DATE, NUMBER, or URL. This is called highly structured data.
Unstructured data is commonly text for which we may need advanced techniques such as NLP to detect structure and meaning. Sometimes these texts are not even written in proper language but rather in slang, as is common in tweets.
It is, of course, a continuum. I like to use the term semi-structured for data that has somewhat of a structure, such as JSON documents. In this case there is an expected ordering or nesting of data, but many elements may be optional, making the data tricky to parse. And the data in each attribute may not be explicitly type-checked when inserted, making it impossible to rely on properly formed data during the processing.
Another type of mixed model is when you have structured data but one of the attributes contains unstructured data such as text.
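To make the semi-structured case concrete, here is a minimal Python sketch of defensive parsing for a JSON record in which some elements are optional and no attribute was type-checked on insert. The function `parse_purchase` and the field names (`amount`, `customer`, `country`, `date`) are assumptions made for this example. Only the standard-library `json` and `datetime` modules are used.

```python
import json
from datetime import date


def parse_purchase(raw):
    """Return a cleaned dict, or None if the required field is unusable."""
    try:
        doc = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(doc, dict):
        return None

    # Required attribute: nothing enforced its type on insert, so check here.
    try:
        amount = float(doc["amount"])
    except (KeyError, TypeError, ValueError):
        return None

    # Optional nested element: it may be missing or may not be an object.
    customer = doc.get("customer")
    country = customer.get("country") if isinstance(customer, dict) else None

    # Optional attribute that should be an ISO date but may be malformed.
    try:
        day = date.fromisoformat(doc.get("date", ""))
    except (TypeError, ValueError):
        day = None

    return {"amount": amount, "country": country, "date": day}


complete = parse_purchase(
    '{"amount": "12.50", "customer": {"country": "SE"}, "date": "2015-08-01"}'
)
sparse = parse_purchase('{"amount": 3}')
rejected = parse_purchase('{"amount": "n/a"}')
```

The design choice in this sketch is to reject a record only when a required attribute cannot be recovered and to fill optional attributes with `None`. Whether that is the right policy depends on the use case.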
Size is a central property. Not all solutions can deal with very large datasets. But not every problem involves very large sets of data. So understanding the size of the data that you are dealing with is important. There is no reason to limit yourself to certain solutions if you do not need to.
Another important aspect of data concerns its velocity. We define velocity as the rate at which data are added to a dataset or existing data are changed. An example of adding data is new purchases made online. An example of changing data is second-granularity updates to a sensor reading in an airplane.
Some data are updated on a weekly or monthly basis. Other data are changing on an hourly or second-by-second basis. In the most extreme cases data are updated multiple times a second.
We will distinguish between data sink latency and data source latency. Data sink latency concerns the velocity of data as they come into your analytics systems. Data source latency concerns the freshness of the data generated from any analytics processes you have in place. Note that this is different from query latency, which concerns the time to answer a specific question. Because the data used to answer queries cannot be fresher than the data they are based on, the data source latency can be no lower than the data sink latency.
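One way to make the two latencies measurable is sketched below in Python. It rests on an assumption that goes beyond the text: both latencies are measured from the moment an event occurs, with sink latency ending when the event is ingested and source latency ending when a result derived from it is published. The function names and the timestamps are hypothetical.

```python
from datetime import datetime, timedelta


def sink_latency(event_time, ingested_at):
    """Delay between an event occurring and its arrival in the analytics system."""
    return ingested_at - event_time


def source_latency(event_time, published_at):
    """Delay between an event occurring and a derived result being available."""
    return published_at - event_time


# Hypothetical timeline for a single event.
event_time = datetime(2015, 8, 1, 12, 0, 0)
ingested_at = event_time + timedelta(seconds=30)     # arrives after 30 seconds
published_at = ingested_at + timedelta(minutes=10)   # processing adds 10 minutes

sink = sink_latency(event_time, ingested_at)
source = source_latency(event_time, published_at)

# Processing can only add delay, so the derived data are never
# fresher than the data that were ingested.
never_fresher = source >= sink
```

Under these definitions, query latency would be a third and separate measurement: the time taken to answer a question against results that have already been published.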
...
...