Preparing Data for Self-Service Analytics
Self-service analytics promises an agile and intelligent environment for business decision making. Conceptually, every level of the organization is empowered with data and can choose the path that is best for the greater good. While the concept is sound, effectiveness depends on the approach. Enabling access to clean data that is faithful to the business it represents is more involved than simply providing access to data. This is where reporting tables come in.
Reporting Tables Prepared for the Data User
Good reporting tables expose a view of the data with the highest-possible level of detail while transforming or omitting dirty data. They allow business users access to data in a way that maximizes the likelihood of insight and minimizes the risk of false or inaccurate conclusions. This element of data modeling drives self-service analytics by helping users understand, trust, and analyze organizational data.
Reporting tables have typically been designed around a star or snowflake schema, but there may be cases where these schemas are neither necessary nor beneficial.
For some businesses, a few large, multi-source reporting tables (or database views) could encompass all the relevant data needed for reporting. Take a young e-commerce business, for example. With items sold in the thousands, an all-encompassing table could list every item of every transaction the business has ever made, including details about the item’s promotion, the transaction, the marketing attribution of the transaction, and the customer. Given the processing power and columnar architecture of modern data warehouses, it is now possible to analyze large tables like this rather than joining several dimension tables onto a fact table.
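As a rough sketch of what such a table might look like, the snippet below builds a line-item-grain reporting table in pandas; all table and column names are hypothetical. Because promotion, attribution, and customer details are denormalized onto each row, a business user can group and summarize without writing any joins.

```python
import pandas as pd

# Hypothetical wide reporting table: one row per item per transaction,
# with promotion, marketing attribution, and customer details on the same row.
order_items = pd.DataFrame({
    "transaction_id":   [1001, 1001, 1002],
    "item_sku":         ["ROSE-12", "VASE-01", "ROSE-12"],
    "quantity":         [2, 1, 1],
    "unit_price":       [39.99, 14.50, 44.99],
    "promotion_code":   ["VDAY10", None, None],
    "utm_source":       ["email", "email", "paid_search"],
    "customer_city":    ["New York", "New York", "Boston"],
    "transaction_date": pd.to_datetime(["2024-02-10", "2024-02-10", "2024-03-02"]),
})

# Every attribute needed for a question like "revenue by marketing source"
# already lives on the row, so the analysis is a single group-by.
revenue_by_source = (
    order_items.assign(revenue=order_items["quantity"] * order_items["unit_price"])
               .groupby("utm_source")["revenue"]
               .sum()
)
print(revenue_by_source)
```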
Using Reporting Tables: An Analogy for Humans
In recent years, the technology behind taking a great photo has improved dramatically, though that alone doesn’t guarantee truly capturing the essence of a moment. Because a photograph is literally a snapshot of data, it provides an apt metaphor for good reporting table design.
It takes a trained eye to understand what a proper canvas for a photo looks like, just as it takes a knowledgeable, comprehensive approach to model a table that exposes clean, useful data. Both seek to capture the desired object(s) within the given landscape while omitting what isn’t necessary. Think of dirty data as the photobombers of reporting tables: corrupting the picture, and best filtered out ahead of time whenever possible.
We can take this real-world analogy a step further. If our intention is to photograph a beautiful rose garden, then black-and-white or sepia filters may not serve our needs, and the representative data will be limited (what color are those roses, anyway?). Conversely, deploying a wide-angle lens that captures the roses along with every other flower, a quaint cottage, the surrounding countryside, and so on provides far more information than is desired. It’s too much to summarize relative to the stated intent.
Likewise, if your business is selling local roses, you need data faithful to everything qualitative and quantitative that bears on optimal returns for your operation: overhead, crop health, and so on. Specific attention to certain fields, such as product demand projections for Valentine’s Day, must be emphasized rather than overlooked, while available but unnecessary metrics, such as the going price of a dozen roses in other states or from a large commercial nursery, are bypassed.
In both examples, allowing for too little data, too much data, or undesirable data invites unwanted noise. Our goal of delivering data with the highest possible value is best served by adhering to these principles of detail.
Reporting Table Design and How to Avoid Errors in Data Preparation
As the examples above show, there are two pitfalls to avoid in modeling reporting tables. You must balance the tradeoff between enabling deeper analysis and the risks associated with dirty data.
Losing too much detail
- Pre-aggregated data creates a false notion that the data cannot be further interrogated, that there are no more details to dig into
- Pre-processing can introduce errors in the data that are not easily discovered
- Important nuances in data collection and business history are lost
At the far end of this type of error is the summary table. Summary tables are given to business users to provide a high-level understanding of the state of the business, but depending on how the data is aggregated, these summaries can make it impossible to understand the true nature of the data.
A sales manager might want to see how sales are trending over time in each sales region. By summarizing the data into regions, you risk a few outlier cities influencing the entire data set. For example, the Northeast region may appear to be trending up when what is actually happening is that a growing New York market is trending up while all other cities are flat or trending down. This false conclusion could lead to missed business opportunities in smaller Northeastern cities.
By providing the high-fidelity data, perhaps at the transaction level, the sales manager can group and summarize the data by city, state, or other meaningful partitions based on the business case.
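The contrast is easy to demonstrate. The sketch below uses a small, made-up Northeast dataset (city names and figures are illustrative only): the regional summary appears to trend up, while the city-level detail shows that only New York is growing.

```python
import pandas as pd

# Illustrative city-level sales for a single region over two months.
sales = pd.DataFrame({
    "month":  ["2024-01", "2024-02"] * 3,
    "city":   ["New York", "New York", "Boston", "Boston", "Hartford", "Hartford"],
    "region": ["Northeast"] * 6,
    "sales":  [100, 160, 80, 75, 60, 55],
})

# Pre-aggregated summary: the region as a whole looks like it is trending up.
print(sales.groupby(["region", "month"])["sales"].sum().unstack())

# Higher-fidelity view: only New York is growing; Boston and Hartford are
# flat or declining, which the regional roll-up hides.
print(sales.pivot_table(index="city", columns="month", values="sales"))
```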
If the data is not curated
- Unnecessary fields create distractions that lead to meaningless conclusions
- Dirty data, collected in error, can lead to inaccurate summaries
- Near-duplicate fields can lead to different conclusions from the same data
- Business users are forced to learn the nuanced relationships between tables and how to join them
Data warehouses often house data from multiple sources, replicated from one system to the next. In this case, it is not uncommon for duplicate fields with similar names to appear. Issues arise when one analyst uses one of these fields in a report and another analyst uses the other. Often the data in the system it was replicated to carries further modifications and updates that the system it was replicated from does not. The same report built on the “same” field from two different sources can then produce conflicting numbers, not to mention disputes between analysts or business units.
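A small, hypothetical illustration of the problem: the same orders replicated from an upstream system into a downstream system that later recorded a refund. Two analysts summing what looks like the same amount field from different sources reach different totals, and neither report looks wrong on its face.

```python
import pandas as pd

# Hypothetical: identical orders replicated from an upstream system (erp)
# into a downstream system (crm); a refund was recorded only downstream.
orders = pd.DataFrame({
    "order_id":   [1, 2, 3],
    "amount_erp": [120.0, 80.0, 45.0],  # as originally replicated
    "amount_crm": [120.0, 80.0, 0.0],   # order 3 refunded after replication
})

# Two near-duplicate fields, two different "total revenue" figures.
print(orders["amount_erp"].sum())  # 245.0
print(orders["amount_crm"].sum())  # 200.0
```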
Creating, socializing, and agreeing upon reporting tables that represent “a single source of truth” within an organization’s data is invaluable for scaling up a self-service analytics practice.
Reporting Table Design Best Practices
Beyond finding the right balance between data detail and data quality, there are several other considerations in modeling usable reporting tables.
Schema Design and the Data Consumer
Many data management professionals regard the star schema as the most effective way to model a reporting data warehouse. There is, however, one additional consideration when modeling your reporting data warehouse for self-service analytics: the business people who will be using the data for analysis, the actual data consumers.
If self-service is truly the goal, then consider which tools the data users will be using. Not all BI tools support joining and merging data easily, and even when they do, many users will not have the skill to do more than group and pivot a single table. For this reason, a star or snowflake schema may make some analyses impossible for some users. Rather than requiring business users to join dimension tables onto a fact table, consider providing pre-joined tables to your end users. This way, less capable SQL users can use their favorite BI tool to manipulate the data and still understand the wider relationships in the data.
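One way to do this, sketched below with pandas and hypothetical star-schema tables, is to materialize the joins ahead of time so the consumer receives a single flat table to group and pivot.

```python
import pandas as pd

# Hypothetical star schema: one fact table and two dimension tables.
fact_sales = pd.DataFrame({
    "transaction_id": [1, 2, 3],
    "product_id":     [10, 10, 20],
    "customer_id":    [100, 101, 100],
    "quantity":       [2, 1, 3],
})
dim_product = pd.DataFrame({
    "product_id":   [10, 20],
    "product_name": ["Rose Bouquet", "Glass Vase"],
    "unit_price":   [39.99, 14.50],
})
dim_customer = pd.DataFrame({
    "customer_id": [100, 101],
    "city":        ["New York", "Boston"],
})

# Pre-join the dimensions onto the fact table so end users never have to
# know or reproduce the join logic in their BI tool.
reporting_sales = (
    fact_sales.merge(dim_product, on="product_id", how="left")
              .merge(dim_customer, on="customer_id", how="left")
)
print(reporting_sales)
```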
Performance and Agility
A dataset’s value and utility are multiplied when it is combined with other related but disparate datasets. Unfortunately, this is where schema design best practice loses its practicality, and performance can become an issue. Creating reporting tables requires balancing not only detail and curation, but also speed and comprehensiveness.
The more datasets are joined together, the more time it takes to create a reporting table view. As a best practice, merge datasets at runtime. But when the processing time becomes a blocker to the agile analyst, it may be time to separate a single reporting table into smaller data domains, each containing data that is frequently analyzed together, that can still be easily joined when needed. Read more in the discussion of Runtime vs Preprocessing.
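As an illustration of that split, the sketch below keeps two hypothetical domain tables (sales and marketing) separate so that most reports stay small and fast, and merges them at runtime only when a question spans both domains.

```python
import pandas as pd

# Hypothetical data domains that share a key and can be joined when needed.
sales_domain = pd.DataFrame({
    "transaction_id": [1, 2, 3],
    "revenue":        [79.98, 39.99, 43.50],
})
marketing_domain = pd.DataFrame({
    "transaction_id": [1, 2, 3],
    "utm_source":     ["email", "paid_search", "email"],
    "ad_spend":       [1.20, 4.75, 0.90],
})

# Most questions touch a single domain; a cross-domain question such as
# "revenue by marketing source" merges the two tables only at runtime.
cross_domain = sales_domain.merge(marketing_domain, on="transaction_id")
print(cross_domain.groupby("utm_source")["revenue"].sum())
```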
Statistics: To Aggregate or Not to Aggregate
For every collection of transactions, events, or objects that a table represents, there are an infinite number of statistics that could be paired with it. In the example of the e-commerce reporting table, a row that represents a product within a transaction could be accompanied by the MSRP and the quantity of items sold in the transaction.
Use caution when including pre-aggregated statistics in reporting tables. In the example above, it would make sense to report an average price for that item in a given month, but an average of the quantity of items sold in the corresponding transaction would be inaccurate for transactions that contained two of the same item. When creating fine-grained reporting tables, include metrics that are easily and sensibly aggregated by the end user, and use a data dictionary to note the ones that are not.
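The pitfall is easiest to see with a toy example. In the hypothetical line-item table below, items_in_transaction is a transaction-level statistic repeated on every row of the same transaction; a naive mean over rows over-weights multi-item transactions, while collapsing to one row per transaction first gives the intended figure.

```python
import pandas as pd

# Hypothetical line-item reporting table: one row per product per transaction,
# with a transaction-level count repeated on each of that transaction's rows.
line_items = pd.DataFrame({
    "transaction_id":       [1, 1, 2],
    "item_sku":             ["ROSE-12", "VASE-01", "ROSE-12"],
    "unit_price":           [39.99, 14.50, 44.99],
    "items_in_transaction": [2, 2, 1],
})

# Averaging a per-row metric such as unit_price is sensible at this grain.
print(line_items["unit_price"].mean())

# Naively averaging the repeated transaction-level column gives
# (2 + 2 + 1) / 3 = 1.67, over-weighting the multi-item transaction ...
print(line_items["items_in_transaction"].mean())

# ... while the true average items per transaction is (2 + 1) / 2 = 1.5,
# obtained by collapsing to one row per transaction first.
print(line_items.drop_duplicates("transaction_id")["items_in_transaction"].mean())
```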
Teaching Data Users to Fish
In fostering self-service analytics, the demands on the data manager and data analyst go beyond data modeling and into the realm of education and evangelism. Reporting tables are only as valuable as the reports that users create from them. You must secure agreement from all stakeholders, from data collectors and creators to data consumers, that the reporting table accurately represents the data as a single source of truth. From there, you must properly define the incoming data by employing a data dictionary, and train users to apply it correctly, mitigating the risk of drawing false conclusions.