As the use of data warehouses and business intelligence tools proliferates, managers are discovering that they cannot ignore the quality of the data flowing into these systems. The 80/20 rule for data quality that was justifiable during the deployment phase becomes a significant roadblock as the system matures. Unfortunately, analysts have become accustomed to “fixing” the data in spreadsheets before making it ready for consumption. Returning to the point of entry for the data is typically beyond the scope of those most impacted. The fix usually resides in another organizational silo or even with an external source. And the larger the company, the further apart the silos that need to cooperate (in time and budget) to fix the issues. Because Excel and a little elbow grease can often fix the problems, the messy-data condition gets kicked down the road.
But who likes to clean things up? It’s more satisfying to make progress towards your business objectives than to get your house in order. That position only holds true until the issue gets in the way of that progress. This is especially true when management makes a bad/costly decision based on inaccurate metrics.
Yet an even more consequential, and negative, impact of the data chaos is the loss of productivity in your workforce. Analysts who spend cycles fixing issues are not spending cycles driving the business. In this age of automation, a manager can no longer simply state “that’s just the way we do things here.” There are solutions, if only management could determine how best to prioritize them.
Fixing broken data has a second-order impact: employee turnover. The analysts who are capable of efficiently fixing broken data did not pursue their career path to do data-hygiene tasks. The work is neither particularly interesting nor satisfying. If an analyst is expected to sustain regular hours, days, or even weeks of data-fixing tasks, their job satisfaction is going to plummet. These tasks can be mind-numbing, frustrating work. Management needs to consider the cost of losing and replacing analysts when prioritizing data-integrity projects. The lost institutional knowledge and the impact on business objectives while a new hire is trained can outweigh a modest investment in improving data integrity.
Finally, not improving the quality of inbound data also precludes using that data for more advanced applications such as machine learning and predictive analytics. What could have been a tool to differentiate and accelerate the business is relegated to a reporting tool that can only look into a murky rear-view mirror. Meanwhile, the competition, especially competitors with less legacy data, can leverage these advanced tools and establish a sustainable advantage.
So, “garbage in, garbage out” can be mitigated by some heavy-lifting analysis on the output if you are willing to sacrifice team resources. But that strategy cannot scale, nor can it leverage the more advanced tools available to accelerate the business.
Upstream quality control is a precursor to downstream data automation projects. Within the QC program are at least three categories that need to be considered:
Inbound data often originates from external entities. There is little or no authority over those entities and thus no means to enforce accountability upstream. What you get is what you need to use. So, assume it will be inconsistent and volatile in its quality. This is especially true if the external source is executing any manual manipulation of the data before passing it onwards. A process that allows for manual adjustments cannot be assumed to be 100% stable. You must assume volatility in the data quality and design your workflow accordingly.
This process begins with Automatic Issue Identification and Routing. In the case of an inbound file, there are standard assessments that can be executed using commercially available workflow tools. These assessments start at the top level and work their way down in granularity.
First, the integrity of the data file should be assessed before the data within the file. These tests include:
The comparative assessment for these categories could change based on the originating source of the data. If the originating source is a manual upload, the assessing system should provide immediate feedback to the originator so they can take immediate action to rectify the issue. Typically, the originator will recognize their fault and act. If the file is pulled and fails these tests, an alert notifies an analyst to take action on the file. Ideally, every alert scenario is an opportunity to deconstruct the issue and design automated repairs where possible. If repair is not possible, then tracking the frequency of the issue and notifying the originator of the trend provides evidence that may influence a fix on their side. Evidence trumps opinion in this case.
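As a rough illustration, a few file-level checks of this kind can be scripted before the contents are ever opened. The specific rules, file names, and duplicate-detection approach below are assumptions for the sketch, not a prescribed standard.

```python
import hashlib
from pathlib import Path

def file_level_checks(path: Path, expected_prefix: str, seen_checksums: set) -> list:
    """Illustrative checks on the file itself, before its contents are parsed."""
    issues = []
    if not path.exists():
        return ["file missing"]
    if path.stat().st_size == 0:
        issues.append("file is empty (zero bytes)")
    if not path.name.startswith(expected_prefix):
        issues.append(f"unexpected file name: {path.name}")
    digest = hashlib.sha256(path.read_bytes()).hexdigest()
    if digest in seen_checksums:
        issues.append("exact duplicate of a previously received file")
    seen_checksums.add(digest)
    return issues

# Routing: manual uploads get immediate feedback to the originator;
# pulled files that fail raise an alert for an analyst.
issues = file_level_checks(Path("inbound/sales_20240101.csv"), "sales_", seen_checksums=set())
if issues:
    print("Route for action:", issues)
```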
Next, the integrity of the internal format of the file is assessed. This typically involves a workflow to determine which tests to execute based on the originator and filetype. These tests typically include:
Other file formats have their own unique tests at this level. The data team should brainstorm scenarios where issues could arise so the proper tests can be designed. This can be a very satisfying meeting as team members share their war stories.
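For a delimited file, a minimal sketch of these format-level checks might look like the following. The expected header and column layout are assumptions for the example.

```python
import csv

EXPECTED_HEADER = ["order_id", "customer_id", "order_date", "amount"]  # assumed layout

def csv_format_checks(path: str) -> list:
    """Illustrative format-level checks for a comma-delimited file."""
    issues = []
    with open(path, newline="", encoding="utf-8", errors="replace") as f:
        reader = csv.reader(f)
        header = next(reader, None)
        if header is None:
            return ["file has no header row"]
        if header != EXPECTED_HEADER:
            issues.append(f"unexpected header: {header}")
        for line_no, row in enumerate(reader, start=2):
            if len(row) != len(EXPECTED_HEADER):
                issues.append(f"line {line_no}: expected {len(EXPECTED_HEADER)} fields, found {len(row)}")
    return issues
```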
Next, the data format should be assessed. This can be a complex test sequence and have many branches based on the content and the context. A file on Monday may have different standards than a Tuesday file. A variety of tests can be executed:
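As an illustration, a few field-format tests might look like the sketch below. The field names and rules are assumptions, and in practice they would branch on originator and context as described above.

```python
import re
from datetime import datetime

def field_format_checks(record: dict) -> list:
    """Illustrative field-format tests; real rules would branch on originator and context."""
    issues = []
    if not re.fullmatch(r"\d{6}", record.get("order_id", "")):
        issues.append("order_id is not a six-digit identifier")
    try:
        datetime.strptime(record.get("order_date", ""), "%Y-%m-%d")
    except ValueError:
        issues.append("order_date is not in YYYY-MM-DD format")
    try:
        if float(record.get("amount", "")) < 0:
            issues.append("amount is negative")
    except ValueError:
        issues.append("amount is not numeric")
    return issues

print(field_format_checks({"order_id": "12345", "order_date": "2024-02-30", "amount": "abc"}))
```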
The next level of assessment is highly contextual. This is driven by the originator, the filetype, and prior data. A file and data format may check out, but following this deeper assessment it may be found that part of the file includes records that are duplicates from prior inputs. Or, a record uses a primary key value already used by a different record. Maybe a date is outside of an acceptable range based on input from a prior load. Or a zipcode is inconsistent with a state code. The scenarios are endless. Nonetheless, if the data-cleansing team collaborates and documents the analyses they have had to execute to fix past data, a decent set of tests can be quickly designed.
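Two of the contextual checks mentioned above, a primary-key collision and a ZIP/state mismatch, could be sketched roughly as follows. The reference data and the set of keys from prior loads are assumed inputs.

```python
def contextual_checks(record: dict, known_keys: set, zip_to_state: dict) -> list:
    """Illustrative contextual checks against prior loads and reference data."""
    issues = []
    key = record.get("order_id")
    if key in known_keys:
        issues.append(f"primary key {key} was already used by a prior record")
    expected_state = zip_to_state.get(record.get("zip"))
    if expected_state and expected_state != record.get("state"):
        issues.append(f"ZIP {record.get('zip')} is inconsistent with state {record.get('state')}")
    return issues

print(contextual_checks(
    {"order_id": "123456", "zip": "30301", "state": "NY"},
    known_keys={"123456"},
    zip_to_state={"30301": "GA"},
))
```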
It could be assumed that portions of the records within the file are acceptable and could be imported. But unless the rejected records can be successfully repaired via internal processes, it is highly improbable that a workflow will be able to successfully monitor and manage the originator as they repair the issues. The next inbound file will have to be reconciled with the first, and that reconciliation can become complex and chaotic. It may be best to establish a threshold for the count of issues per type and reject the file if it exceeds that value, then import the fixed file as a complete, first-entry task. Only in cases where timeliness of the ultimate output is the most critical driver (e.g. quarterly reporting, client deadlines) should a partial import and future reconciliation process be used.
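A minimal sketch of that accept/reject decision, with an assumed threshold and a timeliness override, might look like this:

```python
ISSUE_THRESHOLD = 25  # assumed per-file limit; tune to the business context

def accept_or_reject(issue_count: int, deadline_critical: bool) -> str:
    """Whole-file reject above the threshold unless timeliness forces a partial import."""
    if issue_count == 0:
        return "import"
    if issue_count <= ISSUE_THRESHOLD:
        return "import with internal repairs"
    if deadline_critical:
        return "partial import, flag for future reconciliation"
    return "reject file and return to originator for a complete re-submission"
```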
Nonetheless, contextual data-level issues are also opportunities to improve the workflow and design automated fixes. If the fix is simple (e.g. Change all “Y” to “YES”) then there should be an automated workflow to manage it. This may also be part of the downstream ETL process and be a trivial fix. Assume continual development and use every manual fix exercise as a trigger to build automated solutions.
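A simple automated repair of the “Y” to “YES” variety can be little more than a lookup table applied per record, for example (the mapping here is an assumed one):

```python
VALUE_FIXES = {"Y": "YES", "N": "NO"}  # assumed mapping of known value variants

def apply_simple_fixes(record: dict, field: str) -> dict:
    """Normalize known value variants in one field, leaving everything else untouched."""
    fixed = dict(record)
    value = record.get(field)
    fixed[field] = VALUE_FIXES.get(value, value)
    return fixed

print(apply_simple_fixes({"opt_in": "Y"}, "opt_in"))  # {'opt_in': 'YES'}
```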
There are a variety of approaches to manage the assessment-and-fix workflow, but simplicity is typically the most judicious. Given the present low cost of compute and storage, parsing each data object into individual records is recommended, and it is a relatively simple step in an ETL application. This allows the tests to be managed using the metadata associated with each record. Scoring can also be written into the metadata without requiring complex linking and tracking. Following this testing phase, the original record can be reconstituted using that metadata. If an automated repair has been developed for the issue identified, it executes only against the individual record and does not expose other records to similar, potentially inappropriate edits. Additionally, the design of the workflow can be simpler, since the object being managed is an individual record.
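A rough sketch of that record-plus-metadata approach, assuming a simple dictionary structure for each record, might look like this:

```python
def parse_to_records(rows: list, source: str) -> list:
    """Wrap each row with its own metadata so tests and scores attach to the record itself."""
    return [
        {"payload": row, "meta": {"source": source, "row_number": i, "issues": [], "score": None}}
        for i, row in enumerate(rows, start=1)
    ]

def score_record(record: dict, tests: list) -> dict:
    """Run each test against the payload and write the results into the metadata."""
    for test in tests:
        record["meta"]["issues"].extend(test(record["payload"]))
    record["meta"]["score"] = 0 if record["meta"]["issues"] else 1
    return record

records = parse_to_records([{"opt_in": "X"}], source="vendor_a")
tests = [lambda payload: [] if payload.get("opt_in") in ("YES", "NO") else ["unexpected opt_in value"]]
scored = [score_record(r, tests) for r in records]
```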
Another benefit of parsing the data into individual records is the simplicity of reporting. When aggregated, the issues can be collected and grouped by data type, field, and originator. With this in hand, action can be taken to discuss solutions with the originator or to invest in more automated fixing processes.
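Once issues live in per-record metadata, the aggregation is a straightforward group-by. The sketch below uses pandas and a small inline sample in an assumed record shape:

```python
import pandas as pd

# Small inline sample in an assumed record/metadata shape.
records = [
    {"meta": {"source": "vendor_a", "field": "amount", "issues": ["amount is not numeric"]}},
    {"meta": {"source": "vendor_a", "field": "amount", "issues": []}},
    {"meta": {"source": "vendor_b", "field": "order_date", "issues": ["order_date is not in YYYY-MM-DD format"]}},
]
issue_rows = [
    {"originator": r["meta"]["source"], "field": r["meta"]["field"], "issue": issue}
    for r in records
    for issue in r["meta"]["issues"]
]
report = (
    pd.DataFrame(issue_rows)
    .groupby(["originator", "field", "issue"])
    .size()
    .reset_index(name="count")
    .sort_values("count", ascending=False)
)
print(report)
```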
Finally, parsing the data at this stage of the assessment does not impact the original record. Therefore, the original record can be re-parsed if tests reveal the parsing was unsuccessful. And if the file passes all tests with no fixes required, the workflow can simply act on the original file.
Mistakes happen. Issues arise. Systems fail. A data warehouse should not assume perfection at all levels. Therefore, an exception-entry procedure should be implemented where most appropriate. This involves inserting the adjusted value or an “ignore” flag along with a timestamp, the entity that made the entry, a reason code or description (it could have been a programmatic fix and not a person), and an active flag (so future edits can take precedence). Then, instead of querying the original table directly, build a view that joins the original table to this fix table with the appropriate logic. That view becomes the new source of truth for all downstream procedures.
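The join precedence such a view would implement can be sketched as follows. The table and column names are assumptions, and in a warehouse this would typically be expressed as a SQL view rather than in pandas:

```python
import pandas as pd

# Assumed shapes for the original table and the fix table described above.
original = pd.DataFrame({"record_id": [1, 2, 3], "amount": [100.0, -5.0, 42.0]})
fixes = pd.DataFrame({
    "record_id": [2],
    "adjusted_amount": [5.0],
    "ignore": [False],
    "entered_by": ["etl_autofix"],
    "reason": ["sign flipped at source"],
    "active": [True],
})

# The "view": active fixes take precedence over original values,
# and records flagged ignore are excluded from downstream consumption.
active_fixes = fixes[fixes["active"]]
view = original.merge(active_fixes, on="record_id", how="left")
view["amount"] = view["adjusted_amount"].fillna(view["amount"])
view = view[view["ignore"] != True][["record_id", "amount"]]
print(view)
```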
Formalizing the adjustments into this format allows for historically-bad data to be entered wholesale and managed. Additionally, any adjustments are tracked for audit purposes (formal or internal). And, at the output UI these adjustments could be queried and highlighted if relevant to the certification processes in place for that report. If the organization is exposed to an audit, this trail of adjustments provides immediate transparency and can mitigate further investigation.
Upstream QC is the first step in enabling automation of downstream reporting and other data applications. It is an elephant, but it can be tackled if approached systematically. Only once this QC is in place should the automation program for data assessment be considered. Topics for automation include Designing for End-State Metrics, Contextual Management, and Standardized Algorithms and Models. Concurrently, a program for Data Integrity should be engaged. High-level topics to ensure data integrity include Procedural Integrity, Organizational Controls, and Auditability (as a touchstone to guide and prioritize, even if not formalized). All of these need to be considered in varying degrees in order to stand up an efficient and productive data operation. Once in place, the organization will begin to enjoy a level of agility in exploring extraordinary capabilities that eluded it in its friction-filled past.