Planning Your Quality Control Process

Last updated
Save as PDF

Robust quality control measures are critical to the success of any data integration, especially those that are automated. Effective quality control measures will ensure that you've transferred all records and data points, and that your data is appropriately formatted to enable subsequent operations in local systems.

From a quality control perspective, there are three main measures of a successful data integration:

The data you've transferred data enables the completion of your desired task.
You've transferred all records: person and application.
You've transferred all data points.

What to Look For

The most important thing to check is whether staff at your institution can complete the necessary tasks with the data after it has been transferred. If, for example, your goal is to perform admissions review on completed applications transferred from a CAS, your admissions staff should be able to use your local system to review all the data necessary to complete their review. If the admissions staff can't complete their review because data are missing, then, obviously, the data integration has not been successful.

When checking to see that you've successfully transferred all records, consider the following points. Keep in mind that two types of records will be transferred: person records and application records. For any record, you'll want to check that there is no duplication (e.g., Jane Doe exists twice in the system, Jane Doe's MS in mechanical engineering application exists twice, etc.). You'll also want to check that there is no unwanted overwriting of existing records (e.g., Tom Smith's nursing application replaced his physician assistant application, etc.). It's also important to keep an eye out for records that fail to transfer. Failure to transfer can result from a failure to export the appropriate records or a failure to load the exported records.

When reviewing the loaded data for successful transfer, consider the following points. You'll want to make sure that all of the desired data points are being successfully exported and loaded. Look over your exports to make sure all the expected fields are present. Check your imports to make sure all delivered fields are loaded. It's vital to check that the loaded data assumes the proper form: if values need to be mapped to specific values in your local systems, make sure that happened; if values need to appear in a specific format (e.g., dates), make sure that has happened. You'll want to ensure that the data points have been loaded to the correct records (e.g., Harry Jones' contact information shouldn't appear in Alice Callahan's record). Additionally, check that the data points have been loaded to the proper fields (e.g., Shirley Davis' email address shouldn't appear in her last name field). It's also important to check that incoming data hasn't created duplicate entries (e.g., multiple identical mailing address entries) or overwritten sensitive data (e.g., replacing a person's full middle name with just a middle initial).

Important Note: it's easier to find errors than it is to find missing data. Missing data doesn't look wrong because it doesn't look like anything! It's important to ensure that no data is missing in the destination system, in addition to whatever checks you do to ensure that your data has loaded properly.

How to Look

Spot checking individual records is error-prone and can be time-consuming, especially when a lot of data is being transferred. For these reasons, it's preferable to use programmatic quality control checks performed by computers. You'll always need to have someone review the results, but a computer can check a lot of records very quickly and provide a summary. A programmatic quality control review tool can be built with any software on hand, from Excel to database tools to coding languages. The basic framework is to check that the expected data has been loaded as expected. It's important that this check be performed independently of the mechanism used to load the data to the destination system; an independent check will reveal any errors built into the ingestion mechanism.

To perform a quality control check on your data integration:

Gather the raw data delivered from the source system.
Gather the field mapping used to direct source fields to destination fields.
Gather the transformation performed on the raw data.
Gather the relevant data from the destination system.
Compare the source data with the resulting destination data.

Gathering the raw data from the source system can be folded into the data integration process itself. If the data was delivered in a file, store that file in an orderly fashion for future review. You could store the file itself in a directory at your institution, or you could load the raw data to its own table in a local database. If the data was delivered in another format (e.g., JSON response to an API call), your options are similar: create a file for each data set or store it in a table in a local database.

Field mappings are often expressed in crosswalks, which are documents with two columns: one for the name of the source field and one for the name of the associated destination field. Understanding the field mapping will enable you to check data loaded to destination fields against the appropriate source fields.

Often source data must be transformed to meet the requirements of the destination system. These transformations are most frequently value mappings (e.g., "X" in the source system is called "Y" in the destination system) and formatting changes (e.g., "11/13/2023" becomes "2023-11-13"). Understanding how the source data is transformed is necessary for comparing data across systems.

Each system has tools for reviewing data. Experts at your institution will know how to retrieve data from your local systems for review against the raw data. Use the field mapping information gathered earlier to identify the relevant data to retrieve from the destination system. You'll want to make sure you're looking at the right population of records. Often, you can accomplish this population selection using time of load/time of record creation. For example, if you wanted to do a quality control check on records loaded earlier this morning at 7 AM, you could retrieve data for records created between 7 and 7:10 AM (including a buffer for time to load). An additional parameter for your population selection is the source of the record. It's possible that other records are being created in your local system by other processes separate from your data integration. Those records should be excluded from your quality control check.

When comparing the source data with the resulting destination data, you'll want to ensure that all expected records and data points were properly transferred (see What to Look For above). Row counts are an effective tool for checking that the appropriate number of transferred records; this check can be done without transforming the source data. Make sure that the expected number of person and application records were transferred. To compare the source data with the resulting destination data in detail, programmatically transform the source data gathered earlier according to the field mapping and transformation gathered earlier. With this independently mapped and transformed source data, you can double-check that the resulting destination data is correct, in the appropriate record, and is in the right field. Don't forget to check whether data is being duplicated or overwriting sensitive fields.

How to Fix Problems

Ideally, your quality control review will reveal no issues. If that's not the case, however, you'll want to have a plan for dealing with problems. That plan will start with understanding the types of problems you might see. Generally, there are two types of issues: (1) missing data and (2) bad data.

Missing data problems are situations where data fails to transfer to your local system. In general, the solution here is to load the missing data again, which might require adjusting your regular exports to include any lost data. The failure could be at the point of export or at the point of import. A first step in addressing missing data problems is to try manually kicking off the data integration process again; it could be that a small hiccup interrupted an otherwise functional process. If that doesn't work, it will be necessary to diagnose the failure more specifically. Step through the integration process to see where the failure occurs: separately export the data and try loading it to your local system. If the CAS export fails, try it again and if it still doesn't work, reach out to customer service. If the failure is in the import, you'll have to dig even deeper. Check to see that the data is making it to the point of ingestion: a change in data format (a new field or an extra space or an extra-long text string) can cause a failure. If the file is making it into the system but data isn't being written to records, you might have a problem with the formatting or contents of an individual data field. Your local system likely produces logs that can be reviewed to better understand the error.

Bad data problems are situations where data was loaded incorrectly. In general, resolving bad data requires eliminating any incorrect data from your local system and then loading in the correct data. The solution to this problem is effective testing during the build stage; if you test a broad variety of possibilities in your data integration, you're more likely to catch possible bad data problems before they happen in production. Examples of bad data problems are duplicate records, data loaded to the wrong fields, and data loaded with an incorrect value mapping (e.g., "Tennessee" is loaded as "KY"), among others. Refer to the What to Look For section above for an overview of the most common types of bad data. Your local system will have different ways of eliminating data; work with the experts to determine how to best accomplish this operation. Before loading the data that didn't load properly, review and test your import process to determine and correct the cause of the error. Once that's been ironed out, you can load the data that failed to import properly.