The Problem with Data Masking Techniques

It is no longer a secret: many healthcare organizations are sharing health data. More standards are emerging as a result, including new standards by NIST on de-identifying datasets. Let’s be clear: data masking and de-identification are not the same. Believing data masking techniques are on par with proper de-identification could prove costly. Confusing the two methods puts your organization’s brand, reputation, and bottom line at risk.

Problem 1: Data masking techniques do not use metrics to measure the actual risk of re-identification. Without such metrics, it is not always possible to know whether the transformations performed on the data were sufficient to de-identify it, or whether they can be deemed defensible.

Solution: Combining masking with de-identification techniques provides the risk-measurement-based approach that is needed to safeguard privacy. A risk-based approach helps ensure that the techniques applied match the actual level of risk in the data. As stated by the HHS, “Patient demographics could be classified as high-risk features. In contrast, lower risk features are those that do not appear in public records or are less readily available.”
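To make "measuring risk" concrete, here is a minimal sketch of one common way to put a number on re-identification risk: group records by their combination of quasi-identifier values and treat each record's risk as one divided by the size of its group. The article does not prescribe a particular metric, so the function name, the sample records, and the 0.5 threshold below are all illustrative assumptions.

```python
from collections import Counter


def reidentification_risk(records, quasi_identifiers):
    """Return (max_risk, avg_risk) based on equivalence class sizes.

    Illustrative metric only: a record's risk is taken to be
    1 / (number of records sharing its quasi-identifier values).
    """
    keys = [tuple(r[q] for q in quasi_identifiers) for r in records]
    class_sizes = Counter(keys)
    risks = [1 / class_sizes[k] for k in keys]
    return max(risks), sum(risks) / len(risks)


# Hypothetical dataset: names are already masked, quasi-identifiers remain.
records = [
    {"name": "XXXX", "birth_year": 1980, "zip3": "100", "sex": "F"},
    {"name": "XXXX", "birth_year": 1980, "zip3": "100", "sex": "F"},
    {"name": "XXXX", "birth_year": 1975, "zip3": "941", "sex": "M"},
    {"name": "XXXX", "birth_year": 1962, "zip3": "606", "sex": "F"},
]

THRESHOLD = 0.5  # assumed acceptable maximum risk, for illustration only

max_risk, avg_risk = reidentification_risk(
    records, ["birth_year", "zip3", "sex"]
)
print(f"max risk: {max_risk:.2f}, average risk: {avg_risk:.2f}")
print("within threshold" if max_risk <= THRESHOLD else "exceeds threshold")
```

In this toy dataset two records are unique on their quasi-identifiers, so the maximum risk stays at 1.0 even though every name has been masked. A measured value like this is what lets a team argue that a transformation was, or was not, sufficient.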

Problem 2: Data masking only deals with direct identifiers. Data masking techniques typically attempt to eliminate direct identifiers. Direct identifiers are data fields that can be used alone to uniquely identify individuals, like name, email address or Social Security Number. Typically, direct identifiers are not used in statistical analyses.

Solution: Distinguish what types of identifiers are in your data. Quasi-identifiers, also called indirect identifiers, are fields that can identify individuals and are also useful for data analysis. Examples include dates; demographic information, such as race and ethnicity; and socioeconomic variables, such as occupation and income. This distinction is important because the drawback of dealing with only direct identifiers is that the risk exposure from the quasi-identifiers remains.
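The distinction can be sketched in a few lines of code. The example below tags each field as a direct identifier, a quasi-identifier, or neither, masks only the direct identifiers, and then checks which records are still unique on their quasi-identifiers. The field names, the classification table, and the sample records are hypothetical and chosen only for illustration.

```python
from collections import Counter

# Hypothetical classification of the fields in a dataset.
FIELD_TYPES = {
    "name": "direct",
    "email": "direct",
    "birth_date": "quasi",
    "occupation": "quasi",
    "lab_result": "other",
}


def mask_direct(record):
    """Replace direct identifiers with a placeholder; leave other fields."""
    return {
        field: "***" if FIELD_TYPES.get(field) == "direct" else value
        for field, value in record.items()
    }


def unique_on_quasi(records):
    """Return the records whose quasi-identifier combination is unique."""
    quasi = [f for f, t in FIELD_TYPES.items() if t == "quasi"]
    combos = Counter(tuple(r[f] for f in quasi) for r in records)
    return [r for r in records if combos[tuple(r[f] for f in quasi)] == 1]


records = [
    {"name": "Ann Lee", "email": "ann@example.com",
     "birth_date": "1980-03-14", "occupation": "nurse", "lab_result": 5.1},
    {"name": "Bo Chan", "email": "bo@example.com",
     "birth_date": "1980-03-14", "occupation": "nurse", "lab_result": 6.3},
    {"name": "Cy Diaz", "email": "cy@example.com",
     "birth_date": "1971-11-02", "occupation": "pilot", "lab_result": 4.8},
]

masked = [mask_direct(r) for r in records]
still_unique = unique_on_quasi(masked)
print(f"{len(still_unique)} of {len(masked)} masked records remain unique")
```

The third record has no name or email left, yet its birth date and occupation still single it out, which is the residual exposure described above.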

Problem 3: Masking effectively eliminates analytic utility. Many masking techniques destroy the data utility of the masked fields. Masking should only be used on fields that will not require any analytics.

Solution: Find better, proven methods of de-identifying that will help keep data quality high. At the end of the day, the data is being masked so it can be used for secondary purposes, like research, post-marketing surveillance, monetization, and analytics. These efforts deserve granular, high-value data. De-identification is a risk-management exercise; by understanding the risks and managing them, your organization can reap the rewards.
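As a rough illustration of the utility trade-off, the sketch below compares masking an age field with generalizing it into ten-year bands. Generalization is used here only as one example of a transformation that reduces precision without discarding the field; the article does not name a specific method, and the function names and sample ages are assumptions.

```python
from collections import Counter


def mask_value(_value):
    """Masking: replace the value with a fixed placeholder."""
    return "XX"


def generalize_age(age, width=10):
    """Generalization: map an exact age to a band such as '30-39'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


ages = [23, 27, 31, 34, 38, 45, 52, 58]  # hypothetical patient ages

masked = [mask_value(a) for a in ages]
generalized = [generalize_age(a) for a in ages]

# After masking, every value falls into one bucket: nothing left to analyze.
print(Counter(masked))

# After generalization, the age distribution is still visible by band.
for band, count in sorted(Counter(generalized).items()):
    print(band, count)
```

The masked column can no longer support any age-based analysis, while the generalized column still shows how patients are distributed across age bands. How coarse the bands need to be is a question for the risk measurement, not a fixed rule.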

Learn more about data masking pitfalls in our whitepaper: Avoid the Blur of Data Masking. Download it here.