Static Data Masking

Data masking as we all know is a data protection layer which replaces/scrambles/masks sensitive data being disclosed to unwanted/unauthorized users. “Static Data Masking” also known as Persistent data masking is a method to protect data at rest. It is a new security feature released in SQL 2019 (available for public preview at the time this blog was written) that helps users create a copy of masked [sensitive] data from production environment. Using this feature, copy of the live data is crated with appropriate masking functions; and the masked copy can be shared with users who intend to work on non-live data. This feature also helps keep organizations compliant which are subject to data protection/privacy regulations such as GDPR.

Data masking process starts with users configuring masking operations for the columns in database which contain sensitive information. Data is copied to a new database during the data copy process from the live system and then masking functions (according to the masking configuration) are applied to mask the data at column level. Unlike dynamic data masking, static data masking is persisted and irreversible (one-way process), original data cannot be retrieved.

Static data masking can be used for development, testing, analytics and business reporting, compliance, troubleshooting, and any other scenario where specific data cannot be copied to different environments.

 

References:

SQL Server 2019 CTP 2.1

Advertisements

Replacing a Legacy System: Part-2

In my last post we focused on “WHY” we should replace a Legacy system. In this post we shall try to understand a few important challenges that should be addressed.

  • Data Loss

One of the biggest challenge for any data/application migration project is that target system (new technology system) should readily consume source data (legacy system). However, there are plethora of tools, especially created to support migration, but it takes thorough analysis to map, transform and migrate legacy data. Often, Data Architects (DA), are accustomed or well versed with either source or target system (and rarely both) – the successful migration depends on their dexterity to mitigate gaps and build bridge between both source and target systems, making data movement very soothing.

  • Data Transformation/Cleansing

Data transformation is the heart of any migration project. This involves complex analysis of existing data, meta-data, business constraints, data integrity constraints, mandatory attribute, defaults attributes, derived attributes, etc. DA’s have to wear multiple hats as a business analyst, system analyst and developer, often taking to business stakeholders, infrastructure team, and project team to design migration specifications and transformation rules. Hence making data ingestion an error-free process.

  • Data Quality

As it happens, some legacy systems are poorly designed, which leads to lot of data-quality issues. These issues should either be fixed, by recommendations from business or they should be filtered before migrating the data. Apparently, filtering one such erroneous record means leaving behind other good quality records (linked together) behind in the legacy system due to integrity constraints. That is where an intelligent DA will extrapolate and try to fix such erroneous records using pseudo values. Again, in the end, it is a business decision and by all means [any modifications] should be agreed upon by the stakeholders.

  • Reconciliation

Error-free migration does not guarantees that migration is successful – Validation is the key! Best way to ascertain, is to reconcile a use-case between legacy system and the new system. There are many tools that can reconcile data, but manual validation is required by executing same use-case on both systems and recording the out-come. Reconciliation packages should also determine the over-all quality of migrated data and special attention should be given to validate – data loss and data quality.

Replacing a Legacy System: Part-1

Legacy applications are like “First Love”, really hard to let-go! Whilst, it’s an up-hill task to replace them, there are multifarious reasons why replacing a legacy application is beneficial. Here are few reasons that are critical and deciding factors for replacing legacy applications.

  • Cost or Cost-Effectiveness

This is a prime factor for replacing legacy systems. Major chunk of money goes into maintaining the applications and without debate, any organization would have to do so; but is it worth maintaining an app with almost obsolete technology. If I were a CFO, I shall always ask “What ROI does supporting my current app bring vs replacing it with new one?”. Whilst, replacement cost of the legacy application is lower than maintaining it, hence it’s a desirable action in this direction.

  • Integration Challenges

Legacy systems do not always integrate well with the latest technology systems and integration-pain via custom-written apps is more than replacing the entire system. In the era of IOT (and ubiquitous computing) where all devices communicate with each-other and integration is seamless, it makes more sense of adapting the new technology and replacing the old ones.

  • Productivity

Productivity is another enabler in replacement of legacy systems. As basic definition of business changes from product-centric to customer centric, use-case have become more complex. Legacy systems have been doing the heavy lifting (amending and sustaining) at the price of productivity (cost and time). New technology and apps have configurable business-rule-engine, that enhances productivity and makes the process of future adaption more sublime.

  • Laid-back Decision Making

Decision Making has been a challenge while using legacy application because, accessing data and churning out meaningful insights (data analytics) require a separate decision making system. Data from the legacy has to be transformed and refreshed into the decision making system coupled with reporting solution for presentation. Business stake-holders always took “reactive” decisions based on the historic trends (or incidents in terms of security). With the advent of bleeding-edge technology and in-memory data processing, stake-holders are in a position to take pro-active decisions (also mitigate timely risks in terms of security)

Now that you understand WHY legacy applications should be replaced; In my next blog, I shall try to list down important technical challenges that we overcome for smooth replacement.

A go-live Saved!!!

An unfixed issue will always haunt you on the day of go-live. Today was one of that day when customer started complaining about serious performance issues post go-live. Eventually, a small configuration change saved the day and in the end customer was happy!!!!

Here is what customer complained:

  • Application is very slow and there is lot of clocking.
  • Application user has to wait for as long as a minute after pressing a button to fetch data.
  • CPU utilization was constantly high, never came down below 90% post deployment

Our observations:

  • Found more than 1000 queries taking consuming high CPU time (average worker time)
  • There were more than 500 queries with average execution time of more than 10 seconds
  • Maximum Degree of parallelism was “0”
  • Cost Threshold of Parallelism was “5”
  • SQL Server Box has single NUMA node with 8 logical processors

Analysis and Correction Steps

  • Max Degree of Parallelism: As per guidelines for max degree of parallelism, SQL Server with single NUMA node with 8 logical processors should have MAX DOP setting of 8 or less. Default setting of zero should always be changed based on the CPU cores available to the SQL Server.
  • Cost Threshold of Parallelism: This setting was also set to default, as per recommendations by experts this setting should be set between 25-50. But one has to always test and find the a number that is not too high or not too low. A low value say 5 (default), means lot of queries whose cost is greater than 5 will be chosen to execute in parallel. There are queries that does not require parallel execution, but they are forced to go parallel and hence execution time shoots up. The opposite is true when the value is high. A candidate query that should execute in parallel will execute as a single thread with high execution time. We recommended the pre-tested value of 30 measured for an equivalent workload.
  • High CPU: CPU utilization was high as almost all the queries went for an implicit conversion from NVARCHAR to VARCHAR. Then happens when a parameterized query declares default NVARCHAR(4000) for a string parameter from application. The underlying column in the database is of type VARCHAR, hence query goes thru implicit conversion. This behaviour is called parameter sniffing, where SQL Server complies the execution plan sniffing the parameters from the input query and use that plan whenever the query is executed. Apparently, implicit conversion makes the existing indexes unusable and query goes for a Clustered Index Scan or a full table scan (for Heap). This also shoots up the execution time of the query. This fix to this problem is to let SQL Server know the datatype of parameters. In our case, we know that there is no NVARCHAR column in the database so a small change in jdbc url solved the problem (sendStringParametersAsUnicode=false).In the end, it was a day accomplished (with happy customer) and go-live saved 🙂

Application Tuning vs Database Tuning

Recently, there was a debate when application developers and database developers were in tussle and pointing fingers on each other for a pity performance issue 🙂

Here are few pointers over the debate and little initial development:

  • Application Developer: There is poorly performing query when being executed from the application. Database developer’s needs to analyse and fix it.
  • Database Developer: The query when executed from database works fine and provides relevant results quickly.
  • Database Administrator: Query pulled from the plan cache – goes for a full table scan and parameterized. There are indexes defined on the predicates, but they are not getting picked-up!

This is what developers saw when executing the query from management studio.

DeveloperPlan

This is the execution plan generated when query is executed from application.

ApplicationPlan

Clearly, looking at highlighted operators in the plan, we can figure out that query executed by application goes for a Clustered Index Scan as opposed to Clustered Index Seek.

Why this difference?

It boils down to the parameterization datatypes!!!!

Application developer passed one parameter of “String” type in the filter predicate, which had a column of type VARCHAR. And SQL Server created a parameter of type NVARCHAR by default for any “String” literal. Hence, to match data type on both side of an equivalence operator, the column with VARCHAR datatype is implicitly converted to NVARCHAR. If we hover over the clustered index scan operator, we will see CONVERT_IMPLICIT being used to perform the conversion task (see highlighted below).

ApplicationScan

On the other hand, when the same query is executed from management studio, query uses Index Seek as no implicit conversion is required.

DeveloperSeek

Possible application fix: Application developers should describe parameter types which gets passed in the query. In NET APIs parameters can be described by using SqlParameter class, that contains not only for parameter name and value, but also for parameter data type, length, precision, and scale.

 

References:

How data access code impacts database performance
SqlParameter Class