Async AOG + Cluster node network issues = DOWNTIME!!!

Hi Gonzalo! We are in our busiest day of the month and SQL Server is not responding! We are seeing this error: “Unable to access availability database ‘xxxx’ because the database replica is not in the PRIMARY or SECONDARY role”. Please help us!

Hi SQL Server Lovers!

Today I’ll talk about a critical situation that happened some weeks ago. While I was drinking my favorite Argentinian drink (mate), one of the app team members pinged me saying… “Gonzalo! We need you immediately!” (Yes, I know, you hear this many times…) “We are having a P1 in production and it is our critical day!” So, I connected to the instance… And this is the story:

  • Topology:

Windows Server 2012 R2 cluster with 2 nodes + file share witness, and a SQL Server 2014 AOG role in asynchronous commit, located in an AWS data center.

  • Issue:

From the SQL Server side I saw this error -> “Unable to access availability database ‘xxxx’ because the database replica is not in the PRIMARY or SECONDARY role”, which means that the AOG role was not online.

From the cluster side I saw this critical event -> “Cluster node ‘xxxx’ was removed from the active failover cluster membership. The Cluster service on this node may have stopped.” Which means that the primary node lost connection with the other cluster members.
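When a node is evicted like this, the cluster log is usually the first thing to pull; a minimal sketch (the destination folder and 15-minute time span are just examples):

```powershell
# Generate the last 15 minutes of the cluster log from every node into C:\Temp,
# using local time so the entries line up with the SQL Server errorlog
Get-ClusterLog -Destination C:\Temp -TimeSpan 15 -UseLocalTime
```

In the generated cluster log files you can then search for the heartbeat/route failures that precede the membership removal.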

  • Primary assumptions:

Before this issue started to happen, SQL Server 2014 SP2 CU10 had been applied, so the easy way was to uninstall the patch… After 10 minutes the issue happened again. Next step? Windows patches: some had been applied the same day. Easy way 2: uninstall the Windows patches. Guess what? After 10 minutes, again… The main problem was that the primary SQL Server node was losing connections with the other cluster members, and since it was the primary, the AOG went down, which means… You know what it means…

  • Seeing the light at the end of the tunnel

After checking with the network, cloud and infrastructure teams, and also following this post -> https://blogs.technet.microsoft.com/askcore/2012/07/09/having-a-problem-with-nodes-being-removed-from-active-failover-cluster-membership/ the next step was: “Hey, why don’t we check the drivers?” In AWS, the drivers that manage the relationship between the hypervisor and the physical components are the PV drivers -> https://docs.aws.amazon.com/AWSEC2/latest/WindowsGuide/xen-drivers-overview.html After we checked this, we saw that the version on those boxes was outdated. So, the infrastructure team updated those drivers on all of the servers and… Voilà! The issue disappeared.

  • RCA

After some calls with the infrastructure team, they realized that the NICs on the server were under a huge load because of the peak day. Since the drivers were outdated, the primary server was not able to handle this, and that is why we had that issue.

  • Preventive actions

Check and update all the drivers!

Did you face the same situation?


Azure SQL DB or AWS RDS SQL Server, That is the question…

Hi Gonzalo! We want to migrate our on-prem/IaaS servers to PaaS. Which provider should we choose?

Hi SQL Server Lovers!

Today, I want to write about Platform as a Service. Let me start with Microsoft’s definition of PaaS:

https://azure.microsoft.com/en-us/overview/what-is-paas/

Platform as a service (PaaS) is a complete development and deployment environment in the cloud, with resources that enable you to deliver everything from simple cloud-based apps to sophisticated, cloud-enabled enterprise applications. You purchase the resources you need from a cloud service provider on a pay-as-you-go basis and access them over a secure Internet connection.

I’m working with both Azure SQL DB and Amazon RDS for SQL Server. Below is my opinion about those offerings:

  • Azure SQL DB:

Overview:

Azure SQL Database is the intelligent, fully-managed relational cloud database service that provides the broadest SQL Server engine compatibility, so you can migrate your SQL Server databases without changing your apps. Accelerate app development and make maintenance easy and productive using the SQL tools you love to use. Take advantage of built-in intelligence that learns app patterns and adapts to maximize performance, reliability, and data protection.

Opinion:

Microsoft is investing a lot in this technology; every new feature that is released for SQL Server on-prem/IaaS is tested first in Azure SQL DB. You can use geo-replication to move data across different locations, and also the new feature called Azure SQL Database auto-failover groups (which is in preview) to have automatic failovers in case of a failure on the primary node. On this link you can compare which features are supported on Azure SQL DB vs IaaS/on-prem -> https://docs.microsoft.com/en-us/azure/sql-database/sql-database-features. There are security features like Auditing & Threat Detection, Vulnerability Assessment, or the Data Discovery & Classification report that will help you to identify any possible threat/data breach on your system. In terms of performance you have the new automatic tuning feature, or one of my favorite things in Azure, Query Performance Insight: a report that will help you to identify your bottlenecks.

As I talked about the good things… Let me talk about the other things too. In Azure SQL DB I found two big issues:

  1. Migration

Today there are 4 ways to migrate your database to Azure SQL DB:

  1. Using DMA (Data Migration Assistant)
  2. Using DMS (Database Migration Service)
  3. Exporting/Importing your database using a BACPAC (Also you can do this using Management Studio)
  4. Using the Azure SQL DB Database as a subscriber in a replication

All of these approaches gave me headaches… A simple example, using the BACPAC approach: I had errors because my database was using domain users/groups. Since Azure SQL DB uses Azure AD authentication, I had to remove all the users from the database, export the BACPAC again, and re-create all the users one by one using the Azure approach (User@company.com). There is a new feature, Azure SQL Database Managed Instance -> https://docs.microsoft.com/en-us/azure/sql-database/sql-database-managed-instance that will allow us to perform native backup restores. I’m trying to test this, but I’m waiting for Microsoft’s confirmation to create this new project…
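For reference, re-creating the users with the Azure approach means creating Azure AD contained users on the target database; a minimal sketch (User@company.com and the role are placeholders):

```sql
-- Run on the target Azure SQL DB, connected as an Azure AD admin
CREATE USER [User@company.com] FROM EXTERNAL PROVIDER;
ALTER ROLE db_datareader ADD MEMBER [User@company.com];
```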

  2. Reducing Costs

“Pay-as-you-go”, as Microsoft described above, from my point of view does not apply to Azure SQL DB. There is no way to reduce the costs in Azure SQL DB other than downgrading your database tier. You cannot stop the “virtual server” or the database itself (in fact, there are some suggestions opened in the Azure feedback page, but with no positive response yet) -> https://feedback.azure.com/forums/217321-sql-database/suggestions/8739727-stopping-sql-azure-temporarily

or

https://stackoverflow.com/questions/26986213/stopping-sql-azure-db-when-not-in-use

I hope that in the future we have this ability, in order to save some costs.

  • Amazon RDS SQL Server

Overview

Amazon RDS for SQL Server is a managed service that is designed for developers who require the features and capabilities of SQL Server for building a new application. You can also move existing applications that utilize SQL Server to Amazon RDS without having to rewrite the application completely. Since Amazon RDS for SQL Server provides you direct access to the native capabilities of the SQL Server edition you have specified, your applications and tools should work seamlessly.

Opinion

I worked a lot with this offering… The best thing is that it supports native backup and restore (using S3 buckets), and you can restore a native RDS SQL Server backup on-prem/IaaS. To be honest, AWS provides the same SQL Server that you have on-prem/IaaS, but hosted in their cloud. Now you can also scale up storage (some months ago this was a limitation, but not anymore). These are the feature limitations -> https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/CHAP_SQLServer.html
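The native backup/restore I mentioned is exposed through RDS stored procedures; a minimal sketch (the database name and S3 ARN are placeholders):

```sql
-- Back up an RDS SQL Server database to an S3 bucket
EXEC msdb.dbo.rds_backup_database
     @source_db_name = N'MyDatabase',
     @s3_arn_to_backup_to = N'arn:aws:s3:::my-bucket/MyDatabase.bak';

-- Check the progress of the backup task
EXEC msdb.dbo.rds_task_status @db_name = N'MyDatabase';
```

The resulting .bak file can then be restored on-prem/IaaS like any native backup.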

Problems in RDS SQL Server:

  1. Unlike Azure SQL DB, you can stop the RDS SQL Server instance to reduce costs. The thing is, after one week AWS will turn the instance back on automatically. So yes, you can reduce costs, but not permanently. A workaround, if you don’t want to remove or use the instance for now, is to take a manual snapshot and save it for some time.
  2. We had some email notifications about restarts… Internally, AWS had to apply some patches on the machines that host those instances. Even though it is PaaS, we had patch downtime…
  3. It is not true that AWS will patch your SQL Server up to the latest version. AWS gives you the opportunity to choose among different versions, and the one who has to choose is you. For example, if you are using SQL Server 2014 SP2 (and AWS didn’t remove this version from the catalog), you will have to perform the upgrade to the latest build on the AWS portal by yourself. Yes, it is true that it only takes a couple of clicks… but you will have downtime as well. We opened a case with them, and they said that they will not upgrade SQL Server unless there is a security breach or AWS removes the version from the portal (CU patches are “minor upgrades”, not major upgrades).
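The snapshot workaround from item 1 can be sketched with the AWS CLI (the instance and snapshot identifiers are placeholders):

```shell
# Take a manual snapshot you can keep, then stop the instance
aws rds create-db-snapshot --db-instance-identifier mydb --db-snapshot-identifier mydb-keep
aws rds stop-db-instance --db-instance-identifier mydb
```

Remember that, as described above, AWS will start the instance again automatically after seven days.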

Are you using these platforms? If you do, please share your experiences too!

Listener Multisubnet + ETL Delay = Chaos!

“After I migrated to the SQL Server Always On HA feature using a multisubnet listener, my ETLs are not working! Please help me!”

Hi SQL Server Lovers!

This time, I’ll tell you about one strange situation that happened this week. Last Monday I started my shift drinking mate (an Argentinian drink 🙂 ) and I received a Skype ping saying… “Gonzalo, we migrated to SQL Server Always On last weekend and our interfaces didn’t run properly. Normally they run in 20 minutes, and yesterday we had to stop them after 5 hours running. We need your immediate assistance because our customers are waiting for their reports.”

So… I had a call with my customer to start troubleshooting the issue. This is our topology:

  • DW application (they receive data from other apps)
  • Remote batch server where the app team executes the DTSX packages
  • Database servers -> 2 nodes, SQL Server 2016 SP1 + Always On feature + multisubnet listener

Problem:

They were calling their Integration Services packages from a remote batch server (they were using the file deployment option for their packages) using the DTEXEC command line, with the MultiSubnetFailover=True attribute in their connection strings. Once they ran their packages, either there were no connections to the database, or the connections started and finished quickly, and the process was running really slowly.

One particular DTSX case:

They used to move 40 tables from one system to this DW in 20 minutes; with this issue, after 50 minutes only 5 were completed. We checked the network, but it was fine. The same for storage, memory, CPU, execution plans, indexes and statistics: we didn’t find any bottleneck. We also tested changing the connection string to point to the primary instead of the listener, and the interface ran as expected.

Solution

After changing the connection string to point to the primary… one idea came to my mind… I performed a ping to the listener name on the batch server itself and it was not responding (the DNS record in the cache was pointing to the offline IP). So, I flushed the DNS (IPCONFIG /FLUSHDNS), changed the connection string back, and tested the interface again… The result? Successful, in the expected time!
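The diagnosis above boils down to a couple of commands on the batch server (ListenerName is a placeholder):

```bat
:: Shows the cached (possibly stale/offline) IP
ping ListenerName
:: Clears the local DNS resolver cache
ipconfig /flushdns
:: The name should now resolve to the online IP
ping ListenerName
```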

Workaround:

Here you have three options:

  1. Manually perform IPCONFIG /FLUSHDNS on the server from which you are executing the SSIS DTSX package.

2. You can set the RegisterAllProvidersIP parameter for the listener to 0 on the cluster (be careful, because your DNS will have only one entry and the MULTISUBNETFAILOVER clause will not work) and reduce the HostRecordTTL parameter to 120:

Commands:

***** This will require downtime because you will have to restart the listener cluster resource *****

Open a powershell window on any cluster node and execute the following commands:

( You can execute Get-ClusterResource without any parameter to get the listener resource name )

Get-ClusterResource "ListenerResourceName" | Set-ClusterParameter -Name HostRecordTTL -Value 120

Get-ClusterResource "ListenerResourceName" | Set-ClusterParameter -Name RegisterAllProvidersIP -Value 0

Stop-ClusterResource "ListenerResourceName"

Start-ClusterResource "ListenerResourceName"
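Before and after the change, you can verify the current values with (again, use your own listener resource name):

```powershell
Get-ClusterResource "ListenerResourceName" |
    Get-ClusterParameter |
    Where-Object { $_.Name -in "HostRecordTTL", "RegisterAllProvidersIP" }
```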

3) The last option is what we actually did. Since the app team has multiple environments accessing this system, and they had more than 50 packages to review to determine the reason for the failure, I took the fast way (I like the KISS concept): creating this script to update the Windows hosts file to cache the listener value permanently (the app team executed all the interfaces from one particular server only), to make sure that the listener name will always resolve to the current (online) IP:

try {
    $logfile = "LogFileAddressfolder\Hostexecution.txt"
    $whereami = 0
    # Checking listener connection
    if (Test-Connection -ComputerName ListenerName -Count 1 -Quiet) {
        # If the listener is pinging, the script will not do anything
        Write-Output "ListenerName is alive and pinging" > $logfile
    }
    else {
        # Getting the hosts file content
        $hostfile = Get-Content "C:\Windows\System32\drivers\etc\hosts"
        $whereami = 10
        Write-Output $whereami > $logfile

        # Searching for an existing listener entry in the hosts file
        if ($hostfile | Select-String -Pattern "ListenerName") {
            # If a listener entry exists, replace it with a blank value
            $old_value = ($hostfile | Select-String -Pattern "ListenerName").Line
            $new_value = ""
            (Get-Content "C:\Windows\System32\drivers\etc\hosts").Replace($old_value, $new_value) |
                Set-Content "C:\Windows\System32\drivers\etc\hosts"
            $whereami = 20
            Write-Output $whereami >> $logfile
        }

        # Connecting to the listener using the MULTISUBNETFAILOVER clause to reach the online IP
        $Conn = New-Object System.Data.SqlClient.SqlConnection
        $Conn.ConnectionString = "server=ListenerName;database=master;trusted_connection=true;MultiSubnetFailover=True"
        $Conn.Open()
        $Command = New-Object System.Data.SqlClient.SqlCommand
        $Command.Connection = $Conn
        # Getting the current online IP from the Always On configuration
        $Command.CommandText = "SELECT ip_address FROM sys.availability_group_listener_ip_addresses WHERE state_desc = 'ONLINE'"
        $Command.CommandTimeout = 0
        $Read = $Command.ExecuteReader()
        $Read.Read() | Out-Null
        $whereami = 30
        Write-Output $whereami >> $logfile
        $ip = $Read.GetValue(0)
        $Read.Close()
        $Conn.Close()
        # Adding the new ListenerName entry to the hosts file
        Add-Content -Encoding UTF8 "C:\Windows\System32\drivers\etc\hosts" "$ip ListenerName.corporatedomain.com ListenerName"
        $whereami = 40
        Write-Output $whereami >> $logfile
    }
}
catch [Exception] {
    Write-Output "ERROR -----> Error occurred: $($_.Exception)" >> $logfile
    Write-Output "Last where-am-I flag: $whereami" >> $logfile
    Start-Sleep -s 5
}

Also, to execute this continuously, I created a Task Scheduler task and triggered it every minute (so we ensure that, at most one minute later, the hosts file will be updated). One important thing: updating the hosts file means that the MULTISUBNETFAILOVER clause won’t work! The app team agreed to this risk.

Preventive action:

Perform all the tests in staging before going live in prod! Test failovers, performance, everything, before moving on with any change!

Did you face the same situation? Please share your comments!

I hope this helps!

Replication= Bye Bye, Distribution Database!

We are in the middle of a migration and we need your help urgently! Replication is not working and the distribution database was deleted! The maintenance window is about to end.

Hi, SQL Server Lovers!

Today I’ll talk about one critical situation that happened to me this week. Last Friday, the distribution database was accidentally deleted in one of our apps. The app team was migrating from 2008 R2 to 2016, and they had this big issue:

  • Topology

Publisher and distributor on the same server with SQL Server 2016 SP1 Version

Subscriber with SQL Server 2016 SP1 Version

  • Issue:

A DBA tried to configure replication, but since there were some issues, he tried to remove the replication and start from scratch. Unfortunately, the disable part was not working, so he decided to execute these commands:

PLEASE, IF YOU HAVE THIS SITUATION DO NOT DELETE THE DISTRIBUTION DATABASE!

FIX THE ISSUE THAT YOU ARE GETTING TO DISABLE THE REPLICATION!

ALTER DATABASE DISTRIBUTION SET OFFLINE;

GO

DROP DATABASE DISTRIBUTION;

GO

Which means that the distribution DB was deleted… Obviously, he then tried to disable replication and he couldn’t:

Error:

Msg 911, Level 16, State 4, Procedure sp_MSrepl_helpsubscriberinfo, Line 120 [Batch Start Line 0]
Database ‘distribution’ does not exist. Make sure that the name is entered correctly.

  • Solution

So, here you have 2 ways:

  • Easy Way

After you delete the distribution database, the database files remain in the same place (unless you delete them too), so you can execute this script to attach the database again:

CREATE DATABASE Distribution ON PRIMARY
( NAME = N'distribution',

FILENAME = N'Last file path\Name.mdf' ,

SIZE = 8192KB , FILEGROWTH = 65536KB )
LOG ON
( NAME = N'distribution_log',

FILENAME = N'Last file path\Name.ldf' , SIZE = 8192KB , FILEGROWTH = 65536KB )
FOR ATTACH;

Then you can disable the replication again.

  • Hard Way

After you deleted the distribution database, you also removed the files from the hard drives and the recycle bin (yes, believe it or not, this was my situation!). So, to fix this, you can follow these steps:

  1. Create a new distribution database:

USE MASTER

GO

EXEC SP_ADDDISTRIBUTIONDB @database = N'Distribution_New',
@data_folder = N'C:\Temp',  --Choose your own folder name
@data_file = N'Distribution_New.MDF', --Choose your own file name
@data_file_size = 128,
@log_folder = N'C:\Temp',  --Choose your own folder name
@log_file = N'Distribution_New.LDF',  --Choose your own file name
@log_file_size = 64,
@min_distretention = 0,
@max_distretention = 72,
@history_retention = 48,
@security_mode = 1

2) Stop the SQL Server service

3) Copy/paste the Distribution_New files and rename the copies with the old distribution database file names

-- Copy the Distribution_New DB files directly to the old distribution file
-- names (using the folder chosen above), e.g. from an elevated command prompt:

copy C:\Temp\Distribution_New.MDF C:\Temp\Distribution.mdf
copy C:\Temp\Distribution_New.LDF C:\Temp\Distribution.ldf

4) Start SQL Server Service

5) Execute this script to attach the database again:

CREATE DATABASE Distribution
ON PRIMARY
( NAME = N'distribution', FILENAME = N'C:\temp\Distribution.mdf' , SIZE = 8192KB , FILEGROWTH = 65536KB )
LOG ON
( NAME = N'distribution_log', FILENAME = N'C:\temp\Distribution.ldf' , SIZE = 8192KB , FILEGROWTH = 65536KB )
FOR ATTACH

6) Disable replication and distribution:

-- Remove the local server as a Distributor.

USE master
GO

EXEC sp_dropdistributor @no_checks = 1;

Have you ever faced this same situation? Can you add any insights?

Happy new year, SQL fans!

Listener multisubnets + Replication= Where is the data??

“I migrated from standalone to HA using a listener multisubnet and replication is not working. I need your help immediately!”

Hi, SQL Server lovers!

Today I’ll talk about one situation that took me an entire year to fix/work around/make a decision on. Long story short: we migrated some instances from one data center to another. The particular situation was that we used the SQL Server Always On feature with multisubnet listeners, instead of a single subnet.

Topology:

  • SQL Server 2014 Enterprise Edition
  • Always On environment acting as publisher + multisubnet listener
  • Remote distributor
  • Always On environment acting as subscriber + multisubnet listener

The thing was that we didn’t know that some instances were using replication too…

So we configured replication using one listener as the publisher and another listener as the subscriber (you can follow this post to do it this way).

Everything was working properly until one day I received a call saying: “Gonzalo: Replication is not working and our customers are waiting for the information for their reports. We need your help immediately because today we have to send the information to the leadership.” (Yes, as you can see, in the database world everything is ASAP…)

After I connected to the instance, I saw this particular issue:

Unable to query the redirected server ‘Listener name’ for original publisher ‘Original publisher’ and publisher database ‘Database Name’ to determine the name of the remote server; Error 7303, Error message ‘Error 7303, Level 16, State 1, Message: Cannot initialize the data source object of OLE DB provider “SQLNCLI11” for linked server “Linked server hash”.’. (Source: MSSQL_REPL, Error number: MSSQL_REPL21879)

(You can also face the same situation on the subscriber side. The error is: “Cannot connect to subscriber ‘Listener name’”.)

So, after some hours, I connected to the distributor and saw that the publisher listener was not responding to ping (it was pointing to the old IP). I executed IPCONFIG /FLUSHDNS (2 or 3 times) and, after that, the listener started to respond and the situation was solved.

But… the same thing happened 3 or 4 more times the same week = leadership pushing for a permanent solution.

We opened a case with Microsoft and they said that the solution was to automate this flushdns command. Some months ago, a Microsoft engineer posted a workaround, but these are my final findings on this issue:

  • Issue:

Replication uses the OLE DB provider to connect, which means that the MULTISUBNETFAILOVER clause is not supported, and if the DNS entry changes on your distributor, replication will fail.

  • Workarounds:
  1. Automate ipconfig /flushdns -> for us it didn’t work
  2. Reduce the TTL parameter for the DNS -> for us it didn’t work
  3. Create another listener on the Always On role + set the RegisterAllProvidersIP parameter to 0 + HostRecordTTL = 60 + increase the timeout in the replication agent profile -> the problem with this is that you will have 2 dependencies for the Always On role, and also, in case of a failover, you will have to wait for replication to start = no permanent solution
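For completeness, workaround 1 can be automated with a scheduled task like the following (the task name and 5-minute interval are just examples; as noted above, this did not solve it for us):

```shell
schtasks /create /tn "FlushListenerDNS" /tr "ipconfig /flushdns" /sc minute /mo 5 /ru SYSTEM
```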
  • Our decision:

Since we have more than 30 web servers pointing to the listener, we kept RegisterAllProvidersIP = 1 to keep using the multisubnet listener clause, and used a different approach to move the data (Integration Services). Yes, it is true that we don’t have the data within seconds (we move the data every 5 minutes), but we met the requirement while avoiding the issue.

  • Permanent solution:

Some months ago, Pedro Lopez blogged about a new release of the OLE DB driver that supports the MultiSubnetFailover clause. I hope that we can apply this new version to replication soon, so we can avoid this kind of issue!

Did you face the same situation? Did you use the same approach? Please share your findings too!


Columnstores: ETL Nightmare

“We are using partitioning and columnstore indexes but after our ETL process nothing is working as expected”

Hi, SQL Server lovers!

Today, I’ll talk about one situation that happened to me this week. I was starting my day smoothly, until I received one email saying: “Gonzalo, we need your help immediately. Our interfaces are not working. It seems that something is happening on the database engine”. So, I jumped into a bridge call and found some interesting things:

  • Topology

SQL Server 2016 SP 1 + Database using partitioning + Clustered Columnstore indexes

  • Issue

After they executed an ETL process, I found that a clustered columnstore index had more than 500 rowgroups for the same partition, with fewer than 100,000 rows each.

  • Qry

Credits to Niko Neugebauer (http://www.nikoport.com/columnstore/)

You can go to this link -> https://github.com/NikoNeugebauer/CISL/tree/master/SQL-2016 then download and execute this file -> row_groups_details.sql to check all the details of your columnstore indexes.

  • Solution

I executed a rebuild only on the problematic partition boundary and that fixed our issue:

Command ->

ALTER INDEX Columnstore_Index ON [Schema].[TableName] REBUILD PARTITION = 4 WITH (MAXDOP = 1);

  • Preventive actions

Every time you insert data into your columnstore indexes, check the rowgroup sizes and states (COMPRESSED, OPEN, etc.) + the delta stores (open/closed).
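Besides the CISL scripts, SQL Server 2016 also ships a DMV you can use for a quick check; a minimal sketch (replace the table name with yours):

```sql
-- One row per rowgroup: look for many small OPEN/COMPRESSED rowgroups per partition
SELECT  partition_number,
        row_group_id,
        state_desc,
        total_rows,
        deleted_rows
FROM    sys.dm_db_column_store_row_group_physical_stats
WHERE   object_id = OBJECT_ID(N'[Schema].[TableName]')
ORDER BY partition_number, row_group_id;
```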

I hope this helps!

Please use the comments for feedback, questions or just to suggest topic for next posts!