R Services in SQL 2016 Not responding? Maybe this is your issue!

Hi Gonzalo,
We installed R services in SQL 2016 SP2 and we are getting issues intermittently, can you please help us?

Hi SQL Lovers!

2 Weeks ago I was working in a very interesting case in SQL 2016. I received an email from one of my customers saying that they were having intermitent issues within their app that was executing some SP_Execute_External_script SP calls to the database.

  • Topology:

SQL Server 2016 SP 2 CU 4 Enterprise Standalone + R services

  • Issue from SQL Side:

issue

  • Details:

We performed some normal checks between SQL and OS Side (Permissions, event viewer logs, R logs etc etc) But we didn’t find any event or error that can point to our issue. From R log we just found this:

StaleDirectoryCleaner starting directory deletions…

StaleDirectoryCleaner completed directory deletions with 1

SQLSatellite_LauncherCleanUp, dllmain.cpp, 381

WaitAll(2, 5000) completed with 258

cleanup wait failed with 258 and error 0

We also restarted the launchpad service but with no luck….. The biggest challenge was that sometimes the service responded fine and sometimes it showed the issue that I paste above (And this for me was absolutelly new…). From SQL Side we executed an extended events session with all R services counters but nothing appeared… From sys.dm_Exec_session_Wait_stats we just observed that the session was waiting for SATELLITE_SERVICE_SETUP wait which points that SQL Was waiting an answer from the R service itself. After that, since it was not a sql issue we decided to open a case with Microsoft. At the begining, we executed some traces in Proc mon and process explorer and we found that the BXLServer service sometimes started and sometimes it was blocked by something (The firewall was down)… So then we pointed to the antivirus and filtered drivers. We uninstall it but it didn’t fix our issue.

  • Solution:

After our engineer reached the product team, they sent as the solution. There is a currently R bug within SQL 2016 (Fixed in SQL 2017). The issue is that R service will query the OS asking if HTTPS is enabled or not to get the latest repository version. Based on that call it will use https:// site…., and if it doesn’t respond, it sets to http:// site instead. In order to fix this we just modify the profile file which is located in C:\Program Files\Microsoft SQL Server\MSSQL13.MSSQLSERVER\R_SERVICES\etc\Rprofile.site . We changed this:

httpsSupported <- RevoUtils:::httpsSupported()

to

httpsSupported <- F

Then, we restarted the launchpad service and that fixed our issue!! Here you have two ways, or you modify the launchapd config file (important thing…. If you modify this file and apply a CU or SP that file will be replaced and you will have to modify it again) or you migrate directly to SQL 2017.

Hope this helps!

 

 

 

 

 

 

 

Advertisements

Clustered instance patching = Failover issue = Master database corrupted???

Hi Gonzalo, We applied SP1 on the passive node and the failover didn’t work. We are not able to use our passive node now and we are still running on RTM, please help us!

Hi SQL Server Lovers!

Today, I want to talk about one situation that happened last week… I received an email saying that there were some issues within a failover cluster patching implementation.

  • Topology:

Failover clustered instance in SQL Server 2016 RTM with 2 nodes:

  • Issue:

SP 1 was applied on the passive node then the failover was executed and it failed.

  • Errors:

Event viewer:

1

SQL Server Error Log:

2

“Config Statement cannot be used inside a user transaction” or Cannot recover the master database… Interesting…. Below those scary messages, I found another interesting thing:

7.png

First: “Granting login access User name to MSDB database” and then “A problem was encountered granting access to MSDB database for login (null)”… That means that something was going on with that particular account.

  • Solution:

First, I started debugging the file that sql server was trying to execute:

File name:

msdb110_upgrade.sql

File location:

Instance default installation path\Install\msdb110_upgrade.sql

Inside this file, I searched for POST_SQLAGENT100.SQL section (above you can see that there is a message saying that “Execution of POST_SQLAGENT100.SQL was completed”) and then found that SQL was trying to execute this code:

–wals thru all non sysadmin job owners

DECLARE job_nonsysadmin_owners_cursor CURSOR LOCAL FOR

SELECT DISTINCT j.owner_sid FROM sysjobs j

FOR READ ONLY

 

OPEN job_nonsysadmin_owners_cursor

FETCH NEXT FROM job_nonsysadmin_owners_cursor INTO @owner_sid

WHILE (@@fetch_status = 0)

BEGIN

SELECT @owner_name = SUSER_SNAME(@owner_sid)

IF @owner_name IS NOT NULL

BEGIN

–is job owner member of sysadmin role?

BEGIN TRY

EXECUTE AS LOGIN=@owner_name — impersonate

SELECT @is_sysadmin = ISNULL(IS_SRVROLEMEMBER(‘sysadmin’),0) — check role membership

REVERT — revert back

END TRY

BEGIN CATCH

SET @is_sysadmin = 0

END CATCH

 

IF @is_sysadmin = 0

    BEGIN

                    –add job_owner to the SQLAgentUserRole msdb role in order to permit the job owner to handle his jobs

                    –has this login a user in msdb?

                    IF NOT EXISTS(SELECT * FROM sys.database_principals WHERE (sid = @owner_sid) OR (LOWER(name collate SQL_Latin1_General_CP1_CS_AS) = LOWER(@owner_name collate SQL_Latin1_General_CP1_CS_AS)))

                    BEGIN

                                    PRINT ”

                                    PRINT ‘Granting login access”’ + @owner_name + ”’ to msdb database…’

                                    BEGIN TRY

                                      EXEC sp_grantdbaccess @loginame = @owner_name

        END TRY

        BEGIN CATCH

          RAISERROR(‘A problem was encountered granting access to MSDB database for login ”%s”. Make sure this login is provisioned with SQLServer and rerun sqlagent_msdb_upgrade.sql ‘, 10, 127) WITH LOG

END CATCH

END

 

PRINT ”

PRINT ‘Adding user ”’ + @owner_name + ”’ to SQLAgentUserRole msdb role…’

BEGIN TRY

EXEC sp_addrolemember @rolename = ‘SQLAgentUserRole’, @membername = @owner_name

END TRY

BEGIN CATCH

RAISERROR(‘A problem was encountered adding user ”%s” to SQLAgentUserRole. Make sure this is a valid user in MSDB database and rerun sqlagent_msdb_upgrade.sql ‘, 10, 127) WITH LOG

END CATCH

END

END

FETCH NEXT FROM job_nonsysadmin_owners_cursor INTO @owner_sid

END

DEALLOCATE job_nonsysadmin_owners_cursor

 

So, I started checking this account state + permissions (Disabled without any permissions….). Next step? Checking all job owners (As you can see at the beginning of this cursor: SELECT DISTINCT j.owner_sid FROM sysjobs j).. and VOILÀ! This account owned a job! To fix this situation, we changed the job owner to SA so the cursor can avoid that particular user and tried the failover again… But this time IT WORKED!!! After that, we were able to apply SP1 on the other node as well and issue fixed :):)!

  • Similar Case:

https://blog.sqlauthority.com/2016/10/08/sql-server-unable-start-services-patching-sqlagent_msdb_upgrade-sql/

Did you face the same situation?

Gonzalo.

Asynch AOG + Cluster node network issues= DOWNTIME!!!

Hi Gonzalo! We are in our critical month day and the SQL Server is not responding! We are seeing this issue “Unable to access availability database ‘xxxx’ because the database replica is not in the PRIMARY or SECONDARY role”. Please help us!

Hi SQL Server Lovers!

Today I’ll talk about a critical situation that happened some weeks ago.. While I was drinking my favorite Argentinian drink (Mate), one of the app team members pinged me saying… Gonzalo! We need you immediatelly (Yes.. I know that you hear this many times….) We are having a P1 in production and it is our critical day! So, I connected to the instance… And this is the history:

  • Topology:

Windows Server 2012 R2 Cluster with 2 nodes + Fileshare witness and SQL Server 2014 AOG Role in Asynch commit located in AWS Data Center.

  • Issue:

From SQL Server side I saw this issue -> “Unable to access availability database ‘xxxx’ because the database replica is not in the PRIMARY or SECONDARY role” Which means that the AOG role was not online.

From cluster side I saw this critical event -> Cluster node ‘xxxx’ was removed from the active failover cluster membership. The Cluster service on this node may have stopped. Which means that the primary node lost connection with the other cluster members.

  • Primary assumptions:

Before this issue started to happen… SQL Server 2014 SP2 CU10 was applied.. So the easy way was to uninstall the patch…. After 10 minutes the issue started to happen again.. Next step? Windows patches.. The same day some patches were applied.. Easy way 2.. Uninstall windows patches.. Guess what? After 10 minutes Again…. The main problem was that the primary SQL Server node was lossing connections with the other cluster members and since it was the primary.. The aog went down which means… You know what it means….

  • Seeing the light at the end of the tunnel

After checking with network, cloud and infrastructure teams and following this post also -> https://blogs.technet.microsoft.com/askcore/2012/07/09/having-a-problem-with-nodes-being-removed-from-active-failover-cluster-membership/ Next step was… “Hey why don’t we check the drivers?”… In AWS the drivers that manage the relationship between the hypervisor and the physical components are PV Drivers -> https://docs.aws.amazon.com/AWSEC2/latest/WindowsGuide/xen-drivers-overview.html After we checked this, we saw that the version on those boxes were outdated…So, the infrastructure team updated those drivers on all of the servers and…. Voilá! The issue dissapeared.

  • RCA

After having some calls with infrastructure team, they realized that the Nics on the server were having a huge load because of the pick day.. Since the drivers were outdated the primary server was not able to handle this and that is why we had that issue.

  • Preventive actions

Check and update all the drivers!

Did you face the same situation?

Azure SQL DB or AWS RDS SQL Server, That is the question…

Hi Gonzalo! We want to migrate our on Prem/IaaS Servers to PaaS, what provider should I choose?

Hi SQL Server Lovers!

Today, I want to write about Platform as a Service…. Let me start with Microsoft’s definition about What is PaaS?

https://azure.microsoft.com/en-us/overview/what-is-paas/

Platform as a service (PaaS) is a complete development and deployment environment in the cloud, with resources that enable you to deliver everything from simple cloud-based apps to sophisticated, cloud-enabled enterprise applications. You purchase the resources you need from a cloud service provider on a pay-as-you-go basis and access them over a secure Internet connection.

I’m working with Azure SQL DB and Amazon RDS SQL Server.. Below, my opinion about those offerings:

  • SQL Azure DB:

Overview:

Azure SQL Database is the intelligent, fully-managed relational cloud database service that provides the broadest SQL Server engine compatibility, so you can migrate your SQL Server databases without changing your apps. Accelerate app development and make maintenance easy and productive using the SQL tools you love to use. Take advantage of built-in intelligence that learns app patterns and adapts to maximize performance, reliability, and data protection.

Opinion:

Microsoft is investing a lot on this new technology, every new feature that will be released on SQL Server on prem/IaaS is tested first in Azure SQL DB. You can use GEO Replication to move data across different locations.. And also this new feature called Azure SQL Database auto-failover groups ( which is in-preview) to have automatic failovers in case of a failure on the primary node. On this link you can compare which are the features supported on Azure SQL DB vs IaaS/On Prem -> https://docs.microsoft.com/en-us/azure/sql-database/sql-database-features. There are security features like Auditing & Thread detection, vulnerability assessment  or  Data Discovery & classification report that will help you to identify any possible thread/ data breach on your system. In terms of performance you have the new feature automatic tuning or one of my favorites things in azure, Query Performance Insight.. A report that will help you to identify your bottlenecks.

As I talked about good things…. Let me talk about other things also.. In Azure SQL DB I found two big issues:

  1. Migration

Today there are 4 ways to migrate your database to Azure SQL DB:

  1. Using DMA (Data Migration Assistant)
  2. Using DMS (Database Migration Service)
  3. Exporting/Importing your database using a BACPAC (Also you can do this using Management Studio)
  4. Using the Azure SQL DB Database as a subscriber in a replication

All of this approaches gave me headaches… Simple example…. Using BACPAC approach.. I had errors because my database was using Domain users/group… Since azure DB use Azure Authentication I had to remove all the users from the database, export the bacpac again and move all the users one by one using the azure approach (User@company.com). There is a new feature Azure Managed instance -> https://docs.microsoft.com/en-us/azure/sql-database/sql-database-managed-instance that will allow us to perform native backup restores… I’m trying to test this but I’m waiting from Microsoft’s confirmation to create this new project…..

2. Reducing Costs

“pay-as-you-go” As Microsoft described before from my point of view… It is not applying to Azure SQL DB… There is no way to reduce the costs in azure sql db rather than doing a downgrade of your sql azure db.. You cannot stop the “Virtual Server” or the database itself (In fact there are some suggestions opened in Azure feedback page but with no positive response yet ->  https://feedback.azure.com/forums/217321-sql-database/suggestions/8739727-stopping-sql-azure-temporarily

or

https://stackoverflow.com/questions/26986213/stopping-sql-azure-db-when-not-in-use

I hope in the future we have this ability in order to save some costs.

  • Amazon RDS SQL Server

Overview

Amazon RDS for SQL Server is a managed service that is designed for developers who require the features and capabilities of SQL Server for building a new application. You can also move existing applications that utilize SQL Server to Amazon RDS without having to rewrite the application completely. Since Amazon RDS for SQL Server provides you direct access to the native capabilities of the SQL Server edition you have specified, your applications and tools should work seamlessly.

Opinion

I worked a lot with this offering… The best thing is that this support native backup and restore (using S3 buckets) and you can restore a native RDS SQL Server backup into on prem/IaaS. Being honestly, AWS provide the same SQL Server that you have in on prem/IaaS but hosted in their cloud. Now you can Scale up storage (some months ago it was a limitation.. but not now). This are the feature limitations -> https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/CHAP_SQLServer.html

Problems in RDS SQL Server:

  1. Unlike SQL Azure DB you can stop the RDS SQL Server instance to reduce costs.. The thing is after one week AWS Will turn on the instance automatically.. So, yes you can reduce costs but not permanently.. A workaround will be ( if you don’t want to remove or use the instance now, you can execute a manual snapshot and save it for some time)
  2. We had some email notifications about restarts… Internally AWS had to apply some patches on the machines that host those instances… Even it is PaaS we had patch downtime….
  3. It is not true that AWS will patch your SQL Server up to the latest version. AWS will give you the opportunity to choose among different version. The one that will have to choose is you. For example, if you are using SQL 2014 SP 2 (and AWS didn’t remove this version from the catalog) you will have to perform the upgrade to the latest build on the AWS portal by yourself…. Yes, it is true that you will have to perform 2 clicks… But you will have downtime also.. We opened a case to them, and they say that they will not upgrade SQL Server unless there is a security breach or AWS remove the version from the portal (CU patches are “minor upgrades” not major upgrades..).

Are you using this new platform? if you do, please share your experiences too!

Listener Multisubnet + ETL Delay = Chaos!

After I migrated to SQL Server Always On HA feature using listener multisubnets, my ETLs are not working! please help me!

Hi SQL Server Lovers!

This time, I’ll tell you one strange situation that happened this week.. Last Monday I started my shift Drinking Mate ( Argentinian drink 🙂 ) and I received a Skype ping saying….. “Gonzalo we migrated to SQL Server Always on last weekend and our interfaces didn’t run properly…. Normally they run in 20 minutes and yesterday we had to stop them after 5 hours running.. We need your immediate assistance because our customers are waiting for their reports”.

So… I had a call with my customer to start troubleshooting the issue. This is our topology:

  • DW application ( They receive data from others apps )
  • Remote Batch server where app team execute the DTSX packages
  • Database Servers -> 2 Nodes, SQL Server 2016 sp1 + Always on feature + Listener multisubnet

Problem:

They were calling their Integration Services packages from a remote batch server ( they were using File deployment option for their packages ) using DTEXEC command line with multisubnetfailover=True attribute on their connection strings. Once they run their packages there were no connections to the database or the connections start and finish quickly and the process was running really slow.

One DTSX particular Case:

They moved 40 tables from one system to this DW in 20 minutes.. And having this issue, after 50 minutes only 5 were completed. We checked the network but it was fine.. The same for storage, memory, cpu, execution plans, indexes, statistics and we didn’t find any bottleneck. We tested also changing the connection string to point to the primary instead of listener and the interface ran as expected.

Solution

After changing the connection string to point to the primary…. One idea came to my mind….. I performed a ping to the listener name on the batch server itself and it was not responding ( The DNS record on the cache was pointing to the offline IP). So, I flushed the DNS ( IPCONFIG /FLUSHDNS), changed the connection string and tested the interface again…. The result? Successful on the expected time!

Workaround:

Here you have three ways:

  1. Manually perform IPCONFIG /FLUSHDNS from the server that you are executing the SSIS DTSX package.

2. You can modify RegisterAllProvidersIP (Be careful because your DNS will have only one entry and MULTISUBNETFAILOVER clause will not work….) parameter on the cluster for the listener to 0 and reduce the HostRecordTTL parameter to 120:

Commands:

***** This will require downtime because you will have to restart the listener cluster resource *****

Open a powershell window on any cluster node and execute the following commands:

( You can execute Get-ClusterResource without any parameter to get the listener resource name )

Get-ClusterResource Listener name | Set-ClusterParameter -Name HostRecordTTL -Value 120

Get-ClusterResource Listener name | Set-ClusterParameter -Name RegisterAllProvidersIP -Value 0

Stop-ClusterResource Listener name

Start-ClusterResource Listener name

3) The last option is the case that we performed… Since app team have multiple environments accessing this system and they had more than 50 packages to review and determine the reason of the failure…. I took the fast way ( I like the KISS architecture concept…).. Creating this script in order to update the windows hostfile to cache permanently the listener value ( App team executed all the interfaces only from one particular server….) to make sure that the listener will only point to the current (online) IP:

TRY{   
      $logfile="LogFileAddressfolder\Hostexecution.txt"   
      $whereami=0      
      #Checking Listener Connection      
      IF (test-Connection -ComputerName ListenerName -Count 1 -Quiet )    
      {        
         #If the listener is pinging the script will not do anything         
         write-output "ListenerName   is alive and Pinging " > $logfile            
      }   
      ELSE    
      {           
         #Getting Host Configuration      
         $hostfile = get-content "c:\Windows\System32\drivers\etc\hosts"       
         $whereami= 10       
         write-output $whereami > $logfile         
      
         #Searching if the listener value exists on the hostfile      
         IF($hostfile | Select-String -Pattern "Listenername")      
         {       
            #If the listener value exists on the hostfile the script will replace it with a blank value           
            $old_value=$hostfile | Select-String -Pattern "ListenerName"          
            $new_value=""                                                       
            (Get-Content c:\Windows\System32\drivers\etc\hosts).replace($old_value,$new_value) | Set-Content c:\Windows\System32\drivers\etc\hosts         
            $whereami= 20         
            write-output $whereami >> $logfile      
         }      
         
         #Performing the connection to the listener name using MULTISUBNETFAILOVER clause to get the online IP         
         $Conn = New-Object System.Data.SQLClient.SQLConnection      
         $Conn.ConnectionString = "server=ListenerName;database=Master;trusted_connection=true;MULTISUBNETFAILOVER=TRUE";      
         $Conn.Open()      
         $Command = New-Object System.Data.SQLClient.SQLCommand      
         $Command.Connection = $Conn 
         #Getting the current online ip from the always on configuration     
         $Command.CommandText = "SELECT ip_address FROM SYS.AVAILABILITY_GROUP_LISTENER_IP_ADDRESSES WHERE state_desc='ONLINE'";      
         $Command.CommandTimeOut = 0      
         $Read = $Command.ExecuteReader();      
         $Read.Read()      
         $whereami= 30      
         write-output $whereami >> $logfile      
         $ipactive = $Read.GetValue(0);                               
         $ip=$ipactive      
         #Adding the new ListenerName row on the hostfile         
         add-content -Encoding UTF8  "c:\windows\system32\Drivers\etc\hosts" "$ip ListenerName.corporatedomain.com ListenerName"       
         $whereami= 40      
         write-output $whereami >> $logfile     
     }
}
CATCH [Exception]
{    
   Write-output "ERROR -----> Error occured: $_.Exception" >> $logfile   
   write-output "Last where am I flag line:" $whereami >> $logfile   
   Start-Sleep -s 5
}

Also to execute this continuously, I created a task scheduler action and trigger it every 1 minute (So we ensure that after 1 minute the hostfile will be updated). Important thing… Updating the hostfile will cause that MULTISUBNETFAILOVER clause won’t WORK!!! So app team agreed on this risk.

Preventive action:

Perform all the test in staging before going live in prod! Test failovers, performance everything before moving on with any change!

Did you face the same situation? Please share your comments!

I hope this helps!

Replication= Bye Bye, Distribution Database!

We are in the middle of a migration and we need your help urgently! Replication is not working and the distribution database was deleted! The maintenance window is close to finish.

Hi, SQL Server Lovers!

Today I’ll talk about one critical situation that happened to me this week. Last Friday, the distribution database was accidentally deleted in one of our apps…. A team from the app was migrating from 2008 R2 to 2016 and they had this big issue:

  • Topology

Publisher and distributor on the same server with SQL Server 2016 SP1 Version

Subscriber with SQL Server 2016 SP1 Version

  • Issue:

A DBA tried to configure replication, but since there were some issues, he tried to remove the replication and start from scratch. Unfortunately, the disable part was not working, so he decided to execute this command:

PLEASE, IF YOU HAVE THIS SITUATION DO NOT DELETE THE DISTRIBUTION DATABASE!

FIX THE ISSUE THAT YOU ARE GETTING TO DISABLE THE REPLICATION!

ALTER DATABASE DISTRIBUTION SET OFFLINE;

GO

DROP DATABASE DISTRIBUTION;

GO

Which means that distribution DB was deleted… Obviously he tried to disable replication and he couldn’t:

Error:

Msg 911, Level 16, State 4, Procedure sp_MSrepl_helpsubscriberinfo, Line 120 [Batch Start Line 0]
Database ‘distribution’ does not exist. Make sure that the name is entered correctly.

  • Solution

So, here you have 2 ways:

  • Easy Way

After you delete the distribution database, the Database files remain on the same place (unless you delete them), so you should execute this script in order to create again the database:

CREATE DATABASE Distribution ON PRIMARY
( NAME = N’distribution’,

FILENAME = N’Last file path\Name.mdf’ ,

SIZE = 8192KB , FILEGROWTH = 65536KB )
LOG ON
( NAME = N’distribution_log’,

FILENAME = N’Last file path\Name.ldf’ , SIZE = 8192KB , FILEGROWTH = 65536KB )
FOR ATTACH;

Then you can disable again the replication:

  • Hard Way

After you deleted the distribution database, you also removed the files from the hard drivers and the recycle bin (Yes, believe it or not, this was my situation!). So, to fix this, you can follow this steps:

  1. Create a new distribution database:

USE MASTER

GO

EXEC SP_ADDDISTRIBUTIONDB @database = N’Distribution_New’,
@data_folder = ‘C:\Temp’,  –Choose your own folder name
@data_file = N’Distribution_New.MDF’, –Choose your own file name
@data_file_size = 128,
@log_folder = ‘C:\Temp’,  –Choose your own folder name
@log_file = N’Distribution_New.LDF’,  –Choose your own file name
@log_file_size = 64,
@min_distretention = 0,
@max_distretention = 72,
@history_retention = 48,
@security_mode = 1

2) Stop SQL Server Services

3) Copy/Paste the distribution_new files and rename them as the old distribution files database names

— Copy/Paste Distribution DB files

copy_pate

–Rename Both distribution DB files

Rename

4) Start SQL Server Service

5) Execute this script to attach the database again

CREATE DATABASE Distribution
ON PRIMARY
( NAME = N’distribution’, FILENAME = N’C:\temp\Distribution.mdf’ , SIZE = 8192KB , FILEGROWTH = 65536KB )
LOG ON
( NAME = N’distribution_log’, FILENAME = N’C:\temp\Distribution.ldf’ , SIZE = 8192KB , FILEGROWTH = 65536KB )
FOR ATTACH

6) Disable replication and distribution:

— Remove the local server as a Distributor.

USE master
GO

EXEC sp_dropdistributor @no_checks = 1;

Have you ever faced this same situation?? Can you add any insights?

Happy new year, sql fans!

Listener multisubnets + Replication= Where is the data??

“I migrated from standalone to HA using a listener multisubnet and replication is not working. I need your help immediately!”

Hi, SQL Server lovers!

Today I’ll talk about one situation that took me an entire year to fix/workaround/make a decision. Long story short: We migrated some instances from one data center to another. The particular situation was that we used a SQL Server Always on feature with listeners multisubnets, instead of single subnet.

Topology:

  • SQL Server 2014 Enterprise edition
  • Always on environment acting as publisher + Listener multisubnet
  • Remote Distributor
  • Always on Environment acting as subscriber + Listener multisubnet

The thing was that we didn’t know that some instances were using replication too…

So we configured a replication using a listener as a publisher and another listener as a suscriber (you can follow this post to do it this way).

Everything was working properly until one day I received a call saying: “Gonzalo: Replication is not working and our customers are waiting for the information for their reports. We need your help immediately because today we have to send the information to the leadership”. (Yes, as you can see that in a database world, everything is ASAP….).

After I connected to the instance, I saw this particular issue:

Unable to query the redirected server ‘Listener name’ for original publisher ‘Original publisher’ and publisher database ‘Database Name’ to determine the name of the remote server; Error 7303, Error message ‘Error 7303, Level 16, State 1, Message: Cannot initialize the data source object of OLE DB provider “SQLNCLI11” for linked server “Linked server hash”.’. (Source: MSSQL_REPL, Error number: MSSQL_REPL21879)

(You can also face the same situation on the subscriber as well. The error is: Cannot connect to subscriber Listener name)

So, after some hours, I connected to the distributor and saw that the publisher listener was not responding to the ping (it was pointing to the old ip). I executed a IPCONFIG /FLUSHDNS (2 or 3 times) and, after that, the listener started to ping and the situation solved.

But… It happened 3 or 4 times the same week = Leadership pushing for a permanent solution.

We opened a case to Microsoft and they said that the solution was to automatize this flushdns command. Some months ago, one Microsoft engineer posted a workaround, but these are my final findings on this issue:

  • Issue:

Replication use OLEDB driver to connect, which means that MULTISUBNET FAILOVER Clause is not supported and if the DNS change on your distributor replication will fail

  • Workarounds:
  1. Automate ipconfig /flushdns                    -> For us it didn’t work
  2. Reduce TTL parameter for the DNS        -> For us it didn’t work
  3. Create another listener on the always on + set parameter RegisterAllProvidersIP to 0 + HostRecordTTL = 60 + increase time out time for the replication profile -> The thing for this is that you will have 2 dependencies for the always on role and also…. In case of a failover  you will have to wait for the replication to start = No permanent solution
  • Our decission:

Since we have more than 30 Web servers pointing to the listener, we keep RegisterAllProvidersIP = 1 to use multisubnet listener clause and use a different approach to move data (Integration Services). Yes, it is true that we don’t have the data in seconds ( We moved the data every 5 minutes) but we achieve the challenge avoiding issues.

  • Permanent solution:

Some months ago, Pedro Lopez blogged about a new release for OLEDB driver, that could support MultisubnetFailover Clause. I hope that we can apply this new version to replication soon, so we can avoid this kind of issues!

Did you face the same situation? Did you use the same approach? Please share your findings too!