NOLOCK without NOLOCK, page latch time-out on tempdb and finally DUMP…

Infrastructure

Starting a new installation of SQL Server 2019 Enterprise (15.0.2000.5 / X64 on Windows Server 2019 Standard 10.0 / Build 17763) on a new physical machine (2 machines for AlwaysOn replicas).
The new machines are a LENOVO :
* ThinkSystem SR630 – [7X02CTO1WW]
* 1 CPU : 1 Xeon Gold 6208U – 2.90 Ghz (16 cores x 2 – hyperthreading)
* 256 Gb RAM (32 Gb x 8)
There are 2 new identical machines and the problem appears systematically on the 2 machines…

The tests that generate the malfunctions are as follows:
1) creation of a database
2) enlarging the database files and switching to SIMPLE recovery mode
3) creation of a table
4) filling the table with 1 million rows
5) execution of the test request (calculation of aggregated time intervals – collapsing)

It is this query (executed almost 10 times) which very often goes into error with the following message:

Msg 601, Level 12, State 1, Line … Could not continue scan with NOLOCK due to data movement.

Of course, at no time did we implement the NOLOCK hint or the READ UNCOMMITTED isolation level.
But… information is written to the SQL Server event log:
1) time-out for buffer latch in pages of tempdb
2) sometimes a dump is generated

Example of « buffer latch » messages :
A time-out occurred while waiting for buffer latch — type 4, bp 00000292CE3D60C0, page 9:18634, stat 0x10b, database id: 2, allocation unit Id: 422212527063040/140737550352384, task 0x00000292AB073468 : 9, waittime 300 seconds, flags 0x100000001a, owning task 0x00000292AB06B848. Not continuing to wait.
A time-out occurred while waiting for buffer latch — type 4, bp 00000292CE398340, page 6:10372, stat 0x10b, database id: 2, allocation unit Id: 422212527063040/140737550352384, task 0x00000292AB07B468 : 8, waittime 300 seconds, flags 0x1a, owning task 0x00000292AB073468. Not continuing to wait.
A time-out occurred while waiting for buffer latch — type 4, bp 00000292CE3DC480, page 9:18655, stat 0x10b, database id: 2, allocation unit Id: 422212527063040/140737550352384, task 0x00000292AB703C28 : 12, waittime 300 seconds, flags 0x1a, owning task 0x00000292AB07B468. Not continuing to wait.

Our investigations:
* Installing CU6 and after CU7 does not solve the problem
* Disk failure has been sweep of, because we create the database on the 3 differents hard disk drives and the problem remains, even when we move the tempdb.
* The « buffer latch » information indicate likely a memory fault. But no hardware informations appear about the RAM in the servers.
* While soft NUMA is enable, we do complementary tests by disabling SOFT NUMA, but is does not solve the problem.

Some elements that seems to decrease the occurrence of the phenomenon:
1) disabling SOFT NUMA (with hyperthreading enabled): the first 3 test are successes and last 7 are failure
2) disabling hyperthreading (with SOFT NUMA disabled): on 10 tests only 1 is a failure
3) disabling hyperthreading (with SOFT NUMA enabled): on 90 tests only 1 is a failure

Some elements that seems to solve the problem:
1) setting MAXDOP 1
2) recomputing statistics (UPDATE STATISTICS T_TIME_INTERVAL_TIV WITH FULLSCAN;)

Perhaps something to do with the new « Concurrent PFS updates » of SQL Server 2019 ?

========== SCRIPT ==========

-- 1) creating database
CREATE DATABASE DB_BENCH
GO

DECLARE @SQL NVARCHAR(max) = N'';
SELECT @SQL = @SQL + N'ALTER DATABASE DB_BENCH MODIFY FILE (NAME = ''' + name + N''', SIZE = 10 GB, FILEGROWTH = 64 MB);' 
FROM DB_BENCH.sys.database_files;
SET @SQL = @SQL + N'ALTER DATABASE DB_BENCH SET RECOVERY SIMPLE;'
EXEC (@SQL);
GO

USE DB_BENCH
GO

-- 2) creating table and view
CREATE TABLE T_TIME_INTERVAL_TIV
(TIV_ID       INT NOT NULL IDENTITY PRIMARY KEY,
 TIV_GROUP    INT,
 TIV_DEBUT    DATETIME2(0),
 TIV_FIN      DATETIME2(0))
GO 

CREATE VIEW V
AS
SELECT TIV_GROUP AS id, TIV_DEBUT AS intime, TIV_FIN AS outtime
FROM   T_TIME_INTERVAL_TIV
GO

-- 3) inserting datas
TRUNCATE TABLE T_TIME_INTERVAL_TIV;
GO

BULK INSERT T_TIME_INTERVAL_TIV
FROM "C:\DATA_SQL\intervals.txt"
WITH (KEEPIDENTITY , FIELDTERMINATOR = '\t',
      ROWTERMINATOR = '\n');
GO

-- 4) testing
SET NOCOUNT ON;
GO

SET STATISTICS TIME ON;
GO

WITH T1 As
(SELECT id, intime 
 FROM   V
 UNION  ALL
 SELECT id, outtime FROM V),
T2 As
(SELECT ROW_NUMBER() OVER(PARTITION BY id ORDER BY intime) NN, id, intime
 FROM   T1 T1_1),
T3 As
(SELECT T2_1.NN - ROW_NUMBER() OVER(PARTITION BY T2_1.id ORDER BY T2_1.intime,T2_2.intime) NN1,
        T2_1.id, T2_1.intime intime, T2_2.intime outtime
 FROM   T2 T2_1
        INNER JOIN T2 T2_2
              ON T2_1.id=T2_2.id
                 And T2_1.NN=T2_2.NN-1
 WHERE  EXISTS (SELECT *
                FROM   V S
                WHERE  S.id=T2_1.id
                  AND  (S.intime < T2_2.intime AND S.outtime>T2_1.intime))
    OR  T2_1.intime = T2_2.intime)
SELECT id, MIN(intime) intime, MAX(outtime) outtime
FROM   T3
GROUP  BY id, NN1
ORDER BY id, intime, outtime;

========== END OF SCRIPT ==========

Complementary test (2020-09-29)

On september 2020 the 29th, we decided to decrease the number of datafile of tempdb to 1. Setting MAXDOP = 7 on the 16 cores / 32 virtual cores (hyperthreading) the bug occurs systematically

========== PLAN COMPARISON ==========

As an information, estimated plans, compared between raw execution and after a
« UPDATE STATISTICS T_TIME_INTERVAL_TIV WITH FULLSCAN; » :