Thursday, November 21, 2019

Rapid Query Store Growth

I recently had a problem that our Query Store size growth increased dramatically. Our initial size was 512MB and this was sufficient for 2 months. Then one day Query Store flipped to Read Only mode because it ran out of space. A quick manual expansion and flipping back to Read Write fixed the issue. I was then in the same situation within a couple days. Another quick fix had it running again and then the research into the rapid change to the growth rate started.

What I discovered is that a consultant on the application side found that in his test case the query plan for the actual parameter value was more efficient so he implemented forceliterals in Dynamics. What the unintended consequence was to eliminate plan reuse. We went from 1 plan for the query to 30,000! This was why Query Store was growing so rapidly.

What I found was that we had 51,212 different query plans. Of that total 80% of the query plans were from ONLY 4 statements!


--Find the total count of stored query plans
SELECT count(*)
FROM sys.query_store_query_text t 
JOIN sys.query_store_query q ON t.query_text_id = q.query_text_id 


 --Find the count of query plans by stemmed query text (queries not reusing plans)
select  left(txt.query_sql_text,50), count(distinct Pl.query_id) QueryCount, count(distinct Pl.plan_id) PlanCount
FROM sys.query_store_plan AS Pl  
JOIN sys.query_store_query AS Qry ON Pl.query_id = Qry.query_id  
JOIN sys.query_store_query_text AS Txt ON Qry.query_text_id = Txt.query_text_id
group by left(txt.query_sql_text,50)
having count(distinct Pl.query_id) > 1000
order by count(distinct Pl.plan_id) desc;  


Once I had the problem identified then I decided to help purge the Query Store.

First I needed a way to watch the Query Store to find if it filled and flipped to Read Only and when that happened.


--verify query store is collecting data
SELECT actual_state_desc, desired_state_desc, current_storage_size_mb,   
    max_storage_size_mb, readonly_reason, interval_length_minutes,   
    stale_query_threshold_days, size_based_cleanup_mode_desc,   
    query_capture_mode_desc  
FROM sys.database_query_store_options;  


--when did query store flip to readonly --time is in UTC
select top 1 last_execution_time 
from sys.query_store_query 
order by last_execution_time desc


Next, since the 4 queries accounted for 80% of my plans I decided to purge these regularly. I am using a SQL Job purging these queries every 10 minutes as the cursor for the delete is a bit slow.

--remove specific queries that are not reusing query plans
DECLARE @query_sql_text TABLE (sql_text varchar(100), QueryCount INT, PlanCount INT);
declare @query_sql_text_state varchar(100);

--count of query plans by stemmed query text
select  left(txt.query_sql_text,50), count(distinct Pl.query_id) QueryCount, count(distinct Pl.plan_id) PlanCount
FROM sys.query_store_plan AS Pl  
JOIN sys.query_store_query AS Qry ON Pl.query_id = Qry.query_id  
JOIN sys.query_store_query_text AS Txt ON Qry.query_text_id = Txt.query_text_id
group by left(txt.query_sql_text,50)
having count(distinct Pl.query_id) > 1000
order by count(distinct Pl.plan_id) desc;  

--loop through all query plans in @query_sql_text
While (Select Count(*) From @query_sql_text) > 0
Begin

Select Top 1 @query_sql_text_state = sql_text From @query_sql_text order by sql_text asc
print @query_sql_text_state

 --remove the query_ids from query store
 DECLARE @id int
 DECLARE adhoc_queries_cursor CURSOR
 FOR
  select top 100  Pl.query_id
  FROM sys.query_store_plan AS Pl  
  JOIN sys.query_store_query AS Qry ON Pl.query_id = Qry.query_id  
  JOIN sys.query_store_query_text AS Txt ON Qry.query_text_id = Txt.query_text_id
  where left(trim(txt.query_sql_text),50) = @query_sql_text_state 
  group by Pl.query_id
  order by min(Qry.last_execution_time) asc

 OPEN adhoc_queries_cursor;
 FETCH NEXT FROM adhoc_queries_cursor INTO @id;
 WHILE @@fetch_status = 0
  BEGIN
   PRINT @id
   EXEC sp_query_store_remove_query @id
   FETCH NEXT FROM adhoc_queries_cursor INTO @id
  END
 CLOSE adhoc_queries_cursor;
 DEALLOCATE adhoc_queries_cursor;

delete Top (1) From @query_sql_text where sql_text = @query_sql_text_state

End

Now that Query Store size is better managed and I can get back to tuning the poor performing queries.

Thursday, November 14, 2019

Maximize Resumable Indexes

Resumable Indexes were a new feature introduced with SQL 2017. At first glance it didn't make sense why I would want to modify our maintenance plans to include resumable indexes because everything was running fine. This was one of those instances of we don't need resumable index rebuilds until we needed them.

We are running Microsoft Dynamics 4.0 that was released in 2007. This likely isn't the oldest software that you have seen being band-aided to stay in production, but it did not like running on SQL 2017. We had recently upgraded our database from 2008 to 2017 (that is an adventure for another day) and performance was lagging. Dynamics was making poor choices for query plans and I found that we needed to keep our index fragmentation much tighter than we previously had. Any fragmentation above 5% could potentially cause issues the query plan that Dynamics chose. Dynamics seemed to be choosing the query plans independently of SQL query engine to use in its overuse of cursors.

With a total database nearing 4TB of primarily compressed tables, tables approaching 1 billion rows, and tables with more data in indexes than data in the table; there just wasn't time in the day to rebuild all the indexes.

This is where the resumable index rebuilds came into save us. Resumable index rebuilds do just what they sound like, allow you to start a rebuild and pause it for whatever reason and then continue the rebuild later. You do not lose the time spent rebuilding indexes if you need to kill the rebuild process because of blocking against your production workload! When you combine this with the WAIT_AT_LOW_PRIORITY option you can run the rebuild and not babysit for interruptions to production!

Here is the basic code for rebuilding the index with a auto kill feature so you don't bring down production.


--Author:      Todd Palecek 
--Description: Rebuild indexes with a resumable and auto-kill method

--This will start the index rebuild and run for up to 35 minutes. If the rebuild is blocking it will
--only block for 1 minute before killing itself. 
ALTER INDEX [I_177DIMIDIDX] ON [dbo].[INVENTTRANS]
REBUILD WITH (ONLINE = ON (WAIT_AT_LOW_PRIORITY ( MAX_DURATION = 1 MINUTES, ABORT_AFTER_WAIT = SELF ))
,RESUMABLE = ON,MAX_DURATION =35 MINUTES, maxdop=8);
GO


Now if your rebuild is causing blocking and dies you will need to restart it. This frequently happens when SQL is switching in the rebuild index for me.



--resume the rebuild for another x minutes and then pause
ALTER INDEX I_366MKCCUSTSALESSTATUS ON [dbo].SALESTABLE  Resume WITH (MAX_DURATION =1 MINUTES)
GO


A nice benefit of resumable indexes that the percent complete will show up in sp_whoisactive. You can also monitor progress here.



--monitor percent complete and duration
SELECT total_execution_time, percent_complete, name,state_desc
,last_pause_time,page_count
FROM sys.index_resumable_operations;