Wednesday, March 4, 2020

Amazon Seller Central Customer Search Term Keyword Performance

Amazon has used data science and machine learning to great benefit for itself but has not shared those tools down to the seller. We are going to analyze the monthly Amazon Sponsored Products report, looking specifically at the individual keywords and which of them resulted in click-thrus for each campaign. This will let us better understand performance and improve our keyword selection for future campaigns.

I'm going to show you one option for analyzing the performance of your Sponsored Products using R, with a Power BI front end that gives your executives and other end users an easy way to interact with the information.

First, log in to your Amazon Seller Central account. Go to Reports > Advertising Reports. From here, select Create report to build a new report.

Once the report is created, save it locally and rename the file to include the month and year so we can maintain our historical files. Open the report and save it as a csv in preparation for importing it for processing in SQL and R. Close the file once done.

We are going to import the file into our Grok database. This is the database we use as a repository for all our data science projects. It is built on SQL Server 2017 Enterprise Edition to take advantage of Memory Optimized tables. There are no Memory Optimized tables in this project; those are reserved for my Product Recommendation Engine. We are going to import the file as a new table to avoid errors on the import. This step really should be automated (see the BULK INSERT sketch after the wizard steps below). Until we spend the time to automate it, follow these instructions:

Right-click the Grok database and select Tasks > Import Data, then choose Flat File Source. Select the csv file we saved in an earlier step. Change the Text qualifier to " in the General section. In the Advanced section, change the output column width to 500 for all columns. Set your destination to SQL Server Native Client 11.0 and select the server name where the Grok database lives. Change the destination table name to STGSponsoredProductsSearchTermReport.
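
One way to automate this load once the file location is stable is a plain BULK INSERT. This is only a sketch: it assumes the staging table already exists with varchar(500) columns matching the report header, that the file path (which is made up here) is readable by the SQL Server service account, and that you are on SQL Server 2017 or later for the CSV options.

--Hedged automation sketch: load the saved csv straight into the staging table
BULK INSERT [Grok].[dbo].[STGSponsoredProductsSearchTermReport]
FROM '\\fileserver\reports\SponsoredProductsSearchTermReport_Jan2020.csv'  --hypothetical path
WITH (
 FORMAT = 'CSV',      --SQL Server 2017+ csv parser
 FIELDQUOTE = '"',    --same as the Text qualifier setting in the wizard
 FIRSTROW = 2,        --skip the header row
 CODEPAGE = '65001'   --file saved as UTF-8
);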

Before we leave the database, we need a few tables created in order to complete processing of the data. Here are those table create scripts:


USE [Grok]
GO

SET ANSI_NULLS ON
GO

SET QUOTED_IDENTIFIER ON
GO

CREATE TABLE [dbo].[stgR_AmzSearchTerm](
 [Keyword] [varchar](255) NULL,
 [GoodFreq] [float] NULL,
 [PoorFreq] [float] NULL,
 [Good] [float] NULL,
 [Poor] [float] NULL,
 [Targeting] [varchar](255) NULL,
 [RptType] [varchar](255) NULL,
 [Mth] [varchar](255) NULL,
 [Year] [float] NULL
) ON [PRIMARY]
GO

CREATE TABLE [dbo].[R_AmzSearchTerm](
 [ID] [int] IDENTITY(1,1) NOT NULL,
 [Keyword] [varchar](255) NULL,
 [GoodFreq] [float] NULL,
 [PoorFreq] [float] NULL,
 [Good] [float] NULL,
 [Poor] [float] NULL,
 [Targeting] [varchar](255) NULL,
 [RptType] [varchar](255) NULL,
 [Mth] [varchar](255) NULL,
 [Year] [float] NULL,
 CONSTRAINT [pk_ID_AmzSearchTermID] PRIMARY KEY CLUSTERED 
(
 [ID] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]
GO

CREATE TABLE [dbo].[R_AmzSearchTermSummary](
 [ID] [int] IDENTITY(1,1) NOT NULL,
 [StartDate] [datetime] NULL,
 [EndDate] [datetime] NULL,
 [Campaign Name] [varchar](255) NULL,
 [Ad Group Name] [varchar](255) NULL,
 [Targeting] [varchar](255) NULL,
 [Match Type] [varchar](255) NULL,
 [CustomerSearchTerms] [int] NULL,
 [Impressions] [int] NULL,
 [Clicks] [int] NULL,
 [Mth] [varchar](12) NULL,
 [Year] [int] NULL,
 CONSTRAINT [pk_ID_AmzSearchTermSumID] PRIMARY KEY CLUSTERED 
(
 [ID] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]
GO


Now that we have our base data, it is time to move into RStudio. Create a new file called R_AmazonSearchKeywordsEncoding.R from the script below. Please note that the encoding is important for this file due to special characters included in the Amazon results. To set the encoding, create the new file and then select from the RStudio menu: File > Save with Encoding and choose UTF-8.

Change the var.Mth and var.Year variables to match the file being processed. Also change the cn2 and cnWrite connections to point at your database. Now you can run the script, and the keywords for each campaign will be analyzed for click-thru performance. The results are written to the stgR_AmzSearchTerm table created earlier in the Grok database.

#Author:      Todd Palecek (toddpalecek@gmail.com)
#Description: Ingest amazon search terms and transform into good vs bad 
#keywords to be written to a database and then visualized from Power BI

library(RODBC)
library(tm)
library(lsa)
library(ggplot2)
library(doSNOW)
library(SnowballC) #for stem
library(plyr)
library(tidytext)#library(dplyr)
library(wordcloud) #now used thru Power BI
library(stringr) #for str_trim

#variable for report name and key to write to SQL
var.RptType <- 'SponsoredProductsSearchTermReport'
var.Mth <- 'Jan'  
var.Year <- 2020

#database connections
cn2 <- odbcDriverConnect(connection="Driver={SQL Server Native Client 11.0};server=dc1-dwdb1;database=Grok;trusted_connection=yes;")
cnWrite <- odbcDriverConnect(connection="Driver={SQL Server Native Client 11.0};server=dc1-dwdb1;database=Grok;trusted_connection=yes;")

#sql data select
sqlstate <- "SELECT [Campaign Name]
      ,[Ad Group Name]
      ,trim([Targeting]) Targeting
      ,[Match Type]
      ,trim([Customer Search Term]) CustomerSearchTerm
      ,cast([Impressions] as INT) Impressions
      ,cast([Clicks] as INT) Clicks
     ,(cast([Clicks] as decimal(8,2))) /cast([Impressions] as decimal(8,2))   ClickThruRate
      ,cast([7 Day Total Orders (#)] as int) SevenDayTtlOrders
      FROM [Grok].[dbo].[STGSponsoredProductsSearchTermReport]
      where cast([Impressions] as INT) > 0"

#data
dataFetch2 <-sqlQuery(cn2, sqlstate, rows_at_time=1000, stringsAsFactors = FALSE)
close(cn2)

#cleanse data
dataFetch2$Targeting <- paste(tolower(dataFetch2$Targeting), " ")
dataFetch2$CustomerSearchTerm <- paste(tolower(dataFetch2$CustomerSearchTerm), " ")
#remove blank space at front and end
dataFetch2$Targeting <- str_trim(dataFetch2$Targeting)
dataFetch2$CustomerSearchTerm <- str_trim(dataFetch2$CustomerSearchTerm)
#remove special characters because they are not caught by the punctuation removal
dataFetch2$CustomerSearchTerm <-gsub("\\“", "", dataFetch2$CustomerSearchTerm)
dataFetch2$CustomerSearchTerm <-gsub("\\’", "", dataFetch2$CustomerSearchTerm)
dataFetch2$CustomerSearchTerm <-gsub("\\â€", "", dataFetch2$CustomerSearchTerm)

#replace * because breaks my loop
dataFetch2$Targeting <-gsub("\\*", "wildcard", dataFetch2$Targeting)

#click thru rate of 25% or less = Poor, above 25% = Good
dataFetch2$ClickPerf <- ifelse(dataFetch2$ClickThruRate <= .25,0, 1)

#subset for a specific Targeting campaign
AmazonPerf <- subset(dataFetch2,Targeting=='personalized ornament')  

#replace Targeting Words in Customer Search Term with blank so do not reanalyze them
AmazonPerf$CustomerSearchTerm <- gsub("ornaments", " ", AmazonPerf$CustomerSearchTerm)
#remove punctuation, lose the + for exact term
AmazonPerf$CustomerSearchTerm <- gsub('[[:punct:]]', '', AmazonPerf$CustomerSearchTerm)
#remove the s from plural search term rather than stemming words
AmazonPerf$CustomerSearchTerm <- gsub(" s ", "", AmazonPerf$CustomerSearchTerm)
#remove blank space at front and end
AmazonPerf$CustomerSearchTerm <- str_trim(AmazonPerf$CustomerSearchTerm)

#split into good and poor performing datasets
AmazonPerfGood <- subset(AmazonPerf,ClickPerf==1)  
AmazonPerfPoor <- subset(AmazonPerf,ClickPerf==0)  


#Good performing keywords
#overall word frequency, Customer Search Term
c <- AmazonPerfGood[,5]
cGood <- c
#convert data frame to corpus
crude <- Corpus(VectorSource(as.character(c))) 
# Transform document words to lower case
crude <- tm_map(crude, content_transformer(tolower))
## Remove punctuation from documents
#crude <- tm_map(crude, removePunctuation)
# Remove stopwords from the corpus
crude <- tm_map(crude, removeWords, stopwords("english"))
crudeGood <- crude
#Word frequency
myTdm <- as.matrix(TermDocumentMatrix(crude))
FreqMat <- data.frame(ST = rownames(myTdm), 
                      Freq = rowSums(myTdm), 
                      row.names = NULL)

attach(FreqMat)
FreqMat <- FreqMat[order(Freq,decreasing=TRUE),]
detach(FreqMat)
#head(FreqMat)
FreqMatGood <- FreqMat


#poor performing keywords
#overall word frequency, Customer Search Term
c <- AmazonPerfPoor[,5]
cPoor <- c
#convert data frame to corpus
crude <- Corpus(VectorSource(as.character(c))) 
# Transform document words to lower case
crude <- tm_map(crude, content_transformer(tolower))
# Remove stopwords from the corpus
crude <- tm_map(crude, removeWords, stopwords("english"))
# Stem the corpus
#crude <- tm_map(crude, stemDocument, language = "english")
crudePoor <- crude
#Word frequency
myTdm <- as.matrix(TermDocumentMatrix(crude))
FreqMat <- data.frame(ST = rownames(myTdm), 
                      Freq = rowSums(myTdm), 
                      row.names = NULL)

attach(FreqMat)
FreqMat <- FreqMat[order(Freq,decreasing=TRUE),]
detach(FreqMat)
#head(FreqMat)
FreqMatPoor <- FreqMat

FreqComp <- merge(x=FreqMatGood, y=FreqMatPoor, by="ST", all = TRUE)
FreqComp[is.na(FreqComp$Freq.x),"Freq.x"] <- 0
FreqComp[is.na(FreqComp$Freq.y),"Freq.y"] <- 0

FreqComp$ClickPerf <- ifelse(FreqComp$Freq.x == FreqComp$Freq.y, 1, ifelse(FreqComp$Freq.x > FreqComp$Freq.y, 1, 0))
FreqComp$FreqGood <- FreqComp$Freq.x - FreqComp$Freq.y
FreqComp$FreqPoor <- FreqComp$Freq.y - FreqComp$Freq.x
FreqCompGood <- subset(FreqComp,ClickPerf==1)  
FreqCompPoor <- subset(FreqComp,ClickPerf==0)  


attach(FreqComp)
FreqComp <- FreqComp[order(ST,decreasing=FALSE),]
detach(FreqComp)

colnames(FreqComp) = c("Keyword","GoodFreq","PoorFreq","ClickPerf","Good", "Poor")


#extract unique Targeting campaigns to loop through later
sku_list <- unique(dataFetch2$Targeting)

#create blank data frame to populate results into
rulestoDBALL <- data.frame()

#loop through each Targeting campaign to build keyword frequencies
for (i in 1:length(sku_list)){
  AmazonPerf <- dataFetch2[which(dataFetch2$Targeting==sku_list[i]),]
 
  #replace Targeting Words in Customer Search Term with blank so do not reanalyze them
  #split string of words
  s <- (sku_list[i])
  #strsplit(s, split = " ")
  #create lookup table with words to replace
  l <- strsplit(s, split = " ")
  list_gsub <- data.frame(word = unlist(l), stringsAsFactors = FALSE)  #one row per Targeting word
  
  #loop through each lookup row
  for(x in 1:nrow(list_gsub)){
    AmazonPerf$CustomerSearchTerm <- gsub(list_gsub[x, "word"], " ", AmazonPerf$CustomerSearchTerm)
  }
  
  #old replace Targeting Words in Customer Search Term with blank so do not reanalyze them
  #remove punctuation, lose the + for exact term
  AmazonPerf$CustomerSearchTerm <- gsub('[[:punct:]]', '', AmazonPerf$CustomerSearchTerm)
  #remove the s from plural search term rather than stemming words
  AmazonPerf$CustomerSearchTerm <- gsub(" s ", "", AmazonPerf$CustomerSearchTerm)
  #remove blank space at front and end
  AmazonPerf$CustomerSearchTerm <- str_trim(AmazonPerf$CustomerSearchTerm)
  
  #split into good and poor performing datasets
  AmazonPerfGood <- subset(AmazonPerf,ClickPerf==1)  
  AmazonPerfPoor <- subset(AmazonPerf,ClickPerf==0)  
  
  
  #Good performing keywords
  #overall word frequency, Customer Search Term
  c <- AmazonPerfGood[,5]
  cGood <- c
  #convert data frame to corpus
  crude <- Corpus(VectorSource(as.character(c))) 
  # Remove stopwords from the corpus
  crude <- tm_map(crude, removeWords, stopwords("english"))

  crudeGood <- crude
  #Word frequency
  myTdm <- as.matrix(TermDocumentMatrix(crude))
  FreqMat <- data.frame(ST = rownames(myTdm), 
                        Freq = rowSums(myTdm), 
                        row.names = NULL)
  
  attach(FreqMat)
  FreqMat <- FreqMat[order(Freq,decreasing=TRUE),]
  detach(FreqMat)
  FreqMatGood <- FreqMat
  
  if(class(FreqMatGood)=="numeric") {
    FreqMatGood <- NULL
    FreqMatGood <- data.frame(ST=character(),
                              Freq=integer())
  }
  
  FreqMat <- NULL
  #poor performing keywords
  #overall word frequency, Customer Search Term
  c <- AmazonPerfPoor[,5]
  cPoor <- c
  #convert data frame to corpus
  crude <- Corpus(VectorSource(as.character(c))) 
  # Remove stopwords from the corpus
  crude <- tm_map(crude, removeWords, stopwords("english"))

  crudePoor <- crude
  #Word frequency
  myTdm <- as.matrix(TermDocumentMatrix(crude))
  FreqMat <- data.frame(ST = rownames(myTdm), 
                        Freq = rowSums(myTdm), 
                        row.names = NULL)
  
  attach(FreqMat)
  FreqMat <- FreqMat[order(Freq,decreasing=TRUE),]
  detach(FreqMat)
  FreqMatPoor <- FreqMat
  
  #if FreqMatPoor came back empty, rebuild it as an empty data frame
  if(class(FreqMatPoor)=="numeric") {
    FreqMatPoor <- NULL
    FreqMatPoor <- data.frame(ST=character(),
                              Freq=integer())
  }
  
  
  FreqComp <- merge(x=FreqMatGood, y=FreqMatPoor, by="ST", all = TRUE)
  FreqComp[is.na(FreqComp$Freq.x),"Freq.x"] <- 0
  FreqComp[is.na(FreqComp$Freq.y),"Freq.y"] <- 0
  FreqComp$FreqGood <- FreqComp$Freq.x - FreqComp$Freq.y
  FreqComp$FreqPoor <- FreqComp$Freq.y - FreqComp$Freq.x

  attach(FreqComp)
  FreqComp <- FreqComp[order(ST,decreasing=FALSE),]
  detach(FreqComp)
  
  colnames(FreqComp) = c("Keyword","GoodFreq","PoorFreq","Good", "Poor")
  
  
  #convert rules to data frame in prep to write to DB
  rulestoDB <- FreqComp
  rulestoDB$Targeting <- (sku_list[i]) 
  rulestoDBALL <- rbind(rulestoDBALL,rulestoDB)
  #clear data sets
  l <- NULL
  s <- NULL
  crude <- NULL
  crudeGood <- NULL
  crudePoor <- NULL
  FreqComp <- NULL
  FreqMatGood <- NULL
  FreqMatPoor <- NULL
  rulestoDB <- NULL
}

rulestoDBALL$RptType <- var.RptType
rulestoDBALL$Mth <- var.Mth
rulestoDBALL$Year <- var.Year

#export data to database
sqlSave(cnWrite,rulestoDBALL,tablename = "stgR_AmzSearchTerm",rownames=FALSE,append = TRUE, fast = FALSE )



Now we can jump back into our database and move the Stage results into our permanent historical table, R_AmzSearchTerm.


--Insert Search Term Details
INSERT INTO [Grok].[dbo].[R_AmzSearchTerm]
           ([Keyword],[GoodFreq],[PoorFreq],[Good],[Poor],[Targeting],[RptType],[Mth],[Year])
  SELECT [Keyword],[GoodFreq],[PoorFreq],[Good],[Poor],[Targeting],[RptType],[Mth],[Year]
  FROM [Grok].[dbo].[stgR_AmzSearchTerm]

--Change the Month and Year before running
--Insert Campaign Level Summary
INSERT INTO [dbo].[R_AmzSearchTermSummary]
           ([StartDate],[EndDate],[Campaign Name],[Ad Group Name],[Targeting],[Match Type],[CustomerSearchTerms]
           ,[Impressions],[Clicks],[Mth],[Year])   
   SELECT  min([Start Date]) StartDate,max([End Date]) EndDate
      ,[Campaign Name],[Ad Group Name],[Targeting],[Match Type]
      ,count([Customer Search Term]) CustomerSearchTerms
      ,sum(cast([Impressions] as INT)) Impressions
      ,sum(cast([Clicks] as INT)) Clicks
   ,'Jan'
   ,2020
  FROM [Grok].[dbo].STGSponsoredProductsSearchTermReport 
  group by [Campaign Name],[Ad Group Name],[Targeting],[Match Type]


And let's clean up our temp tables for the next run:


truncate table [Grok].[dbo].[stgR_AmzSearchTerm]
GO
drop table [Grok].[dbo].[STGSponsoredProductsSearchTermReport]
GO


Now we can take a look at our Power BI report that we are going to share with our end users and executives to show them how data science can be used to improve decision making.

The report can be downloaded from here:
AmazonSearchTerms

I am using Power BI Desktop, which is free, as well as the online portal to share out the report. The desktop version is great because anyone can install it. Sharing files becomes a bit trickier but is manageable.

The report is broken up into 5 sections.

Keyword Performance - Lists each individual keyword and how it performed. The Performance column is the difference between the Good Count (Click Thru Rate greater than 25%) and the Poor Count (Click Thru Rate of 25% or less).

Targeting Campaign Summary - Summary of each campaign. Customer Search Terms is the count of unique search terms that were used. Impressions is the number of times the ad was shown to a customer. Clicks is the number of times a customer clicked through. Click Thru Rate % is Clicks as a percentage of Impressions.

Targeting Keyword Detail - All Keywords as columns and Targeting Campaigns as rows. It becomes useful once you limit it to a single campaign.

Good Keywords - Wordcloud of keywords generating greater than 25% click through.

Poor Keywords - Wordcloud of keywords generating less than 25% click through.

I like to use the report by selecting a keyword to see which Targeting Campaigns it was used in. I then select the Targeting Campaign to see how the different words performed within that campaign.
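
To sanity check what the visuals are showing, the same numbers can be pulled straight from the tables we loaded. This is only a rough sketch of how I read the columns, not the exact measures defined in the pbix; adjust the month and year filter to the file you processed.

--Keyword Performance: Good Count minus Poor Count per keyword
SELECT [Keyword], [Targeting]
      ,SUM([GoodFreq]) AS GoodCount
      ,SUM([PoorFreq]) AS PoorCount
      ,SUM([GoodFreq]) - SUM([PoorFreq]) AS Performance
FROM [Grok].[dbo].[R_AmzSearchTerm]
WHERE [Mth] = 'Jan' AND [Year] = 2020
GROUP BY [Keyword], [Targeting]
ORDER BY Performance DESC

--Targeting Campaign Summary: Click Thru Rate % per campaign
SELECT [Campaign Name], [Targeting]
      ,SUM([CustomerSearchTerms]) AS CustomerSearchTerms
      ,SUM([Impressions]) AS Impressions
      ,SUM([Clicks]) AS Clicks
      ,CAST(SUM([Clicks]) AS decimal(10,2)) / NULLIF(SUM([Impressions]), 0) * 100 AS ClickThruRatePct
FROM [Grok].[dbo].[R_AmzSearchTermSummary]
WHERE [Mth] = 'Jan' AND [Year] = 2020
GROUP BY [Campaign Name], [Targeting]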

Thursday, November 21, 2019

Rapid Query Store Growth

I recently had a problem where our Query Store growth rate increased dramatically. Our initial size was 512MB, and this was sufficient for 2 months. Then one day Query Store flipped to Read Only mode because it ran out of space. A quick manual expansion and a flip back to Read Write fixed the issue. I was back in the same situation within a couple of days. Another quick fix had it running again, and then the research into the sudden change in growth rate started.
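
For reference, the quick fix amounted to something like this (a minimal sketch; the database name and new size are placeholders for your environment):

--expand Query Store and flip it back to Read Write
ALTER DATABASE [YourDatabase] SET QUERY_STORE (MAX_STORAGE_SIZE_MB = 1024);
GO
ALTER DATABASE [YourDatabase] SET QUERY_STORE (OPERATION_MODE = READ_WRITE);
GO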

What I discovered is that a consultant on the application side found that, in his test case, the query plan built for the actual parameter value was more efficient, so he implemented forceliterals in Dynamics. The unintended consequence was to eliminate plan reuse. We went from 1 plan for the query to 30,000! This was why Query Store was growing so rapidly.

In total we had 51,212 different query plans, and 80% of them came from ONLY 4 statements!


--Find the total count of stored query plans
SELECT count(*)
FROM sys.query_store_query_text t 
JOIN sys.query_store_query q ON t.query_text_id = q.query_text_id 


 --Find the count of query plans by stemmed query text (queries not reusing plans)
select  left(txt.query_sql_text,50), count(distinct Pl.query_id) QueryCount, count(distinct Pl.plan_id) PlanCount
FROM sys.query_store_plan AS Pl  
JOIN sys.query_store_query AS Qry ON Pl.query_id = Qry.query_id  
JOIN sys.query_store_query_text AS Txt ON Qry.query_text_id = Txt.query_text_id
group by left(txt.query_sql_text,50)
having count(distinct Pl.query_id) > 1000
order by count(distinct Pl.plan_id) desc;  


Once I had identified the problem, I decided to purge the Query Store.

First I needed a way to watch the Query Store so I would know if it filled up and flipped to Read Only, and when that happened.


--verify query store is collecting data
SELECT actual_state_desc, desired_state_desc, current_storage_size_mb,   
    max_storage_size_mb, readonly_reason, interval_length_minutes,   
    stale_query_threshold_days, size_based_cleanup_mode_desc,   
    query_capture_mode_desc  
FROM sys.database_query_store_options;  


--when did query store flip to readonly --time is in UTC
select top 1 last_execution_time 
from sys.query_store_query 
order by last_execution_time desc


Next, since those 4 queries accounted for 80% of my plans, I decided to purge them regularly. I am using a SQL Agent job that purges these queries every 10 minutes, as the cursor for the delete is a bit slow.

--remove specific queries that are not reusing query plans
DECLARE @query_sql_text TABLE (sql_text varchar(100), QueryCount INT, PlanCount INT);
declare @query_sql_text_state varchar(100);

--populate the work list with the count of query plans by stemmed query text
INSERT INTO @query_sql_text (sql_text, QueryCount, PlanCount)
select  left(txt.query_sql_text,50), count(distinct Pl.query_id) QueryCount, count(distinct Pl.plan_id) PlanCount
FROM sys.query_store_plan AS Pl  
JOIN sys.query_store_query AS Qry ON Pl.query_id = Qry.query_id  
JOIN sys.query_store_query_text AS Txt ON Qry.query_text_id = Txt.query_text_id
group by left(txt.query_sql_text,50)
having count(distinct Pl.query_id) > 1000
order by count(distinct Pl.plan_id) desc;  

--loop through all query plans in @query_sql_text
While (Select Count(*) From @query_sql_text) > 0
Begin

Select Top 1 @query_sql_text_state = sql_text From @query_sql_text order by sql_text asc
print @query_sql_text_state

 --remove the query_ids from query store
 DECLARE @id int
 DECLARE adhoc_queries_cursor CURSOR
 FOR
  select top 100  Pl.query_id
  FROM sys.query_store_plan AS Pl  
  JOIN sys.query_store_query AS Qry ON Pl.query_id = Qry.query_id  
  JOIN sys.query_store_query_text AS Txt ON Qry.query_text_id = Txt.query_text_id
  where left(trim(txt.query_sql_text),50) = @query_sql_text_state 
  group by Pl.query_id
  order by min(Qry.last_execution_time) asc

 OPEN adhoc_queries_cursor;
 FETCH NEXT FROM adhoc_queries_cursor INTO @id;
 WHILE @@fetch_status = 0
  BEGIN
   PRINT @id
   EXEC sp_query_store_remove_query @id
   FETCH NEXT FROM adhoc_queries_cursor INTO @id
  END
 CLOSE adhoc_queries_cursor;
 DEALLOCATE adhoc_queries_cursor;

delete Top (1) From @query_sql_text where sql_text = @query_sql_text_state

End

Now the Query Store size is better managed and I can get back to tuning the poor performing queries.

Thursday, November 14, 2019

Maximize Resumable Indexes

Resumable index rebuilds were a new feature introduced with SQL Server 2017. At first glance it didn't make sense why I would want to modify our maintenance plans to include them, because everything was running fine. This was one of those cases where we didn't need resumable index rebuilds until we suddenly did.

We are running Microsoft Dynamics 4.0, which was released in 2007. This likely isn't the oldest software you have seen band-aided to stay in production, but it did not like running on SQL 2017. We had recently upgraded our database from 2008 to 2017 (that is an adventure for another day) and performance was lagging. Dynamics was making poor choices for query plans, and I found that we needed to keep our index fragmentation much tighter than we previously had. Any fragmentation above 5% could potentially cause issues with the query plan that Dynamics chose. Dynamics seemed to choose the query plans it used for its many cursors independently of the SQL Server query engine.

With a total database nearing 4TB of primarily compressed tables, tables approaching 1 billion rows, and tables with more data in indexes than in the table itself, there just wasn't time in the day to rebuild all the indexes.

This is where resumable index rebuilds came in to save us. Resumable index rebuilds do just what they sound like: they allow you to start a rebuild, pause it for whatever reason, and then continue the rebuild later. You do not lose the time spent rebuilding indexes if you need to kill the rebuild process because it is blocking your production workload! When you combine this with the WAIT_AT_LOW_PRIORITY option, you can run the rebuild and not babysit it for interruptions to production!

Here is the basic code for rebuilding the index with an auto-kill feature so you don't bring down production.


--Author:      Todd Palecek 
--Description: Rebuild indexes with a resumable and auto-kill method

--This will start the index rebuild and run for up to 35 minutes. If the rebuild is blocking it will
--only block for 1 minute before killing itself. 
ALTER INDEX [I_177DIMIDIDX] ON [dbo].[INVENTTRANS]
REBUILD WITH (ONLINE = ON (WAIT_AT_LOW_PRIORITY (MAX_DURATION = 1 MINUTES, ABORT_AFTER_WAIT = SELF)),
RESUMABLE = ON, MAX_DURATION = 35 MINUTES, MAXDOP = 8);
GO


Now, if your rebuild causes blocking and dies, you will need to restart it. For me this frequently happens when SQL Server is switching in the rebuilt index.



--resume the rebuild for another x minutes and then pause
ALTER INDEX I_366MKCCUSTSALESSTATUS ON [dbo].SALESTABLE RESUME WITH (MAX_DURATION = 1 MINUTES)
GO


A nice benefit of resumable indexes is that the percent complete will show up in sp_whoisactive. You can also monitor progress here:



--monitor percent complete and duration
SELECT total_execution_time, percent_complete, name,state_desc
,last_pause_time,page_count
FROM sys.index_resumable_operations;
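
If you do need to step in by hand, a resumable rebuild can also be paused or abandoned explicitly. A quick sketch using the same index as above:

--manually pause a running resumable rebuild (progress is kept)
ALTER INDEX [I_177DIMIDIDX] ON [dbo].[INVENTTRANS] PAUSE;
GO
--abandon the rebuild entirely and discard the work done so far
ALTER INDEX [I_177DIMIDIDX] ON [dbo].[INVENTTRANS] ABORT;
GO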

Friday, October 18, 2019

R Market Basket Analysis for Product Recommendations

After attending the Data Science Dojo boot camp, I was looking for projects that would give a quick return on the investment in training. Since there are reports that Amazon generates about 30% of its revenue from product recommendations, I followed their lead and built our own recommendation engine.

This is the first piece: it pulls in the data, measures the relationships, and writes them back out to your database. The supporting pieces on building the database and the query that returns the recommended products will be coming later. Also coming up is some website scraping in Python to pull top-selling products from others and how to include them to boost your recommendations.


#Author:      Todd Palecek (toddpalecek@gmail.com)
#Description: Determine Product Affinity from a Market Basket
#             using Apriori to feed a SQL Recommendation Engine


library(RODBC)
library(stats)
library(reshape2)
library(arules)
library(arulesViz)
library(config)


#Definitions:
#Support: The fraction of which our item set occurs in our dataset.
#Confidence: probability that a rule is correct for a new transaction with items on the left.
#Lift: The ratio by which the confidence of a rule exceeds the expected confidence. 
#Note: if the lift is 1 it indicates that the items on the left and right are independent.


#variable for report name and key to write to SQL
var.RptType <- 'MktBskt-60'
var.Link <- 'DC1-DWDB1.Silvermine.dbo.DimProduct.SKU'  


#memory cleanup - may not be needed but can't hurt
gc()

#check and increase memory limit
memory.limit()
memory.limit(size = 26000) #if lower than currently set will not decrease memory available


#import from SQL
cn2 <- odbcDriverConnect(connection="Driver={SQL Server Native Client 11.0};server=YourServer;database=DataWarehouse;trusted_connection=yes;")
cnWrite <- odbcDriverConnect(connection="Driver={SQL Server Native Client 11.0};server=YourServer;database=Grok;trusted_connection=yes;case=nochange")


#Import data to R from SQL Data Warehouse
#To include text in SQL statement need to format as \'text\'
#If using your data warehouse as source date is likely an integer
#Only including orders that have multiple items to limit data imported to only useful data
dataFetch2 <-sqlQuery(cn2, 'with OrderItems (SKU, OrderID,BrandKey,OrderMethodKey)
as
                      (SELECT distinct P.SKU, D.OrderID, D.BrandKey,D.OrderMethodKey
                      FROM [Silvermine].[ssas].[FactDemand] D 
                      join [Silvermine].[ssas].[DimProduct] P on D.ProductKey = P.ProductKey 
                      where  D.OrderReceivedDateKey between (SELECT YEAR(GETDATE()-60) * 10000 + MONTH(GETDATE()-60) * 100 + DAY(GETDATE()-60)) 
                      and (SELECT YEAR(GETDATE()-1) * 10000 + MONTH(GETDATE()-1) * 100 + DAY(GETDATE()-1))
                      )
                      select O.OrderID, O.SKU, O.BrandKey, O.OrderMethodKey
                      from OrderItems O
                      where O.BrandKey in (1) and O.OrderID in (select OrderID from OrderItems group by OrderID having COUNT(SKU) > 1)', rows_at_time=1000)


#close connection after not needed to free resources
close(cn2)

#memory cleanup - may not be needed but can't hurt
gc()


#copy to new dataframe for modifications
dataFetch3 <- dataFetch2


#extract unique Brand and Order Method values to loop through later
sku_list <- unique(dataFetch3$BrandKey)
ordermethod_list <- unique(dataFetch3$OrderMethodKey)


#create blank data frame to populate relationships into
rulestoDBALL <- data.frame()


#loop through each Brand and Order Method to build relationships
for (i in 1:length(sku_list)){
  dataFetch2 <- dataFetch3[which(dataFetch3$BrandKey==sku_list[i]),]
  dataFetch2$BrandKey <- NULL
  for (j in 1:length(ordermethod_list)){
    dataFetch4 <- dataFetch2[which(dataFetch2$OrderMethodKey==ordermethod_list[j]),]
    dataFetch4$OrderMethodKey <- NULL
    t <- dcast(dataFetch4, OrderID ~ SKU, drop = FALSE) #added drop = FALSE because of errors
    t[is.na(t)] <- 0
    
    gc()
    
    t1 <- as(t[,-1]>0,"transactions")
    
    #Get the rules, maxlen sets number of items in set - default is 10 - the more returned the 
    #more resource intensive the analysis will be. maxlen = 2 returns the top related products that
    #we will recommend and speeds processing of data
    rules <- apriori(t1, parameter = list(supp = 0.00001, conf = 0.05,maxlen=2))
    
    #Show only 2 digits
    options(digits=2)
    
    #convert rules to data frame in prep to write to DB
    rulestoDB <- as(rules, "data.frame");
    rulestoDB$brandkey <- (sku_list[i]) 
    rulestoDB$Link <- var.Link 
    rulestoDB$OrderMethod <- (ordermethod_list[j])
    rulestoDBALL <- rbind(rulestoDBALL,rulestoDB)
  }
}


#optional: review the rules created
inspect(rules[1:5])


#variable for report name and key to write to SQL
rulestoDBALL$AnalysisType <- print(var.RptType) 
rulestoDBALL$AnalysisDate <- as.Date(Sys.Date())
rulestoDBALL$ReportType <- print(var.RptType) 
rulestoDBALL$ReportDate <- as.Date(Sys.Date())


#Write data to SQL table. Further analysis is done within SQL.
sqlSave(cnWrite,rulestoDBALL,tablename = "stgR_MarketBasket",rownames=FALSE,append = TRUE, fast = FALSE )


#further reading
#http://www.salemmarafi.com/code/market-basket-analysis-with-r/

#Supporting SQL Tables
# USE [Grok]
# GO
# 
# SET ANSI_NULLS ON
# GO
# 
# SET QUOTED_IDENTIFIER ON
# GO
# 
# CREATE TABLE [dbo].[stgR_MarketBasket](
#   [rules] [varchar](250) NULL,
#   [support] [decimal](8, 7) NULL,
#   [confidence] [decimal](5, 4) NULL,
#   [lift] [int] NULL,
#   [brandkey] [int] NULL,
#   [Link] [varchar](250) NULL,
#   [OrderMethod] [int] NULL,
#   [AnalysisType] [varchar](50) NULL,
#   [AnalysisDate] [date] NULL,
#   [ReportType] [varchar](50) NULL,
#   [ReportDate] [date] NULL
# ) ON [PRIMARY]
# GO
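
The database design and the query that serves the recommendations are coming in the follow-up post. As a rough idea of what comes back, the rules column written by arules looks like '{SKU123} => {SKU456}', so a sketch of splitting it apart in SQL might look like this (the SKU values shown are placeholders, and this is not the final recommendation query):

--hedged sketch: split the rules column into the bought item and the recommended item
SELECT SUBSTRING(rules, 2, CHARINDEX('}', rules) - 2) AS BoughtSKU
      ,SUBSTRING(rules, CHARINDEX('{', rules, 2) + 1, LEN(rules) - CHARINDEX('{', rules, 2) - 1) AS RecommendedSKU
      ,support, confidence, lift, brandkey, OrderMethod
FROM [Grok].[dbo].[stgR_MarketBasket]
WHERE lift > 1  --only keep rules where the items are not independent
ORDER BY confidence DESC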

Thursday, August 23, 2018

SQL 2017 R Insert Into Database Table

I've recently been down the data science path, integrating R and some new measures into our analysis stack.

Moving from RStudio to SQL introduced some interesting problems in getting my scripts to work. The first few problems I ran into were structure, security, and then field types.

I overcame all those problems and came up with a script to calculate the slope of our sales over the previous x days and then use SQL 2017 R to write data to a database table.

Rather than use "WITH RESULT SETS" for the output, I inserted my results into a staging table. The staging table was necessary because of float and varchar types coming from the R script that I was unable to coerce into the correct format.

One thing to remember as you move to SQL R is that the data source is passed in separately, and its default name inside your script is "InputDataSet".
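
A stripped-down example of the shape (a sketch only, separate from the full script below): the T-SQL in @input_data_1 arrives inside the R script as a data frame named InputDataSet, and whatever you assign to OutputDataSet is returned as the result set, which WITH RESULT SETS can then describe.

EXECUTE sp_execute_external_script
   @language = N'R'
  ,@script = N'OutputDataSet <- data.frame(DoubledValue = InputDataSet$SomeValue * 2)'
  ,@input_data_1 = N'SELECT 21 AS SomeValue'
WITH RESULT SETS ((DoubledValue int));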

And here is my script to loop through all products and calculate the slope of sale quantity.



INSERT INTO [stgR_QtySlope] ([Key]
           ,[elasticity_RedRetail]
           ,[elasticity_DemandMerchSalePrice]
           ,[ttl_sales]
           ,[max_price]
           ,[min_price]
           ,[mean_price]
           ,[price_ct]
           ,[start_dt]
           ,[end_dt]
           ,[slope]
           ,[orderdays_ct]
           ,[brandkey]
           ,[Link]
           ,[AnalysisType]
           ,[AnalysisDate]
           ,[ReportType]
           ,[ReportDate])

execute sp_execute_external_script   
  @language = N'R'   
  ,@script = N'library(RODBC)
library(doParallel)
library(stats)
dataFetch2 <-InputDataSet
b <- dataFetch2
b$ProductKey <- trimws(b$ProductKey)
b$PerUnitPrice = b$RedRetailPrice  #used to calculate elasticity
sku_list <- unique(trimws(b$ProductKey))
b$FreeShipping <- NULL
b$OrderDays <- NULL
b$RedRetailPrice <- NULL
b$SaleQtyPerDay <- NULL
b$BrandKey <- NULL

elasticity_values <- data.frame()

for (i in 1:length(sku_list)){
  test_sku2 <- b[which(b$ProductKey==sku_list[i]),]
  #elasticity
  m2 <- lm(formula=SaleQty~PerUnitPrice, data= test_sku2)
  coeffs2 <- as.data.frame(m2[[1]])
  gradient2<- coeffs2[2,1]
  gradient_final2 <- ifelse(is.na(gradient2), 0, gradient2)
  mean_price <- sum(test_sku2$SaleQty * test_sku2$PerUnitPrice)/sum(test_sku2$SaleQty) #mean(test_sku2$PerUnitPrice)
  max.price <- max(test_sku2$PerUnitPrice)
  min.price <- min(test_sku2$PerUnitPrice)
  price.ct <- length(unique(test_sku2$PerUnitPrice))
  #elasticity DemandMerch sale price
  m3 <- lm(formula=SaleQty~DemandMerchSellingPrice, data= test_sku2)
  coeffs3 <- as.data.frame(m3[[1]])
  gradient3<- coeffs3[2,1]
  gradient_final3 <- ifelse(is.na(gradient3), 0, gradient3)
  mean_price2 <- sum(test_sku2$SaleQty * test_sku2$DemandMerchSellingPrice)/sum(test_sku2$SaleQty) #mean(test_sku2$PerUnitPrice)
  max.price2 <- max(test_sku2$DemandMerchSellingPrice)
  min.price2 <- min(test_sku2$DemandMerchSellingPrice)
  price.ct2 <- length(unique(test_sku2$DemandMerchSellingPrice))
  #combine to single day ignoring price for sales slope
  test_sku2$DemandMerchSellingPrice <- NULL
  test_sku2 <- aggregate(test_sku2$SaleQty, by=list(test_sku2$ProductKey,test_sku2$OrderReceivedDateKey),FUN=sum)
  names(test_sku2) <- c("ProductKey", "OrderReceivedDateKey","SaleQty")
  attach(test_sku2)
  test_sku2 <- test_sku2[order(OrderReceivedDateKey),]
  detach(test_sku2)
  test_sku2$ID <- seq.int(nrow(test_sku2))
  #sales slope
  m1 <- lm(formula=SaleQty~ID, data= test_sku2)
  coeffs <- as.data.frame(m1[[1]])
  gradient<- coeffs[2,1]
  gradient_final <- ifelse(is.na(gradient), 0, gradient)
  mean_sales <- mean(test_sku2$SaleQty)
  ttl.sales <- sum(test_sku2$SaleQty)
  orderdays.ct <- length(unique(test_sku2$OrderReceivedDateKey))
  brandkey <- 100
  end_dt <- max(test_sku2$OrderReceivedDateKey)
  start_dt <- min(test_sku2$OrderReceivedDateKey)
  slope <- gradient_final
  elasticity <- gradient_final2*mean_price/mean_sales
  elasticity2 <- gradient_final3*mean_price2/mean_sales
  sku_elasticity <- cbind(sku_list[i],as.numeric(elasticity),as.numeric(elasticity2),as.numeric(ttl.sales),as.numeric(max.price),as.numeric(min.price),as.numeric(mean_price),price.ct,start_dt,end_dt,as.numeric(slope),orderdays.ct,brandkey)
  elasticity_values <- rbind(elasticity_values,sku_elasticity)
}
colnames(elasticity_values)[colnames(elasticity_values)=="V1"] <- "ProductKey"
elasticity_values$Link <- "ProductKey"
elasticity_values$AnalysisType <- "Slope-7"
elasticity_values$AnalysisDate <- as.Date(Sys.Date())
elasticity_values$ReportType <- "Slope-7"
elasticity_values$ReportDate <- as.Date(Sys.Date())
OutputDataSet <- elasticity_values;'  
 ,@input_data_1 = N'With PriceElasticty (ProductKey,OrderReceivedDateKey, BrandKey, SaleQty,BlackRetailPrice,RedRetailPrice,DemandItemSoldPrice,
                       DemandMerchSalesTtl, DemandMerchSalesWFeeAndShipTtl, FreeShipping)
                      as
                      (
                      SELECT  D.[ProductKey], D.[OrderReceivedDateKey], D.[BrandKey], D.[SaleQty], D.BlackRetailPrice
                      ,case when D.RedRetailPrice = 0 then D.BlackRetailPrice else D.RedRetailPrice end RedRetailPrice
                      ,round(D.SaleAmt/D.SaleQty,2) DemandItemSoldPrice
                      ,round((D.SaleAmt - D.DiscountAmt + D.MiscAmt),2) DemandMerchSalesTtl
                      ,round((D.SaleAmt - D.DiscountAmt + D.MiscAmt+ProcessingFeeAmt+ShippingAmt),2) DemandMerchSalesWFeeAndShipTtl
                      ,case when ShippingAmt = 0 then 1 else 0 end FreeShipping
                      FROM [Demand] D
                      inner join [Product] P on D.ProductKey = P.ProductKey
                      where  D.OrderReceivedDateKey between (SELECT YEAR(GETDATE()-8) * 10000 + MONTH(GETDATE()-8) * 100 + DAY(GETDATE()-8))
                      and (SELECT YEAR(GETDATE()-1) * 10000 + MONTH(GETDATE()-1) * 100 + DAY(GETDATE()-1))
                      and D.IsCanceled = 0)
                      select ProductKey,FreeShipping, RedRetailPrice, BrandKey, OrderReceivedDateKey
                      ,count(distinct OrderReceivedDateKey) OrderDays
                      ,sum(SaleQty) SaleQty
                      ,cast(sum(SaleQty) as decimal(8,2))/cast(count(distinct OrderReceivedDateKey) as decimal(8,2)) SaleQtyPerDay
                      ,sum(DemandMerchSalesTtl)/sum(SaleQty) DemandMerchSellingPrice
                      from PriceElasticty
                      group by ProductKey, FreeShipping, RedRetailPrice, BrandKey,OrderReceivedDateKey
                      order by  FreeShipping, ProductKey, RedRetailPrice, BrandKey;'   

--move from STG to Permanent table
INSERT [R_QtySlope]
           ([Key]
           ,[elasticity_RedRetail]
           ,[elasticity_DemandMerchSalePrice]
           ,[ttl_sales]
           ,[max_price]
           ,[min_price]
           ,[mean_price]
           ,[price_ct]
           ,[start_dt]
           ,[end_dt]
           ,[slope]
           ,[orderdays_ct]
           ,[brandkey]
           ,[Link]
           ,[AnalysisType]
           ,[AnalysisDate]
           ,[ReportType]
           ,[ReportDate])
SELECT  [Key]
      ,cast(left([elasticity_RedRetail],8) as  [decimal](8, 2))
      ,cast(left([elasticity_DemandMerchSalePrice],8) as [decimal](8, 2))
      ,cast(left([ttl_sales],9) as  [decimal](9, 2))
      ,cast(left([max_price],7) as [decimal](7, 2))
      ,cast(left([min_price],7) as [decimal](7, 2))
      ,cast(left([mean_price],7) as [decimal](7, 2))
      ,[price_ct]
      ,[start_dt]
      ,[end_dt]
      ,cast(left([slope],7) as [decimal](7, 4))
      ,[orderdays_ct]
      ,[brandkey]
      ,[Link]
      ,[AnalysisType]
      ,getdate()
      ,[ReportType]
      ,getdate()
  FROM [stgR_QtySlope]


Tuesday, March 21, 2017

Microsoft Performance Dashboard Reports for AWS RDS

We have recently made our first foray into AWS for running SQL server. My first impression was, "What happened to my sa account!?!?".

After getting over the shock of not having full rights to perform monitoring and maintenance of the server, I started installing our standard tools to see which would work. By putting the stored procs in our DBAWORK database, I was able to get everything running except for seeing the detail on running processes.

I then remembered an excellent tool that I have used before, Microsoft SQL Server 2012 Performance Dashboard Reports.

Now I am excited: I can dig into the performance of the server and share my findings with the rest of the team with pretty graphs and tables. I ran the install, repointing the stored procs to DBAWORK, opened the base report, and... NOTHING.

The problem is that the install scripts and reports all use msdb as their install database. A quick review of the code and report definitions shows that it should be simple to change the database and get everything working.

To save you the suspense, it worked. If you want to add a DBAWORK database, you can use my modified scripts and reports as is. If you want to customize for your needs, replace DBAWORK in all the attached files with your preferred database.


Monday, November 7, 2016

Change Power BI Report from DirectQuery to Import

I have a report that I wanted to quickly publish and share with our CEO as a Power BI dashboard. The problem is that we do not have a gateway built yet to give him access to the data. I could not find an easy way to change the report from DirectQuery to Import before publishing it so he could view and interact with the data. Here is the workaround I found.

1. Go to the Edit Queries section.
2. Select any of your columns, Right Click, and select Replace Values.
3. Enter any values in the Value To Find and Replace With boxes and click OK.
4. You will then see a yellow bar across the top reading "This step results in a query that is not supported in DirectQuery mode." Click on the connection settings and change to Import.
5. After making the change, delete the Replaced Value step in the Applied Steps section on the right of the screen.

You have now converted your report from DirectQuery to Import so it can be published and shared without setting up the gateway.