COHORT RETENTION METHOD USING SQL

3 min readNov 2, 2022

Hi, all data enthusiasts. Today I am going to show you an SQL data analysis using the Cohort retention method for online retailers.

Before Starting let us know what is Cohort Analysis.

Definition -

A Cohort Analysis is an analysis of several diff cohorts to get a better understanding of behaviors, patterns, and trends of a group of customers —

In Cohort Based Analysis We have 3 types -

Time-Based Cohort —

Size-Based Cohort —

Segment-Based Cohort —

It is simply a group of people with something in common

— We use this analysis to understand the behavior of customers. We can see patterns and trends in the group.

Let us jump to coding

--CLEANING DATA---- TOTAL RECORD -541909---- NO OF NULL IN CUSTOMER ID ==135080---- NO OF ROW WITH NO CUSTOMER ID = 406829--; with online_retail as ---CTE FUNCTION--(SELECT [InvoiceNo],[StockCode],[Description],[Quantity],[InvoiceDate],[UnitPrice],[CustomerID],[Country]FROM [PORTFOLIO PROJECT].[dbo].[Online_Retails]--WHERE CustomerID = 0WHERE CustomerID !=0),with_qty_unitprice as --CTE FUNCTION--(select * from online_retailwhere Quantity >0 and UnitPrice >0),-- RECORD WITH QTY AND UNIT PRICE APART FROM ZERO VALUE -  397882 ---- CHECKING FOR DUPLICATION using row number and partition by method--dup_check as(select * , ROW_NUMBER() over(partition by InvoiceNo,StockCode,Quantity order by InvoiceDate)dup_flagfrom with_qty_unitprice)select *into #online_retail_main from dup_check --temp table method--where dup_flag = 1 -- we have nonduplicate row of 392667---- let us see duplicate data----where dup_flag>1 -- we have total 5215 duplicate record--select * from#online_retail_main --Cleaned data --

We need a Unique identifier for Cohort analysis-

In this case will be CustomerID

We need the Initial Start date which will be InvoiceDate

We need Revenue data

Creating Cohort group

Select CustomerID, min(InvoiceDate) as first_purchase_date,DATEFROMPARTS(year(min(InvoiceDate)), MONTH(min(InvoiceDate)),1) as Cohort_Dateinto #cohort --temp table--from #online_retail_maingroup by CustomerIDselect *from #cohort

This Dataset is based on Time Based cohort —

Where we are grouping Customer based purchase dates for the first time and their behavior after their first purchase.

We can use Size based cohort where we can find patterns based on qty and grouping into buckets like smaller, medium, and higher patches.

But let’s focus on Time based cohort for now.

Cohort Index — It’s an integer representation of no of months passed since the customer’s first purchase.

Let’s the difference between month and year in the cohort index.

selectmdd.*,cohort_index = year_diff*12+mont_diff+1into #cohort_retentionfrom(selectmd.*,year_diff = Invoice_year-cohort_year,mont_diff = Invoice_month-cohort_monthfrom(selectm.*,c.Cohort_Date,YEAR(m.InvoiceDate) as Invoice_year,MONTH(m.InvoiceDate) as Invoice_month,YEAR(c.Cohort_Date) as cohort_year,MONTH(c.Cohort_Date) as cohort_monthfrom #online_retail_main as mleft join #cohort as con m.CustomerID=c.CustomerID) as md)mddselect *from #cohort_retentioselect *into #cohort_pivotfrom(select distinctCustomerID,Cohort_Date,cohort_indexfrom #cohort_retention) as tblpivot (count(CustomerID)for cohort_index in([1],[2],[3],[4],[5],[6],[7],[8],[9],[10],[11],[12],[13])) as pivot_table-- the below query are shown in Percentage-select Cohort_Date,1.0*[1]/[1]*100 as[1],1.0*[2]/[1] *100 as [2],1.0*[3]/[1]*100 as [3],1.0*[4]/[1]*100 as [4],1.0*[5]/[1]*100 as [5],1.0*[6]/[1]*100 as [6],1.0*[7]/[1]*100 as [7],1.0*[8]/[1]*100 as [8],1.0*[9]/[1]*100as [9],1.0*[10]/[1]*100 as [10],1.0*[11]/[1]*100 as [11],1.0*[12]/[1]*100 as [12],1.0*[13]/[1]*100 as [13]from #cohort_pivotorder by Cohort_Date

Let's Extract this data and visualize the same in Tableau —

The above data is represented in both value and percentage in which it can be seen that there is a drop-in customer purchase month on month.

Thanks for reading. I welcome any comment on this article

COHORT RETENTION METHOD USING SQL

Written by Rakesh Sethu NP

No responses yet