Project 25

Compensation Benchmarking

Ensure competitive and equitable pay by analyzing internal salary data against market benchmarks, calculating compa-ratios, and identifying pay equity gaps across the organization.

Project Overview

Scenario: You are a compensation analyst at a SaaS company (1,800 employees) preparing for the annual compensation review cycle. The CFO has approved a 5% merit increase budget, but the comp team needs data to allocate it strategically. Recent exit interviews flagged below-market pay as a top reason for attrition, especially in Engineering and Product roles. The team also needs to ensure internal equity -- that employees in similar roles with similar experience are paid comparably regardless of demographics.

Objective: Compare internal salaries to market data, calculate compa-ratios for every employee, identify roles where the company is significantly below market, and detect internal equity gaps that need correction.

Tools: SQL (PostgreSQL), Python (pandas, matplotlib, scikit-learn), Microsoft Excel

Dataset Description

Practice with Real Data

Dataset: Salary Data (Kaggle)

Salary records across job titles, years of experience, education levels, age, and gender. Use it for compensation benchmarking and salary prediction models.

Free Kaggle account required to download. The column names may differ slightly from the examples below — adapt the queries and code to match your downloaded data.

Table: employees

ColumnTypeDescription
employee_idINT (PK)Unique employee identifier
full_nameVARCHAR(100)Employee full name
departmentVARCHAR(50)Department name
role_titleVARCHAR(100)Job title
job_level_idINT (FK)Reference to job_levels
hire_dateDATEDate of hire
genderVARCHAR(20)Gender identity
locationVARCHAR(50)Office location
is_activeBOOLEANCurrent employment status

Table: salary

ColumnTypeDescription
salary_idINT (PK)Unique salary record
employee_idINT (FK)Reference to employees
base_salaryDECIMAL(10,2)Annual base salary
bonus_target_pctDECIMAL(4,2)Target bonus as % of base
equity_valueDECIMAL(10,2)Estimated annual equity value
effective_dateDATESalary effective date

Table: market_data

ColumnTypeDescription
market_idINT (PK)Unique market data record
role_familyVARCHAR(50)Job family (Engineering, Product, Sales, etc.)
job_level_idINT (FK)Reference to job_levels
locationVARCHAR(50)Geographic market
p25_salaryDECIMAL(10,2)25th percentile market salary
p50_salaryDECIMAL(10,2)50th percentile (median) market salary
p75_salaryDECIMAL(10,2)75th percentile market salary
survey_yearINTYear of market survey

Table: job_levels

ColumnTypeDescription
job_level_idINT (PK)Unique level identifier
level_nameVARCHAR(30)Level name (L1-L8)
level_descriptionVARCHAR(100)Description (e.g., Junior IC, Senior IC, Manager)
role_familyVARCHAR(50)Job family this level applies to

Table: benefits

ColumnTypeDescription
benefit_idINT (PK)Unique benefit record
employee_idINT (FK)Reference to employees
benefit_typeVARCHAR(50)Health, 401k, PTO, etc.
employer_costDECIMAL(10,2)Annual employer cost for this benefit

SQL Analysis

Query 1: Compa-Ratio Calculation for All Employees

Calculate each employee's compa-ratio (actual salary / market midpoint) to show how they are positioned relative to the market. A compa-ratio of 1.0 means at-market, below 0.9 is significantly below market.

WITH current_salary AS (
    SELECT DISTINCT ON (s.employee_id)
        s.employee_id, s.base_salary, s.bonus_target_pct, s.equity_value
    FROM salary s
    ORDER BY s.employee_id, s.effective_date DESC
)
SELECT
    e.employee_id,
    e.full_name,
    e.department,
    e.role_title,
    jl.level_name,
    cs.base_salary,
    md.p50_salary AS market_midpoint,
    ROUND(cs.base_salary / NULLIF(md.p50_salary, 0), 3) AS compa_ratio,
    ROUND(md.p50_salary - cs.base_salary, 0) AS gap_to_midpoint,
    CASE
        WHEN cs.base_salary / NULLIF(md.p50_salary, 0) < 0.85 THEN 'Critical - Far Below Market'
        WHEN cs.base_salary / NULLIF(md.p50_salary, 0) < 0.90 THEN 'Below Market'
        WHEN cs.base_salary / NULLIF(md.p50_salary, 0) < 1.00 THEN 'Approaching Market'
        WHEN cs.base_salary / NULLIF(md.p50_salary, 0) <= 1.10 THEN 'At Market'
        ELSE 'Above Market'
    END AS market_position
FROM employees e
JOIN current_salary cs ON e.employee_id = cs.employee_id
JOIN job_levels jl ON e.job_level_id = jl.job_level_id
LEFT JOIN market_data md ON jl.job_level_id = md.job_level_id
    AND jl.role_family = md.role_family
    AND e.location = md.location
    AND md.survey_year = 2025
WHERE e.is_active = TRUE
ORDER BY compa_ratio ASC;

Query 2: Salary Distribution by Role Family and Level

Show the internal salary range alongside market bands to identify where the company pays above or below market.

SELECT
    jl.role_family,
    jl.level_name,
    COUNT(e.employee_id) AS headcount,
    ROUND(MIN(cs.base_salary), 0) AS internal_min,
    ROUND(PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY cs.base_salary), 0) AS internal_p25,
    ROUND(PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY cs.base_salary), 0) AS internal_median,
    ROUND(PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY cs.base_salary), 0) AS internal_p75,
    ROUND(MAX(cs.base_salary), 0) AS internal_max,
    ROUND(AVG(md.p25_salary), 0) AS market_p25,
    ROUND(AVG(md.p50_salary), 0) AS market_p50,
    ROUND(AVG(md.p75_salary), 0) AS market_p75,
    ROUND(
        AVG(PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY cs.base_salary))
        / NULLIF(AVG(md.p50_salary), 0), 3
    ) AS dept_compa_ratio
FROM employees e
JOIN (
    SELECT DISTINCT ON (employee_id) employee_id, base_salary
    FROM salary ORDER BY employee_id, effective_date DESC
) cs ON e.employee_id = cs.employee_id
JOIN job_levels jl ON e.job_level_id = jl.job_level_id
LEFT JOIN market_data md ON jl.job_level_id = md.job_level_id
    AND jl.role_family = md.role_family AND md.survey_year = 2025
WHERE e.is_active = TRUE
GROUP BY jl.role_family, jl.level_name, jl.job_level_id
ORDER BY jl.role_family, jl.job_level_id;

Query 3: Internal Equity Gaps (Same Role, Different Pay)

Identify cases where employees in the same role and level have large pay differences that may indicate equity issues.

WITH role_stats AS (
    SELECT
        e.role_title,
        jl.level_name,
        e.department,
        AVG(cs.base_salary) AS avg_salary,
        STDDEV(cs.base_salary) AS salary_stddev,
        MIN(cs.base_salary) AS min_salary,
        MAX(cs.base_salary) AS max_salary,
        MAX(cs.base_salary) - MIN(cs.base_salary) AS salary_spread,
        COUNT(*) AS headcount
    FROM employees e
    JOIN (
        SELECT DISTINCT ON (employee_id) employee_id, base_salary
        FROM salary ORDER BY employee_id, effective_date DESC
    ) cs ON e.employee_id = cs.employee_id
    JOIN job_levels jl ON e.job_level_id = jl.job_level_id
    WHERE e.is_active = TRUE
    GROUP BY e.role_title, jl.level_name, e.department
    HAVING COUNT(*) >= 3
)
SELECT
    role_title,
    level_name,
    department,
    headcount,
    ROUND(avg_salary, 0) AS avg_salary,
    ROUND(min_salary, 0) AS min_salary,
    ROUND(max_salary, 0) AS max_salary,
    ROUND(salary_spread, 0) AS spread,
    ROUND(salary_spread * 100.0 / NULLIF(avg_salary, 0), 1) AS spread_pct,
    ROUND(salary_stddev, 0) AS stddev
FROM role_stats
WHERE salary_spread * 100.0 / NULLIF(avg_salary, 0) > 20
ORDER BY spread_pct DESC;

Query 4: Total Compensation Package Comparison

Compare total compensation (base + bonus + equity + benefits) to give a complete picture beyond base salary alone.

WITH current_comp AS (
    SELECT DISTINCT ON (s.employee_id)
        s.employee_id,
        s.base_salary,
        s.base_salary * s.bonus_target_pct / 100 AS target_bonus,
        s.equity_value
    FROM salary s
    ORDER BY s.employee_id, s.effective_date DESC
),
benefit_costs AS (
    SELECT employee_id, SUM(employer_cost) AS total_benefits
    FROM benefits
    GROUP BY employee_id
)
SELECT
    e.department,
    jl.level_name,
    COUNT(*) AS headcount,
    ROUND(AVG(cc.base_salary), 0) AS avg_base,
    ROUND(AVG(cc.target_bonus), 0) AS avg_bonus,
    ROUND(AVG(cc.equity_value), 0) AS avg_equity,
    ROUND(AVG(bc.total_benefits), 0) AS avg_benefits,
    ROUND(AVG(cc.base_salary + cc.target_bonus + cc.equity_value + COALESCE(bc.total_benefits, 0)), 0) AS avg_total_comp,
    ROUND(AVG(cc.base_salary) * 100.0 /
        NULLIF(AVG(cc.base_salary + cc.target_bonus + cc.equity_value + COALESCE(bc.total_benefits, 0)), 0), 1
    ) AS base_pct_of_total
FROM employees e
JOIN current_comp cc ON e.employee_id = cc.employee_id
JOIN job_levels jl ON e.job_level_id = jl.job_level_id
LEFT JOIN benefit_costs bc ON e.employee_id = bc.employee_id
WHERE e.is_active = TRUE
GROUP BY e.department, jl.level_name, jl.job_level_id
ORDER BY e.department, jl.job_level_id;

Python Analysis

Salary Regression Model

Build a regression model to predict expected salary based on role, level, tenure, location, and performance. Use residuals to identify employees who are significantly over- or under-paid relative to their predicted salary.

import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
import matplotlib.pyplot as plt

df = pd.read_csv('compensation_data.csv')
# Columns: employee_id, department, role_family, level_name, location,
#           tenure_years, last_rating, base_salary, gender

features = ['role_family', 'level_name', 'location', 'tenure_years', 'last_rating']
categorical = ['role_family', 'level_name', 'location']
numerical = ['tenure_years', 'last_rating']

X = df[features]
y = df['base_salary']

preprocessor = ColumnTransformer([
    ('cat', OneHotEncoder(drop='first', sparse_output=False), categorical),
    ('num', 'passthrough', numerical)
])

model = Pipeline([
    ('preprocessor', preprocessor),
    ('regressor', LinearRegression())
])

model.fit(X, y)
df['predicted_salary'] = model.predict(X)
df['residual'] = df['base_salary'] - df['predicted_salary']
df['residual_pct'] = (df['residual'] / df['predicted_salary'] * 100).round(1)

print(f"Model R-squared: {model.score(X, y):.3f}")
print(f"\nEmployees significantly underpaid (residual < -10%):")
underpaid = df[df['residual_pct'] < -10].sort_values('residual_pct')
print(underpaid[['employee_id', 'department', 'role_family', 'level_name',
                  'base_salary', 'predicted_salary', 'residual_pct']].head(15))

# Residual distribution plot
fig, ax = plt.subplots(figsize=(10, 5))
ax.hist(df['residual_pct'], bins=40, color='#3498db', edgecolor='black', alpha=0.7)
ax.axvline(0, color='red', linewidth=2, linestyle='--')
ax.axvline(-10, color='orange', linewidth=1.5, linestyle=':', label='Underpaid threshold (-10%)')
ax.axvline(10, color='green', linewidth=1.5, linestyle=':', label='Overpaid threshold (+10%)')
ax.set_xlabel('Salary Residual (%)')
ax.set_ylabel('Number of Employees')
ax.set_title('Distribution of Salary Residuals (Actual vs. Predicted)', fontsize=13)
ax.legend()
plt.tight_layout()
plt.savefig('salary_residuals.png', dpi=150)
plt.show()

Market Comparison Visualization

Compare internal salary ranges against market bands for each role family and level.

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

comp = pd.read_csv('salary_bands.csv')
# Columns: role_family, level_name, internal_median, market_p25, market_p50, market_p75

roles = comp['role_family'] + ' - ' + comp['level_name']
y_pos = np.arange(len(roles))

fig, ax = plt.subplots(figsize=(12, 8))

# Market band (P25 to P75)
ax.barh(y_pos, comp['market_p75'] - comp['market_p25'],
        left=comp['market_p25'], height=0.4, color='#bdc3c7',
        alpha=0.6, label='Market P25-P75 Range')

# Market midpoint
ax.scatter(comp['market_p50'], y_pos, color='black', s=80,
           zorder=5, label='Market Median (P50)', marker='|')

# Internal median
ax.scatter(comp['internal_median'], y_pos, color='#e74c3c', s=100,
           zorder=5, label='Internal Median', marker='D')

ax.set_yticks(y_pos)
ax.set_yticklabels(roles, fontsize=9)
ax.set_xlabel('Annual Base Salary ($)', fontsize=12)
ax.set_title('Internal Salary vs. Market Benchmark', fontsize=14)
ax.legend(loc='lower right')
ax.grid(axis='x', alpha=0.3)

# Format x-axis as currency
ax.xaxis.set_major_formatter(plt.FuncFormatter(lambda x, p: f'${x:,.0f}'))

plt.tight_layout()
plt.savefig('market_comparison.png', dpi=150)
plt.show()

Equity Analysis by Gender

After controlling for role, level, tenure, and location, test whether a gender pay gap persists.

import pandas as pd
import numpy as np
from scipy import stats

df = pd.read_csv('compensation_data.csv')

# Use the residuals from the salary model (which controls for role, level, etc.)
# A significant difference in residuals by gender = unexplained pay gap

male_residuals = df[df['gender'] == 'Male']['residual_pct']
female_residuals = df[df['gender'] == 'Female']['residual_pct']

t_stat, p_value = stats.ttest_ind(male_residuals, female_residuals)

print("Controlled Pay Gap Analysis (Residual-Based)")
print("=" * 50)
print(f"Male avg residual:   {male_residuals.mean():+.2f}%")
print(f"Female avg residual: {female_residuals.mean():+.2f}%")
print(f"Difference:          {male_residuals.mean() - female_residuals.mean():.2f}%")
print(f"T-statistic:         {t_stat:.3f}")
print(f"P-value:             {p_value:.4f}")
print(f"Significant (p<0.05): {'Yes' if p_value < 0.05 else 'No'}")

# Budget needed to close the gap
gap_employees = df[(df['gender'] == 'Female') & (df['residual_pct'] < -5)]
total_adjustment = (gap_employees['predicted_salary'] * 0.95 - gap_employees['base_salary']).clip(lower=0).sum()
print(f"\nEstimated budget to bring underpaid women to within 5% of predicted: ${total_adjustment:,.0f}")

Excel Dashboard

Build a compensation review workbook that the comp team uses during the annual cycle to make data-driven pay decisions.

  1. Import Data: Load SQL results into sheets: "Employee Comp", "Market Data", "Equity Gaps", and "Total Comp". Import the Python model outputs (predicted salary, residuals) as a separate sheet. Format all as Excel Tables.
  2. Salary Band Spreadsheet: Create a reference table with columns: Role Family, Level, Market P25, Market P50 (midpoint), Market P75. Define named ranges for each band. Add a "Band Penetration" column using: =(employee_salary - P25) / (P75 - P25). Values below 0 are below range; above 1 are above range.
  3. Compa-Ratio Dashboard: Create a PivotTable with Department and Level as rows, average compa-ratio as values. Apply Conditional Formatting (3-Color Scale: red at 0.85, yellow at 1.0, green at 1.10). Add a PivotChart showing compa-ratio distribution as a histogram. Include a slicer for Department.
  4. Market Comparison Chart: Build a combination chart for each role family: use floating bars (high-low-close chart) to show the P25-P75 market range, overlay markers for each employee's actual salary, and add a diamond marker for the internal median. This provides a powerful visual of market positioning.
  5. Equity Gap Table: Build a table of roles where the salary spread exceeds 20%. Use Conditional Formatting to highlight spread_pct values above 25% in red. Add a "Recommended Action" dropdown (Data Validation) with options: "Adjust Up", "Monitor", "No Action". Create a summary formula: =SUMPRODUCT((action="Adjust Up")*adjustment_amount) to estimate total budget impact.
  6. Merit Increase Calculator: Create a merit matrix with Performance Rating on one axis and Compa-Ratio band on the other. Populate cells with suggested increase percentages (e.g., low compa + high performance = 8%, high compa + average performance = 2%). Use INDEX/MATCH to look up each employee's suggested increase. Calculate: =base_salary * suggested_increase_pct. Sum to verify total stays within the 5% budget.
  7. Total Compensation View: Build a stacked bar chart showing the composition of total comp (base, bonus, equity, benefits) for each level. This helps leadership understand the full value proposition beyond base salary.
  8. Executive Summary: Create a one-page dashboard tab with: Overall Compa-Ratio, % of Employees Below Market, Estimated Equity Adjustment Budget, Merit Budget Utilization, and a waterfall chart showing how the merit budget flows from total pool to department allocations.

Key Insights

Below-Market Roles

23% of employees have a compa-ratio below 0.90. Engineering (L3-L5) and Product (L4-L5) are the most below-market, with median compa-ratios of 0.87 and 0.89 respectively -- directly correlating with the highest attrition departments.

Internal Equity

14 role-level combinations show salary spreads exceeding 25%, even after controlling for tenure and location. The largest gap is Senior Software Engineer (L4) with a 34% spread, requiring an estimated $180K in equity adjustments.

Total Comp Story

While base salaries trail the market by 6% on average, total compensation (including equity and benefits) is only 2% below market. Communicating the full package could reduce perceived pay gaps and improve retention.

Budget Allocation

The regression model identifies 142 employees needing priority adjustments totaling $890K. With a $1.6M merit budget (5% of payroll), allocating 56% to market corrections and 44% to merit-based increases optimizes both retention and performance incentives.

Knowledge Check

Question 1

An employee with a compa-ratio of 0.85 is:

Question 2

Why does the salary regression model use one-hot encoding with drop='first'?

Question 3

A salary residual of -12% from the regression model means:

Question 4

In the merit increase matrix, why should employees with high compa-ratios AND high performance receive moderate (not large) increases?

Question 5

The band penetration formula (salary - P25) / (P75 - P25) returns 0.3 for an employee. This means:

← Previous Project Next Project →