Compensation Benchmarking
Ensure competitive and equitable pay by analyzing internal salary data against market benchmarks, calculating compa-ratios, and identifying pay equity gaps across the organization.
Project Overview
Scenario: You are a compensation analyst at a SaaS company (1,800 employees) preparing for the annual compensation review cycle. The CFO has approved a 5% merit increase budget, but the comp team needs data to allocate it strategically. Recent exit interviews flagged below-market pay as a top reason for attrition, especially in Engineering and Product roles. The team also needs to ensure internal equity -- that employees in similar roles with similar experience are paid comparably regardless of demographics.
Objective: Compare internal salaries to market data, calculate compa-ratios for every employee, identify roles where the company is significantly below market, and detect internal equity gaps that need correction.
Tools: SQL (PostgreSQL), Python (pandas, matplotlib, scikit-learn), Microsoft Excel
Dataset Description
Practice with Real Data
Dataset: Salary Data (Kaggle)
Salary records across job titles, years of experience, education levels, age, and gender. Use it for compensation benchmarking and salary prediction models.
Free Kaggle account required to download. The column names may differ slightly from the examples below — adapt the queries and code to match your downloaded data.
Table: employees
| Column | Type | Description |
|---|---|---|
| employee_id | INT (PK) | Unique employee identifier |
| full_name | VARCHAR(100) | Employee full name |
| department | VARCHAR(50) | Department name |
| role_title | VARCHAR(100) | Job title |
| job_level_id | INT (FK) | Reference to job_levels |
| hire_date | DATE | Date of hire |
| gender | VARCHAR(20) | Gender identity |
| location | VARCHAR(50) | Office location |
| is_active | BOOLEAN | Current employment status |
Table: salary
| Column | Type | Description |
|---|---|---|
| salary_id | INT (PK) | Unique salary record |
| employee_id | INT (FK) | Reference to employees |
| base_salary | DECIMAL(10,2) | Annual base salary |
| bonus_target_pct | DECIMAL(4,2) | Target bonus as % of base |
| equity_value | DECIMAL(10,2) | Estimated annual equity value |
| effective_date | DATE | Salary effective date |
Table: market_data
| Column | Type | Description |
|---|---|---|
| market_id | INT (PK) | Unique market data record |
| role_family | VARCHAR(50) | Job family (Engineering, Product, Sales, etc.) |
| job_level_id | INT (FK) | Reference to job_levels |
| location | VARCHAR(50) | Geographic market |
| p25_salary | DECIMAL(10,2) | 25th percentile market salary |
| p50_salary | DECIMAL(10,2) | 50th percentile (median) market salary |
| p75_salary | DECIMAL(10,2) | 75th percentile market salary |
| survey_year | INT | Year of market survey |
Table: job_levels
| Column | Type | Description |
|---|---|---|
| job_level_id | INT (PK) | Unique level identifier |
| level_name | VARCHAR(30) | Level name (L1-L8) |
| level_description | VARCHAR(100) | Description (e.g., Junior IC, Senior IC, Manager) |
| role_family | VARCHAR(50) | Job family this level applies to |
Table: benefits
| Column | Type | Description |
|---|---|---|
| benefit_id | INT (PK) | Unique benefit record |
| employee_id | INT (FK) | Reference to employees |
| benefit_type | VARCHAR(50) | Health, 401k, PTO, etc. |
| employer_cost | DECIMAL(10,2) | Annual employer cost for this benefit |
SQL Analysis
Query 1: Compa-Ratio Calculation for All Employees
Calculate each employee's compa-ratio (actual salary / market midpoint) to show how they are positioned relative to the market. A compa-ratio of 1.0 means at-market, below 0.9 is significantly below market.
WITH current_salary AS (
SELECT DISTINCT ON (s.employee_id)
s.employee_id, s.base_salary, s.bonus_target_pct, s.equity_value
FROM salary s
ORDER BY s.employee_id, s.effective_date DESC
)
SELECT
e.employee_id,
e.full_name,
e.department,
e.role_title,
jl.level_name,
cs.base_salary,
md.p50_salary AS market_midpoint,
ROUND(cs.base_salary / NULLIF(md.p50_salary, 0), 3) AS compa_ratio,
ROUND(md.p50_salary - cs.base_salary, 0) AS gap_to_midpoint,
CASE
WHEN cs.base_salary / NULLIF(md.p50_salary, 0) < 0.85 THEN 'Critical - Far Below Market'
WHEN cs.base_salary / NULLIF(md.p50_salary, 0) < 0.90 THEN 'Below Market'
WHEN cs.base_salary / NULLIF(md.p50_salary, 0) < 1.00 THEN 'Approaching Market'
WHEN cs.base_salary / NULLIF(md.p50_salary, 0) <= 1.10 THEN 'At Market'
ELSE 'Above Market'
END AS market_position
FROM employees e
JOIN current_salary cs ON e.employee_id = cs.employee_id
JOIN job_levels jl ON e.job_level_id = jl.job_level_id
LEFT JOIN market_data md ON jl.job_level_id = md.job_level_id
AND jl.role_family = md.role_family
AND e.location = md.location
AND md.survey_year = 2025
WHERE e.is_active = TRUE
ORDER BY compa_ratio ASC;
Query 2: Salary Distribution by Role Family and Level
Show the internal salary range alongside market bands to identify where the company pays above or below market.
SELECT
jl.role_family,
jl.level_name,
COUNT(e.employee_id) AS headcount,
ROUND(MIN(cs.base_salary), 0) AS internal_min,
ROUND(PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY cs.base_salary), 0) AS internal_p25,
ROUND(PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY cs.base_salary), 0) AS internal_median,
ROUND(PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY cs.base_salary), 0) AS internal_p75,
ROUND(MAX(cs.base_salary), 0) AS internal_max,
ROUND(AVG(md.p25_salary), 0) AS market_p25,
ROUND(AVG(md.p50_salary), 0) AS market_p50,
ROUND(AVG(md.p75_salary), 0) AS market_p75,
ROUND(
AVG(PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY cs.base_salary))
/ NULLIF(AVG(md.p50_salary), 0), 3
) AS dept_compa_ratio
FROM employees e
JOIN (
SELECT DISTINCT ON (employee_id) employee_id, base_salary
FROM salary ORDER BY employee_id, effective_date DESC
) cs ON e.employee_id = cs.employee_id
JOIN job_levels jl ON e.job_level_id = jl.job_level_id
LEFT JOIN market_data md ON jl.job_level_id = md.job_level_id
AND jl.role_family = md.role_family AND md.survey_year = 2025
WHERE e.is_active = TRUE
GROUP BY jl.role_family, jl.level_name, jl.job_level_id
ORDER BY jl.role_family, jl.job_level_id;
Query 3: Internal Equity Gaps (Same Role, Different Pay)
Identify cases where employees in the same role and level have large pay differences that may indicate equity issues.
WITH role_stats AS (
SELECT
e.role_title,
jl.level_name,
e.department,
AVG(cs.base_salary) AS avg_salary,
STDDEV(cs.base_salary) AS salary_stddev,
MIN(cs.base_salary) AS min_salary,
MAX(cs.base_salary) AS max_salary,
MAX(cs.base_salary) - MIN(cs.base_salary) AS salary_spread,
COUNT(*) AS headcount
FROM employees e
JOIN (
SELECT DISTINCT ON (employee_id) employee_id, base_salary
FROM salary ORDER BY employee_id, effective_date DESC
) cs ON e.employee_id = cs.employee_id
JOIN job_levels jl ON e.job_level_id = jl.job_level_id
WHERE e.is_active = TRUE
GROUP BY e.role_title, jl.level_name, e.department
HAVING COUNT(*) >= 3
)
SELECT
role_title,
level_name,
department,
headcount,
ROUND(avg_salary, 0) AS avg_salary,
ROUND(min_salary, 0) AS min_salary,
ROUND(max_salary, 0) AS max_salary,
ROUND(salary_spread, 0) AS spread,
ROUND(salary_spread * 100.0 / NULLIF(avg_salary, 0), 1) AS spread_pct,
ROUND(salary_stddev, 0) AS stddev
FROM role_stats
WHERE salary_spread * 100.0 / NULLIF(avg_salary, 0) > 20
ORDER BY spread_pct DESC;
Query 4: Total Compensation Package Comparison
Compare total compensation (base + bonus + equity + benefits) to give a complete picture beyond base salary alone.
WITH current_comp AS (
SELECT DISTINCT ON (s.employee_id)
s.employee_id,
s.base_salary,
s.base_salary * s.bonus_target_pct / 100 AS target_bonus,
s.equity_value
FROM salary s
ORDER BY s.employee_id, s.effective_date DESC
),
benefit_costs AS (
SELECT employee_id, SUM(employer_cost) AS total_benefits
FROM benefits
GROUP BY employee_id
)
SELECT
e.department,
jl.level_name,
COUNT(*) AS headcount,
ROUND(AVG(cc.base_salary), 0) AS avg_base,
ROUND(AVG(cc.target_bonus), 0) AS avg_bonus,
ROUND(AVG(cc.equity_value), 0) AS avg_equity,
ROUND(AVG(bc.total_benefits), 0) AS avg_benefits,
ROUND(AVG(cc.base_salary + cc.target_bonus + cc.equity_value + COALESCE(bc.total_benefits, 0)), 0) AS avg_total_comp,
ROUND(AVG(cc.base_salary) * 100.0 /
NULLIF(AVG(cc.base_salary + cc.target_bonus + cc.equity_value + COALESCE(bc.total_benefits, 0)), 0), 1
) AS base_pct_of_total
FROM employees e
JOIN current_comp cc ON e.employee_id = cc.employee_id
JOIN job_levels jl ON e.job_level_id = jl.job_level_id
LEFT JOIN benefit_costs bc ON e.employee_id = bc.employee_id
WHERE e.is_active = TRUE
GROUP BY e.department, jl.level_name, jl.job_level_id
ORDER BY e.department, jl.job_level_id;
Python Analysis
Salary Regression Model
Build a regression model to predict expected salary based on role, level, tenure, location, and performance. Use residuals to identify employees who are significantly over- or under-paid relative to their predicted salary.
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
import matplotlib.pyplot as plt
df = pd.read_csv('compensation_data.csv')
# Columns: employee_id, department, role_family, level_name, location,
# tenure_years, last_rating, base_salary, gender
features = ['role_family', 'level_name', 'location', 'tenure_years', 'last_rating']
categorical = ['role_family', 'level_name', 'location']
numerical = ['tenure_years', 'last_rating']
X = df[features]
y = df['base_salary']
preprocessor = ColumnTransformer([
('cat', OneHotEncoder(drop='first', sparse_output=False), categorical),
('num', 'passthrough', numerical)
])
model = Pipeline([
('preprocessor', preprocessor),
('regressor', LinearRegression())
])
model.fit(X, y)
df['predicted_salary'] = model.predict(X)
df['residual'] = df['base_salary'] - df['predicted_salary']
df['residual_pct'] = (df['residual'] / df['predicted_salary'] * 100).round(1)
print(f"Model R-squared: {model.score(X, y):.3f}")
print(f"\nEmployees significantly underpaid (residual < -10%):")
underpaid = df[df['residual_pct'] < -10].sort_values('residual_pct')
print(underpaid[['employee_id', 'department', 'role_family', 'level_name',
'base_salary', 'predicted_salary', 'residual_pct']].head(15))
# Residual distribution plot
fig, ax = plt.subplots(figsize=(10, 5))
ax.hist(df['residual_pct'], bins=40, color='#3498db', edgecolor='black', alpha=0.7)
ax.axvline(0, color='red', linewidth=2, linestyle='--')
ax.axvline(-10, color='orange', linewidth=1.5, linestyle=':', label='Underpaid threshold (-10%)')
ax.axvline(10, color='green', linewidth=1.5, linestyle=':', label='Overpaid threshold (+10%)')
ax.set_xlabel('Salary Residual (%)')
ax.set_ylabel('Number of Employees')
ax.set_title('Distribution of Salary Residuals (Actual vs. Predicted)', fontsize=13)
ax.legend()
plt.tight_layout()
plt.savefig('salary_residuals.png', dpi=150)
plt.show()
Market Comparison Visualization
Compare internal salary ranges against market bands for each role family and level.
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
comp = pd.read_csv('salary_bands.csv')
# Columns: role_family, level_name, internal_median, market_p25, market_p50, market_p75
roles = comp['role_family'] + ' - ' + comp['level_name']
y_pos = np.arange(len(roles))
fig, ax = plt.subplots(figsize=(12, 8))
# Market band (P25 to P75)
ax.barh(y_pos, comp['market_p75'] - comp['market_p25'],
left=comp['market_p25'], height=0.4, color='#bdc3c7',
alpha=0.6, label='Market P25-P75 Range')
# Market midpoint
ax.scatter(comp['market_p50'], y_pos, color='black', s=80,
zorder=5, label='Market Median (P50)', marker='|')
# Internal median
ax.scatter(comp['internal_median'], y_pos, color='#e74c3c', s=100,
zorder=5, label='Internal Median', marker='D')
ax.set_yticks(y_pos)
ax.set_yticklabels(roles, fontsize=9)
ax.set_xlabel('Annual Base Salary ($)', fontsize=12)
ax.set_title('Internal Salary vs. Market Benchmark', fontsize=14)
ax.legend(loc='lower right')
ax.grid(axis='x', alpha=0.3)
# Format x-axis as currency
ax.xaxis.set_major_formatter(plt.FuncFormatter(lambda x, p: f'${x:,.0f}'))
plt.tight_layout()
plt.savefig('market_comparison.png', dpi=150)
plt.show()
Equity Analysis by Gender
After controlling for role, level, tenure, and location, test whether a gender pay gap persists.
import pandas as pd
import numpy as np
from scipy import stats
df = pd.read_csv('compensation_data.csv')
# Use the residuals from the salary model (which controls for role, level, etc.)
# A significant difference in residuals by gender = unexplained pay gap
male_residuals = df[df['gender'] == 'Male']['residual_pct']
female_residuals = df[df['gender'] == 'Female']['residual_pct']
t_stat, p_value = stats.ttest_ind(male_residuals, female_residuals)
print("Controlled Pay Gap Analysis (Residual-Based)")
print("=" * 50)
print(f"Male avg residual: {male_residuals.mean():+.2f}%")
print(f"Female avg residual: {female_residuals.mean():+.2f}%")
print(f"Difference: {male_residuals.mean() - female_residuals.mean():.2f}%")
print(f"T-statistic: {t_stat:.3f}")
print(f"P-value: {p_value:.4f}")
print(f"Significant (p<0.05): {'Yes' if p_value < 0.05 else 'No'}")
# Budget needed to close the gap
gap_employees = df[(df['gender'] == 'Female') & (df['residual_pct'] < -5)]
total_adjustment = (gap_employees['predicted_salary'] * 0.95 - gap_employees['base_salary']).clip(lower=0).sum()
print(f"\nEstimated budget to bring underpaid women to within 5% of predicted: ${total_adjustment:,.0f}")
Excel Dashboard
Build a compensation review workbook that the comp team uses during the annual cycle to make data-driven pay decisions.
- Import Data: Load SQL results into sheets: "Employee Comp", "Market Data", "Equity Gaps", and "Total Comp". Import the Python model outputs (predicted salary, residuals) as a separate sheet. Format all as Excel Tables.
- Salary Band Spreadsheet: Create a reference table with columns: Role Family, Level, Market P25, Market P50 (midpoint), Market P75. Define named ranges for each band. Add a "Band Penetration" column using:
=(employee_salary - P25) / (P75 - P25). Values below 0 are below range; above 1 are above range. - Compa-Ratio Dashboard: Create a PivotTable with Department and Level as rows, average compa-ratio as values. Apply Conditional Formatting (3-Color Scale: red at 0.85, yellow at 1.0, green at 1.10). Add a PivotChart showing compa-ratio distribution as a histogram. Include a slicer for Department.
- Market Comparison Chart: Build a combination chart for each role family: use floating bars (high-low-close chart) to show the P25-P75 market range, overlay markers for each employee's actual salary, and add a diamond marker for the internal median. This provides a powerful visual of market positioning.
- Equity Gap Table: Build a table of roles where the salary spread exceeds 20%. Use Conditional Formatting to highlight spread_pct values above 25% in red. Add a "Recommended Action" dropdown (Data Validation) with options: "Adjust Up", "Monitor", "No Action". Create a summary formula:
=SUMPRODUCT((action="Adjust Up")*adjustment_amount)to estimate total budget impact. - Merit Increase Calculator: Create a merit matrix with Performance Rating on one axis and Compa-Ratio band on the other. Populate cells with suggested increase percentages (e.g., low compa + high performance = 8%, high compa + average performance = 2%). Use INDEX/MATCH to look up each employee's suggested increase. Calculate:
=base_salary * suggested_increase_pct. Sum to verify total stays within the 5% budget. - Total Compensation View: Build a stacked bar chart showing the composition of total comp (base, bonus, equity, benefits) for each level. This helps leadership understand the full value proposition beyond base salary.
- Executive Summary: Create a one-page dashboard tab with: Overall Compa-Ratio, % of Employees Below Market, Estimated Equity Adjustment Budget, Merit Budget Utilization, and a waterfall chart showing how the merit budget flows from total pool to department allocations.
Key Insights
Below-Market Roles
23% of employees have a compa-ratio below 0.90. Engineering (L3-L5) and Product (L4-L5) are the most below-market, with median compa-ratios of 0.87 and 0.89 respectively -- directly correlating with the highest attrition departments.
Internal Equity
14 role-level combinations show salary spreads exceeding 25%, even after controlling for tenure and location. The largest gap is Senior Software Engineer (L4) with a 34% spread, requiring an estimated $180K in equity adjustments.
Total Comp Story
While base salaries trail the market by 6% on average, total compensation (including equity and benefits) is only 2% below market. Communicating the full package could reduce perceived pay gaps and improve retention.
Budget Allocation
The regression model identifies 142 employees needing priority adjustments totaling $890K. With a $1.6M merit budget (5% of payroll), allocating 56% to market corrections and 44% to merit-based increases optimizes both retention and performance incentives.
Knowledge Check
An employee with a compa-ratio of 0.85 is:
Why does the salary regression model use one-hot encoding with drop='first'?
A salary residual of -12% from the regression model means:
In the merit increase matrix, why should employees with high compa-ratios AND high performance receive moderate (not large) increases?
The band penetration formula (salary - P25) / (P75 - P25) returns 0.3 for an employee. This means: