Chapter 4

Aggregate Functions & GROUP BY

Learn how to summarize and analyze data using SQL's powerful aggregate functions. Master COUNT, SUM, AVG, MIN, MAX, GROUP BY, and HAVING to transform raw data into meaningful insights.

What are Aggregate Functions?

Aggregate functions perform calculations on multiple rows of data and return a single result. They're essential for data analysis, reporting, and business intelligence.

Real-Life Analogy: The Report Card

Imagine you're a teacher with 100 students. You have all their individual test scores, but you need to summarize them:

  • COUNT: How many students took the test? (100 students)
  • SUM: What's the total of all scores combined? (8,450 points)
  • AVG: What's the average score? (84.5 points)
  • MAX: What's the highest score? (98 points)
  • MIN: What's the lowest score? (62 points)

Instead of looking at 100 individual scores, aggregate functions give you the summary statistics you need!

🔍 Behind the Scenes: How Aggregates Work

When you run an aggregate function, the database:

  1. Scans all rows that match your WHERE conditions (if any)
  2. Collects the values from the specified column
  3. Performs the calculation (count, sum, average, etc.)
  4. Returns a single aggregated result

This happens incredibly fast, even on millions of rows, thanks to database optimization!

COUNT()

Counts the number of rows

Use case: "How many customers do we have?"

SUM()

Adds up numeric values

Use case: "What's our total revenue?"

AVG()

Calculates the average

Use case: "What's the average order value?"

MIN()

Finds the minimum value

Use case: "What's our lowest price?"

MAX()

Finds the maximum value

Use case: "What's our highest salary?"

GROUP BY

Groups rows for aggregation

Use case: "Sales by department"

COUNT() Function

The COUNT() function returns the number of rows that match your criteria. It's one of the most frequently used aggregate functions.

COUNT() Syntax:

COUNT(*)              -- Counts all rows
COUNT(column_name)    -- Counts non-NULL values in a column
COUNT(DISTINCT column) -- Counts unique non-NULL values
  • COUNT(*): Counts all rows, including those with NULL values
  • COUNT(column): Counts only rows where the column is NOT NULL
  • COUNT(DISTINCT column): Counts unique values only

Try It: Count All Employees

How many employees work at our company?

Output will appear here...

Try It: Count Employees by Department

How many employees are in Engineering?

Output will appear here...

Try It: Count Distinct Departments

How many unique departments do we have?

Output will appear here...

🔍 COUNT(*) vs COUNT(column) - What's the Difference?

COUNT(*):

  • Counts every row in the result set
  • Includes rows where all columns are NULL
  • Generally faster because it doesn't need to check column values

COUNT(column_name):

  • Counts only rows where the specified column is NOT NULL
  • Useful when you want to know how many rows have actual data in a specific column
  • Example: COUNT(email) tells you how many employees have an email address on file

Real-World Example: E-commerce Analytics

Imagine you run an online store. Here's how COUNT() helps:

-- How many orders did we receive today?
SELECT COUNT(*) AS orders_today
FROM orders
WHERE order_date = CURRENT_DATE;

-- How many unique customers made purchases this month?
SELECT COUNT(DISTINCT customer_id) AS unique_customers
FROM orders
WHERE order_date >= DATE('now', 'start of month');

-- How many products have reviews?
SELECT COUNT(review_text) AS products_with_reviews
FROM product_reviews;

SUM() Function

The SUM() function adds up all numeric values in a column. It's perfect for calculating totals like revenue, expenses, or quantities.

SUM() Syntax:

SUM(column_name)
  • Only works with numeric columns (INTEGER, DECIMAL, FLOAT, etc.)
  • Ignores NULL values automatically
  • Returns NULL if no rows match (not zero!)

Real-Life Analogy: Your Bank Account

Think of SUM() like calculating your total savings:

  • Checking account: $1,200
  • Savings account: $5,000
  • Investment account: $3,500

SUM() = $9,700 (total across all accounts)

In SQL, you'd be adding up rows instead of bank accounts!

Try It: Total Salary Expense

What's the total annual salary expense for all employees?

Output will appear here...

Try It: Sum with WHERE Condition

What's the total salary for the Engineering department?

Output will appear here...

Try It: Multiple Calculations

Let's calculate both total and average salary in one query:

Output will appear here...

🔍 Watch Out for NULL Values!

SUM() automatically ignores NULL values:

-- If you have these values: 100, 200, NULL, 300
-- SUM() returns: 600 (not 600 + NULL)
-- COUNT(*) returns: 4 rows
-- COUNT(column) returns: 3 rows (excludes NULL)

This is usually what you want, but be aware that NULL values don't affect the sum!

Real-World Example: Sales Dashboard

-- Total revenue for the year
SELECT SUM(order_total) AS annual_revenue
FROM orders
WHERE STRFTIME('%Y', order_date) = '2024';

-- Total quantity of products sold
SELECT SUM(quantity) AS total_units_sold
FROM order_items;

-- Revenue by product category
SELECT
    category,
    SUM(price * quantity) AS category_revenue
FROM order_items
JOIN products ON order_items.product_id = products.id
GROUP BY category;

AVG() Function

The AVG() function calculates the arithmetic mean (average) of numeric values. It's incredibly useful for understanding typical values in your data.

AVG() Syntax:

AVG(column_name)
  • Only works with numeric columns
  • Ignores NULL values (just like SUM)
  • Returns NULL if no rows match
  • Formula: AVG = SUM(values) / COUNT(non-null values)

Real-Life Analogy: Test Scores

You got these quiz scores: 85, 90, 88, 92

Average = (85 + 90 + 88 + 92) / 4 = 355 / 4 = 88.75

SQL's AVG() does exactly this calculation across all rows in your table!

Try It: Average Salary

What's the average salary across all employees?

Output will appear here...

Try It: Average by Department

What's the average salary in Sales?

Output will appear here...

Try It: Rounded Average

Let's round the average to 2 decimal places for cleaner results:

Output will appear here...

🔍 AVG() and NULL Values

Understanding how AVG() handles NULL is crucial:

-- Suppose you have these salaries: 50000, 60000, NULL, 80000
-- AVG() calculates: (50000 + 60000 + 80000) / 3 = 63,333
-- NOT: (50000 + 60000 + 0 + 80000) / 4 = 47,500

-- NULL is ignored entirely!
-- This is usually correct, but be aware of it.

To treat NULL as zero, use: AVG(COALESCE(salary, 0))

Real-World Example: Business Intelligence

-- Average customer order value
SELECT ROUND(AVG(order_total), 2) AS avg_order_value
FROM orders;

-- Average product rating
SELECT
    product_name,
    ROUND(AVG(rating), 1) AS avg_rating
FROM products
JOIN reviews ON products.id = reviews.product_id
GROUP BY product_name;

-- Average days to ship orders
SELECT AVG(JULIANDAY(shipped_date) - JULIANDAY(order_date)) AS avg_ship_days
FROM orders
WHERE shipped_date IS NOT NULL;

MIN() and MAX() Functions

The MIN() and MAX() functions find the smallest and largest values in a column. They work with numbers, dates, and even text!

MIN() and MAX() Syntax:

MIN(column_name)  -- Returns the smallest value
MAX(column_name)  -- Returns the largest value
  • Work with numeric, date/time, and text columns
  • Ignore NULL values
  • For text: MIN = alphabetically first, MAX = alphabetically last
  • For dates: MIN = earliest date, MAX = latest date

Real-Life Analogy: The Competition

Imagine a marathon race:

  • MIN(finish_time): Who finished first? (fastest time)
  • MAX(finish_time): Who finished last? (slowest time)
  • MIN(age): Who's the youngest runner?
  • MAX(age): Who's the oldest runner?

MIN and MAX help you find the extremes in your data!

Try It: Salary Range

What are the highest and lowest salaries in the company?

Output will appear here...

Try It: Salary Range with Context

Let's add more context with multiple aggregates:

Output will appear here...

Try It: MIN/MAX with Dates

Find the newest and oldest employees (by hire date):

Output will appear here...

🔍 MIN/MAX on Different Data Types

Numbers: Smallest/largest numeric value (10 < 100 < 1000)

Dates: Earliest/latest date (2020-01-01 < 2024-12-31)

Text: Alphabetical order (Apple < Banana < Cherry)

-- Numbers
MIN(price) -- Could return 9.99
MAX(price) -- Could return 999.99

-- Dates
MIN(order_date) -- Returns earliest order
MAX(order_date) -- Returns most recent order

-- Text (alphabetical)
MIN(name) -- Could return "Alice"
MAX(name) -- Could return "Zoe"

Real-World Example: Product Catalog

-- Price range for products
SELECT
    category,
    MIN(price) AS cheapest,
    MAX(price) AS most_expensive,
    MAX(price) - MIN(price) AS price_range
FROM products
GROUP BY category;

-- Date range of orders
SELECT
    MIN(order_date) AS first_order,
    MAX(order_date) AS latest_order,
    JULIANDAY(MAX(order_date)) - JULIANDAY(MIN(order_date)) AS days_in_business
FROM orders;

-- Inventory levels
SELECT
    MIN(stock_quantity) AS lowest_stock,
    MAX(stock_quantity) AS highest_stock
FROM products
WHERE stock_quantity > 0;

GROUP BY Clause

The GROUP BY clause is where aggregate functions become truly powerful! It groups rows with the same values so you can calculate aggregates for each group separately.

GROUP BY Syntax:

SELECT column1, aggregate_function(column2)
FROM table_name
GROUP BY column1;
  • Groups rows that have the same values in specified columns
  • Allows you to calculate aggregates per group
  • Columns in SELECT must be either in GROUP BY or inside an aggregate function

Real-Life Analogy: Sorting Laundry

Imagine you have a pile of mixed laundry:

  • Without GROUP BY: "I have 50 pieces of clothing total" (one big pile)
  • With GROUP BY: "I have 20 shirts, 15 pants, 10 socks, 5 towels" (sorted into categories)

GROUP BY organizes your data into meaningful categories before counting/summing!

Try It: Count by Department

How many employees are in each department?

Output will appear here...

Try It: Average Salary by Department

What's the average salary in each department?

Output will appear here...

Try It: Multiple Aggregates by Group

Let's see comprehensive statistics for each department:

Output will appear here...

🔍 How GROUP BY Works Step-by-Step

When you run a GROUP BY query, the database:

  1. Filters rows using WHERE clause (if present)
  2. Groups rows that have identical values in GROUP BY columns
  3. Calculates aggregates for each group independently
  4. Returns one row per group with the aggregate results
-- Example data:
-- Engineering: Alice (95k), Bob (85k), Charlie (90k)
-- Sales: Diana (75k), Eve (80k)

-- With GROUP BY department:
-- Group 1 (Engineering): COUNT=3, AVG=90k
-- Group 2 (Sales): COUNT=2, AVG=77.5k

Important Rule: SELECT and GROUP BY

Every column in SELECT must be either:

  • Listed in the GROUP BY clause, OR
  • Inside an aggregate function
✅ Correct
SELECT department, COUNT(*)
FROM employees
GROUP BY department;

department is in GROUP BY

❌ Wrong
SELECT department, name, COUNT(*)
FROM employees
GROUP BY department;

name is not in GROUP BY or aggregate

Try It: Group by Multiple Columns

You can group by multiple columns! Count employees by department AND hire year:

Output will appear here...

Real-World Example: Sales Analytics

-- Monthly revenue
SELECT
    STRFTIME('%Y-%m', order_date) AS month,
    COUNT(*) AS total_orders,
    SUM(order_total) AS revenue
FROM orders
GROUP BY STRFTIME('%Y-%m', order_date)
ORDER BY month DESC;

-- Top selling products
SELECT
    product_id,
    product_name,
    SUM(quantity) AS units_sold,
    SUM(price * quantity) AS revenue
FROM order_items
JOIN products ON order_items.product_id = products.id
GROUP BY product_id, product_name
ORDER BY revenue DESC
LIMIT 10;

-- Customer purchase frequency
SELECT
    customer_id,
    COUNT(*) AS order_count,
    AVG(order_total) AS avg_order_value,
    SUM(order_total) AS lifetime_value
FROM orders
GROUP BY customer_id
ORDER BY lifetime_value DESC;

HAVING Clause

The HAVING clause filters groups after aggregation, similar to how WHERE filters individual rows before aggregation.

HAVING Syntax:

SELECT column1, aggregate_function(column2)
FROM table_name
GROUP BY column1
HAVING aggregate_condition;
  • Filters groups based on aggregate values
  • Used after GROUP BY
  • Can use aggregate functions in conditions

Real-Life Analogy: The Guest List

Imagine you're organizing a party and grouping guests by city:

  • WHERE: "Only invite people over 21" (filter individuals before grouping)
  • HAVING: "Only include cities with 5+ guests" (filter groups after counting)

You filter individuals with WHERE, then filter groups with HAVING!

Try It: Departments with Many Employees

Which departments have more than 2 employees?

Output will appear here...

Try It: High Average Salary Departments

Which departments have an average salary above $85,000?

Output will appear here...

Try It: Multiple HAVING Conditions

Departments with 3+ employees AND average salary > $80,000:

Output will appear here...

🔍 HAVING vs WHERE: Execution Order

Understanding when each clause executes is crucial:

  1. FROM: Select the table
  2. WHERE: Filter individual rows (before grouping)
  3. GROUP BY: Group the remaining rows
  4. HAVING: Filter groups (after aggregation)
  5. SELECT: Choose columns to display
  6. ORDER BY: Sort the final results

This is why HAVING can use aggregate functions but WHERE cannot!

Real-World Example: Business Intelligence

-- High-value customers (spent over $1000)
SELECT
    customer_id,
    COUNT(*) AS order_count,
    SUM(order_total) AS total_spent
FROM orders
GROUP BY customer_id
HAVING SUM(order_total) > 1000
ORDER BY total_spent DESC;

-- Popular products (sold 100+ units)
SELECT
    product_id,
    product_name,
    SUM(quantity) AS units_sold
FROM order_items
JOIN products ON order_items.product_id = products.id
GROUP BY product_id, product_name
HAVING SUM(quantity) >= 100
ORDER BY units_sold DESC;

-- Active cities (10+ orders this month)
SELECT
    city,
    COUNT(*) AS order_count,
    SUM(order_total) AS revenue
FROM orders
JOIN customers ON orders.customer_id = customers.id
WHERE order_date >= DATE('now', 'start of month')
GROUP BY city
HAVING COUNT(*) >= 10
ORDER BY revenue DESC;

WHERE vs HAVING: Key Differences

Understanding when to use WHERE vs HAVING is crucial for writing correct queries.

WHERE Clause

Filters: Individual rows

Timing: Before GROUP BY

Use: Filter data before aggregation

Functions: Cannot use aggregates

HAVING Clause

Filters: Grouped results

Timing: After GROUP BY

Use: Filter groups after aggregation

Functions: Can use aggregate functions

Comparison Examples

WHERE (Before Grouping)
-- Filter rows first, then group
SELECT department, AVG(salary)
FROM employees
WHERE salary > 70000
GROUP BY department;

Calculates avg for employees earning 70k+

HAVING (After Grouping)
-- Group first, then filter groups
SELECT department, AVG(salary)
FROM employees
GROUP BY department
HAVING AVG(salary) > 70000;

Shows depts where avg salary is 70k+

Try It: WHERE Only

Count high earners (>$85k) in each department:

Output will appear here...

Try It: HAVING Only

Departments where average salary is above $85,000:

Output will appear here...

Try It: Both WHERE and HAVING

Filter individuals first, then filter groups:

Output will appear here...

🔍 Performance Tip: WHERE is Faster!

When possible, use WHERE instead of HAVING for better performance:

-- SLOWER: Filter after grouping
SELECT department, COUNT(*)
FROM employees
GROUP BY department
HAVING department = 'Engineering';

-- FASTER: Filter before grouping
SELECT department, COUNT(*)
FROM employees
WHERE department = 'Engineering'
GROUP BY department;

WHERE reduces the dataset before expensive GROUP BY operations!

Quick Decision Guide:

  • Use WHERE when: Filtering individual rows (salary > 80000, department = 'Sales')
  • Use HAVING when: Filtering groups with aggregates (AVG(salary) > 80000, COUNT(*) > 5)
  • Use both when: You need to filter rows AND groups

Multiple Aggregates in One Query

You can combine multiple aggregate functions in a single query to build comprehensive reports and dashboards.

Try It: Complete Department Statistics

Get all statistics for each department in one query:

Output will appear here...

Try It: Conditional Aggregates

Count different categories using CASE inside aggregates:

Output will appear here...

🔍 Creating Custom Metrics

You can create calculated fields using aggregate results:

SELECT
    department,
    COUNT(*) AS employees,
    SUM(salary) AS total_payroll,
    -- Average salary per employee
    ROUND(SUM(salary) / COUNT(*), 2) AS avg_salary,
    -- Percent of total company payroll
    ROUND(100.0 * SUM(salary) / (SELECT SUM(salary) FROM employees), 2) AS pct_of_total
FROM employees
GROUP BY department;

Real-World Example: Executive Dashboard

-- Comprehensive sales report
SELECT
    STRFTIME('%Y-%m', order_date) AS month,
    COUNT(*) AS total_orders,
    COUNT(DISTINCT customer_id) AS unique_customers,
    SUM(order_total) AS revenue,
    ROUND(AVG(order_total), 2) AS avg_order_value,
    MIN(order_total) AS min_order,
    MAX(order_total) AS max_order,
    -- Revenue per customer
    ROUND(SUM(order_total) / COUNT(DISTINCT customer_id), 2) AS revenue_per_customer
FROM orders
WHERE order_date >= DATE('now', '-12 months')
GROUP BY STRFTIME('%Y-%m', order_date)
ORDER BY month DESC;

Combining Aggregates with JOINs

The real power of SQL comes from combining JOINs with aggregate functions to analyze related data across multiple tables.

Pattern: JOIN then GROUP

SELECT table1.column, aggregate_function(table2.column)
FROM table1
JOIN table2 ON table1.id = table2.foreign_key
GROUP BY table1.column;

First JOIN the tables, then GROUP BY and aggregate the results!

Try It: Count Employees and Projects

How many employees work on each project?

Output will appear here...

Try It: Department Summary with Location

Show department statistics along with their location:

Output will appear here...

Try It: Employee Salary Distribution

Show employee count distribution by salary ranges:

Output will appear here...

🔍 Watch Out for JOIN Multiplication!

Be careful when joining tables with one-to-many relationships:

-- WRONG: This can count duplicates!
SELECT department, SUM(salary)
FROM employees
JOIN project_assignments ON employees.id = project_assignments.employee_id
GROUP BY department;

-- If an employee has 3 projects, their salary gets counted 3 times!

-- CORRECT: Use DISTINCT or structure your query differently
SELECT department, SUM(DISTINCT salary)
FROM employees
JOIN project_assignments ON employees.id = project_assignments.employee_id
GROUP BY department;

Real-World Example: E-commerce Analytics

-- Customer purchase summary
SELECT
    c.id,
    c.name,
    c.email,
    COUNT(o.id) AS order_count,
    COALESCE(SUM(o.order_total), 0) AS lifetime_value,
    ROUND(AVG(o.order_total), 2) AS avg_order_value,
    MAX(o.order_date) AS last_order_date
FROM customers c
LEFT JOIN orders o ON c.id = o.customer_id
GROUP BY c.id, c.name, c.email
ORDER BY lifetime_value DESC
LIMIT 20;

-- Product performance by category
SELECT
    cat.name AS category,
    COUNT(DISTINCT p.id) AS product_count,
    COUNT(oi.id) AS times_ordered,
    SUM(oi.quantity) AS units_sold,
    SUM(oi.price * oi.quantity) AS revenue
FROM categories cat
JOIN products p ON cat.id = p.category_id
LEFT JOIN order_items oi ON p.id = oi.product_id
GROUP BY cat.id, cat.name
ORDER BY revenue DESC;

Practice Exercises

Test your understanding of aggregate functions with these hands-on exercises!

Exercise 1: Total and Average Salary

Calculate the total and average salary for all employees. Round the average to 2 decimal places.

Output will appear here...

Exercise 2: Count Employees in Sales

How many employees work in the Sales department?

Output will appear here...

Exercise 3: Salary Statistics by Department

For each department, show the minimum, maximum, and average salary. Order by average salary descending.

Output will appear here...

Exercise 4: Large Departments

Find departments with more than 3 employees. Show department name and employee count.

Output will appear here...

Exercise 5: High-Paying Departments

Find departments where the average salary exceeds $80,000. Include department name, employee count, and average salary.

Output will appear here...

Exercise 6: Projects by Status

Count how many projects exist for each status. Show status and count, ordered by count descending.

Output will appear here...

Exercise 7: Salary Range Analysis

For employees earning over $70,000, count how many are in each department. Only show departments with at least 2 such employees.

Output will appear here...

Exercise 8: Complete Department Report

Create a comprehensive report showing: department, employee count, total payroll, average salary, and salary range (max - min). Order by total payroll descending.

Output will appear here...

Knowledge Check Quiz

Test your understanding of aggregate functions:

1. What does the COUNT(*) function return?

2. Which clause groups rows for aggregation?

3. What's the difference between WHERE and HAVING?

4. How does SUM() handle NULL values?

5. Which query finds departments with more than 5 employees?

6. What does COUNT(DISTINCT column) do?

7. In a SELECT with GROUP BY, columns must be either:

8. What does AVG(salary) calculate?