Chapter 5

Subqueries & Advanced Filtering

Master the art of nested queries! Learn how to use subqueries in WHERE, FROM, and SELECT clauses to solve complex data problems elegantly and efficiently.

What is a Subquery?

A subquery (also called an inner query or nested query) is a query embedded inside another query. It allows you to use the results of one query as input for another.

Real-Life Analogy: Asking Follow-Up Questions

Imagine you're planning a party:

  • Main Question: "Who should I invite?"
  • Sub-Question: "First, what's the average age of my friends?" (calculate this)
  • Answer: "Invite everyone within 5 years of the average age" (use the result)

In SQL, the sub-question is a subquery that runs first, and its result is used by the main query!

Basic Subquery Structure:

SELECT column
FROM table
WHERE column operator (
    SELECT column
    FROM table
    WHERE condition
);
  • The inner query (in parentheses) executes first
  • Its result is passed to the outer query
  • The outer query uses that result to filter or calculate

🔍 Behind the Scenes: How Subqueries Execute

When you run a query with a subquery:

  1. Step 1: Database executes the innermost subquery first
  2. Step 2: Result is temporarily stored in memory
  3. Step 3: Outer query uses this result as if it were a literal value
  4. Step 4: Final results are returned to you

Think of it like solving math: solve what's in parentheses first, then use that answer!

WHERE Subqueries

Filter rows based on results from another query

Use: "Employees earning above average"

FROM Subqueries

Use query results as a temporary table

Use: "Derived tables for complex logic"

SELECT Subqueries

Calculate values for each row

Use: "Show each employee's salary vs department avg"

Correlated Subqueries

Reference outer query in inner query

Use: "Compare each row to its group"

EXISTS Subqueries

Check if any rows exist

Use: "Employees who have projects"

IN Subqueries

Check if value matches any result

Use: "Products in specific categories"

Simple Subqueries in WHERE

The most common type of subquery appears in the WHERE clause, filtering rows based on a calculated value.

Pattern: Single-Value Subquery

SELECT columns
FROM table
WHERE column > (SELECT AVG(column) FROM table);

The subquery must return a single value when using comparison operators (=, >, <, etc.)

Try It: Employees Above Average Salary

Find all employees who earn more than the average salary:

Output will appear here...

🔍 What Just Happened?

Let's break down the execution:

-- Step 1: Inner query calculates average
SELECT AVG(salary) FROM employees;
-- Returns: 87000 (example)

-- Step 2: Outer query uses this value
SELECT name, email, salary
FROM employees
WHERE salary > 87000  -- The calculated average
ORDER BY salary DESC;

The subquery runs once, returns one value, and the main query uses it!

Try It: Projects Above Average Budget

Find projects with budgets higher than the average:

Output will appear here...

Try It: Highest Salary

Find employees who earn the maximum salary:

Output will appear here...

Real-World Example: E-commerce Analytics

-- Products priced above average
SELECT product_name, price
FROM products
WHERE price > (SELECT AVG(price) FROM products);

-- Orders larger than average order value
SELECT order_id, customer_id, order_total
FROM orders
WHERE order_total > (SELECT AVG(order_total) FROM orders)
ORDER BY order_total DESC;

-- Customers who spent more than average
SELECT customer_id, name, total_spent
FROM customers
WHERE total_spent > (SELECT AVG(total_spent) FROM customers);

Subqueries with IN Operator

The IN operator lets you check if a value matches any value in a list returned by a subquery. Perfect when the subquery returns multiple values!

IN Subquery Syntax:

SELECT columns
FROM table
WHERE column IN (
    SELECT column
    FROM another_table
    WHERE condition
);

The subquery can return multiple rows, and IN checks if the value matches any of them.

Real-Life Analogy: The Guest List

Imagine checking if someone is on the VIP list:

  • Subquery: "Get all VIP names" → [Alice, Bob, Carol]
  • Main Query: "Is this person IN the VIP list?"
  • Result: True if their name matches any VIP, False otherwise

Try It: Employees in Specific Departments

Find employees in departments with more than 2 people:

Output will appear here...

Try It: High Earners in Engineering

Find employees in Engineering earning above company average:

Output will appear here...

Try It: NOT IN - Employees Below Average

Find employees earning below the company average salary:

Output will appear here...

🔍 Watch Out for NULL Values with NOT IN!

NOT IN behaves unexpectedly with NULL values:

-- If subquery returns: (1, 2, NULL)
WHERE id NOT IN (1, 2, NULL)
-- This returns NO ROWS!

-- Why? NOT IN is equivalent to:
WHERE id != 1 AND id != 2 AND id != NULL
-- Comparing to NULL always returns NULL (not TRUE or FALSE)

-- Solution: Filter out NULLs in the subquery
WHERE id NOT IN (
    SELECT employee_id
    FROM project_assignments
    WHERE employee_id IS NOT NULL  -- Important!
)

Real-World Example: Customer Segmentation

-- Customers who made purchases this month
SELECT customer_id, name, email
FROM customers
WHERE customer_id IN (
    SELECT DISTINCT customer_id
    FROM orders
    WHERE order_date >= DATE('now', 'start of month')
);

-- Products that were never ordered
SELECT product_id, product_name, price
FROM products
WHERE product_id NOT IN (
    SELECT DISTINCT product_id
    FROM order_items
    WHERE product_id IS NOT NULL
);

-- High-value customers (lifetime value > $1000)
SELECT name, email
FROM customers
WHERE customer_id IN (
    SELECT customer_id
    FROM orders
    GROUP BY customer_id
    HAVING SUM(order_total) > 1000
);

Subqueries with Comparison Operators

You can use ANY and ALL keywords with comparison operators to compare against multiple values from a subquery.

ANY and ALL Syntax:

-- ANY: True if comparison is true for ANY value
WHERE column > ANY (subquery)

-- ALL: True if comparison is true for ALL values
WHERE column > ALL (subquery)
  • > ANY: Greater than at least one value (similar to > MIN)
  • > ALL: Greater than every value (same as > MAX)
  • = ANY: Equal to any value (same as IN)

Real-Life Analogy: The Competition

Imagine comparing your test score to others:

  • > ANY: "Did I score higher than at least one person?" (beat the minimum)
  • > ALL: "Did I score higher than everyone?" (beat the maximum)

Try It: Salary Greater Than ANY in Sales

Find non-Sales employees earning more than ANY Sales employee:

Output will appear here...

Try It: Salary Greater Than ALL in Sales

Find employees earning more than ALL Sales employees:

Output will appear here...

Equivalent Queries:

Using ANY/ALL
-- Greater than ANY
WHERE salary > ANY (...)

-- Greater than ALL
WHERE salary > ALL (...)
Using MIN/MAX
-- Same as > ANY
WHERE salary > (SELECT MIN(salary) ...)

-- Same as > ALL
WHERE salary > (SELECT MAX(salary) ...)

MIN/MAX is usually clearer and more commonly used!

Try It: Alternative with MIN/MAX

Same results using MIN and MAX (more readable):

Output will appear here...

Correlated Subqueries

A correlated subquery references columns from the outer query. Unlike simple subqueries, correlated subqueries execute once for each row in the outer query.

Correlated Subquery Pattern:

SELECT outer.column
FROM table outer
WHERE outer.column > (
    SELECT AVG(inner.column)
    FROM table inner
    WHERE inner.category = outer.category  -- References outer query!
);

The inner query uses values from the current row of the outer query.

Real-Life Analogy: Personalized Comparison

Imagine grading students:

  • Simple Subquery: "Which students scored above the overall class average?" (one calculation for all)
  • Correlated Subquery: "Which students scored above their grade level's average?" (calculate separately for each grade)

Correlated subqueries make a personalized comparison for each row!

Try It: Above Department Average

Find employees earning more than their department's average:

Output will appear here...

🔍 How Correlated Subqueries Execute

For each row in the outer query:

  1. Outer query processes one row (e.g., Alice in Engineering)
  2. Inner subquery runs using that row's values (avg salary for Engineering)
  3. Comparison is made (Alice's salary > Engineering average?)
  4. If true, row is included in results
  5. Repeat for next row (e.g., Bob in Engineering)
-- For Alice (Engineering, $95k):
-- Inner query calculates: AVG(salary) WHERE department_id = 1 (Engineering) → $90k
-- Comparison: $95k > $90k → TRUE, include Alice

-- For Eve (Sales, $70k):
-- Inner query calculates: AVG(salary) WHERE department_id = 2 (Sales) → $72.7k
-- Comparison: $70k > $72.7k → FALSE, exclude Eve

Try It: Projects with Above-Average Team Size

Find projects that have more employees than the average for their status:

Output will appear here...

Performance Warning!

Correlated subqueries can be slow because they execute many times:

-- SLOW: Runs subquery for each row (1000 rows = 1000 executions)
SELECT name, salary
FROM employees e1
WHERE salary > (
    SELECT AVG(salary)
    FROM employees e2
    WHERE e2.department_id = e1.department_id
);

-- FASTER: Join to pre-calculated averages (2 queries instead of 1000)
SELECT e.name, e.salary
FROM employees e
JOIN (
    SELECT department_id, AVG(salary) AS avg_salary
    FROM employees
    GROUP BY department_id
) dept_avg ON e.department_id = dept_avg.department_id
WHERE e.salary > dept_avg.avg_salary;

Use JOINs when possible for better performance!

EXISTS Operator

The EXISTS operator checks if a subquery returns any rows. It returns TRUE if the subquery finds at least one matching row, FALSE otherwise.

EXISTS Syntax:

SELECT columns
FROM table1
WHERE EXISTS (
    SELECT 1
    FROM table2
    WHERE table2.id = table1.id
);
  • Returns TRUE if subquery finds any rows
  • Returns FALSE if subquery finds no rows
  • More efficient than IN for large datasets
  • Often used with correlated subqueries

Real-Life Analogy: The Membership Check

Imagine checking gym memberships:

  • Question: "Does this person have an active membership?"
  • EXISTS: Look for ANY active membership record
  • Result: TRUE if found (even one), FALSE if not found

EXISTS doesn't care HOW MANY matches exist, just whether ANY exist!

Try It: Employees with High Salaries

Find employees earning above $85,000:

Output will appear here...

Try It: Active Projects List

Find projects with status 'Active':

Output will appear here...

🔍 EXISTS vs IN: Performance Difference

EXISTS is often faster than IN because it stops as soon as it finds ONE match:

-- EXISTS: Stops at first match
WHERE EXISTS (
    SELECT 1
    FROM orders
    WHERE orders.customer_id = customers.id
)
-- Finds first order, stops immediately, returns TRUE

-- IN: Must get all values
WHERE customer_id IN (
    SELECT customer_id
    FROM orders
)
-- Must get ALL customer_ids, then check membership

Use EXISTS for "does any exist?" and IN for "is in this specific list?"

Why SELECT 1 in EXISTS?

You'll often see SELECT 1 in EXISTS subqueries:

WHERE EXISTS (
    SELECT 1  -- Could be *, NULL, or any value
    FROM table
    WHERE condition
)

EXISTS only checks if rows exist, not what data they contain. SELECT 1 is a convention meaning "we don't care about the actual values."

Try It: Top Earners Above Average

Find employees earning more than the overall average:

Output will appear here...

Real-World Example: Customer Activity

-- Active customers (made at least one purchase this year)
SELECT customer_id, name, email
FROM customers c
WHERE EXISTS (
    SELECT 1
    FROM orders o
    WHERE o.customer_id = c.customer_id
    AND STRFTIME('%Y', o.order_date) = '2024'
);

-- Products with reviews
SELECT product_id, product_name, price
FROM products p
WHERE EXISTS (
    SELECT 1
    FROM reviews r
    WHERE r.product_id = p.product_id
);

-- Customers who bought premium products
SELECT name, email
FROM customers c
WHERE EXISTS (
    SELECT 1
    FROM orders o
    JOIN order_items oi ON o.order_id = oi.order_id
    JOIN products p ON oi.product_id = p.product_id
    WHERE o.customer_id = c.customer_id
    AND p.category = 'Premium'
);

NOT EXISTS

NOT EXISTS is the opposite of EXISTS. It returns TRUE when the subquery finds no rows, making it perfect for finding missing relationships.

NOT EXISTS Syntax:

SELECT columns
FROM table1
WHERE NOT EXISTS (
    SELECT 1
    FROM table2
    WHERE table2.id = table1.id
);

Returns rows from table1 that have NO matching rows in table2.

Try It: Employees Earning Below Average

Find employees earning below the company average:

Output will appear here...

Try It: Projects Without Assignments

Find projects that have no employees assigned:

Output will appear here...

NOT EXISTS vs NOT IN

NOT EXISTS (Recommended)
WHERE NOT EXISTS (
    SELECT 1
    FROM table2
    WHERE table2.id = table1.id
)

✅ Handles NULL correctly
✅ Often faster
✅ Stops at first match

NOT IN (Watch for NULLs!)
WHERE id NOT IN (
    SELECT id
    FROM table2
    WHERE id IS NOT NULL
)

⚠️ Breaks with NULL
⚠️ Must get all values
⚠️ Needs NULL filter

Recommendation: Prefer NOT EXISTS over NOT IN for better NULL handling and performance!

Try It: Completed Projects Only

Find projects that are completed (not active):

Output will appear here...

Real-World Example: Finding Gaps

-- Inactive customers (no orders in the last 6 months)
SELECT customer_id, name, email, last_login
FROM customers c
WHERE NOT EXISTS (
    SELECT 1
    FROM orders o
    WHERE o.customer_id = c.customer_id
    AND o.order_date >= DATE('now', '-6 months')
);

-- Products never purchased
SELECT product_id, product_name, price, stock
FROM products p
WHERE NOT EXISTS (
    SELECT 1
    FROM order_items oi
    WHERE oi.product_id = p.product_id
);

-- Employees who never attended training
SELECT employee_id, name, hire_date
FROM employees e
WHERE NOT EXISTS (
    SELECT 1
    FROM training_attendance t
    WHERE t.employee_id = e.employee_id
)
AND hire_date < DATE('now', '-90 days');

Subqueries in FROM Clause (Derived Tables)

You can use a subquery in the FROM clause to create a derived table (also called an inline view). This temporary result set can be queried like a regular table.

Derived Table Syntax:

SELECT columns
FROM (
    SELECT columns
    FROM table
    WHERE condition
) AS derived_table_name
WHERE another_condition;
  • Subquery creates a temporary table
  • Must give it an alias (AS name)
  • Can query it like any other table
  • Great for multi-step logic

Real-Life Analogy: The Intermediate Step

Think of it like cooking:

  • Step 1: Chop vegetables (subquery creates derived table)
  • Step 2: Cook the chopped vegetables (main query uses derived table)

You can't cook un-chopped vegetables! The derived table is your prepared ingredients.

Try It: Department Salary Statistics

Get statistics about departments, then filter them:

Output will appear here...

Try It: Top Earners Per Department

Find the highest earner in each department:

Output will appear here...

Try It: Ranking and Filtering

Create numbered rows, then filter by that number:

Output will appear here...

🔍 Why Use Derived Tables?

Derived tables help you:

  1. Break complex logic into steps - Easier to read and debug
  2. Filter aggregate results - WHERE can't filter aggregates, but you can filter the derived table
  3. Join aggregated data - Join summary statistics back to detail rows
  4. Reuse calculations - Calculate once, use multiple times

Real-World Example: Sales Analysis

-- Monthly sales with year-over-year comparison
SELECT
    current.month,
    current.revenue AS current_year_revenue,
    previous.revenue AS previous_year_revenue,
    ROUND(100.0 * (current.revenue - previous.revenue) / previous.revenue, 2) AS growth_pct
FROM (
    SELECT
        STRFTIME('%m', order_date) AS month,
        SUM(order_total) AS revenue
    FROM orders
    WHERE STRFTIME('%Y', order_date) = '2024'
    GROUP BY STRFTIME('%m', order_date)
) AS current
LEFT JOIN (
    SELECT
        STRFTIME('%m', order_date) AS month,
        SUM(order_total) AS revenue
    FROM orders
    WHERE STRFTIME('%Y', order_date) = '2023'
    GROUP BY STRFTIME('%m', order_date)
) AS previous ON current.month = previous.month
ORDER BY current.month;

Subqueries in SELECT Clause (Scalar Subqueries)

A scalar subquery in the SELECT clause returns a single value for each row in the result set. It's like adding a calculated column based on related data.

Scalar Subquery Syntax:

SELECT
    column1,
    column2,
    (SELECT calculation FROM table2 WHERE condition) AS calculated_column
FROM table1;
  • Must return exactly ONE value (single row, single column)
  • Executes for each row in the outer query
  • Useful for adding related calculations

Try It: Employee Salary vs Department Average

Show each employee's salary compared to their department average:

Output will appear here...

Try It: Project with Team Count

Show each project with its team size:

Output will appear here...

Try It: Multiple Calculated Columns

Add several subquery calculations:

Output will appear here...

🔍 Performance Warning: SELECT Subqueries Can Be Slow

SELECT subqueries execute once per row:

-- If you have 1000 employees, this runs 1001 queries!
SELECT
    name,
    (SELECT AVG(salary) FROM employees WHERE department_id = e.department_id)
FROM employees e;

-- 1 outer query + 1000 subqueries = slow!

-- BETTER: Use JOIN to pre-calculate
SELECT
    e.name,
    dept_avg.avg_salary
FROM employees e
JOIN (
    SELECT department_id, AVG(salary) AS avg_salary
    FROM employees
    GROUP BY department_id
) dept_avg ON e.department_id = dept_avg.department_id;

Real-World Example: Customer Dashboard

-- Customer summary with related calculations
SELECT
    c.customer_id,
    c.name,
    c.email,
    (SELECT COUNT(*)
     FROM orders o
     WHERE o.customer_id = c.customer_id) AS order_count,
    (SELECT COALESCE(SUM(order_total), 0)
     FROM orders o
     WHERE o.customer_id = c.customer_id) AS lifetime_value,
    (SELECT MAX(order_date)
     FROM orders o
     WHERE o.customer_id = c.customer_id) AS last_order_date,
    (SELECT AVG(order_total)
     FROM orders o
     WHERE o.customer_id = c.customer_id) AS avg_order_value,
    CASE
        WHEN (SELECT COUNT(*) FROM orders o WHERE o.customer_id = c.customer_id) > 10
        THEN 'VIP'
        WHEN (SELECT COUNT(*) FROM orders o WHERE o.customer_id = c.customer_id) > 5
        THEN 'Regular'
        ELSE 'New'
    END AS customer_tier
FROM customers c
ORDER BY lifetime_value DESC
LIMIT 20;

Performance Considerations

Subqueries are powerful but can impact performance. Understanding when and how to optimize them is crucial for working with large datasets.

✅ Good: Simple Subqueries

Uncorrelated subqueries that run once

Example: WHERE salary > (SELECT AVG(salary) FROM employees)

✅ Good: EXISTS/NOT EXISTS

Stops at first match, very efficient

Example: WHERE EXISTS (SELECT 1...)

⚠️ Careful: Correlated Subqueries

Runs once per row, can be slow

Tip: Consider JOIN alternative

⚠️ Careful: SELECT Subqueries

Executes for every row returned

Tip: Use derived tables instead

❌ Avoid: NOT IN with NULLs

Can return unexpected results

Use: NOT EXISTS instead

🚀 Optimize: Use Indexes

Index columns used in subquery joins

Benefit: Dramatic speed improvement

Optimization Examples

❌ Slow Correlated Subquery
SELECT name, salary
FROM employees e1
WHERE salary > (
    SELECT AVG(salary)
    FROM employees e2
    WHERE e2.dept = e1.dept
);

Runs subquery for each employee

✅ Fast JOIN Alternative
SELECT e.name, e.salary
FROM employees e
JOIN (
    SELECT dept, AVG(salary) AS avg
    FROM employees
    GROUP BY dept
) d ON e.dept = d.dept
WHERE e.salary > d.avg;

Calculates averages once, then joins

🔍 Query Execution Analysis

To understand performance, consider:

  • How many times does the subquery execute? Once or once per row?
  • How many rows does it scan? Can indexes help?
  • Can it be rewritten as a JOIN? JOINs are often optimized better
  • Does it need all the data? LIMIT can help scalar subqueries

When to Use Each Type:

Scenario Best Choice Why
Single calculated value Simple subquery Runs once, very fast
Check if relationship exists EXISTS Stops at first match
Match against list IN (small lists) Clear and simple
Match against large list EXISTS or JOIN Better performance
Multi-step calculations Derived table (FROM) Break logic into steps
Per-row calculations JOIN to derived table Calculate once, not per row

Practice Exercises

Master subqueries with these challenging exercises!

Exercise 1: Above Average Employees

Find all employees earning more than the company-wide average salary. Show name, department, and salary.

Output will appear here...

Exercise 2: Employees in Large Departments

Find employees who work in departments with more than 4 people. Use IN and a subquery.

Output will appear here...

Exercise 3: Department Top Earners

Find employees who earn more than their department's average. Use a correlated subquery.

Output will appear here...

Exercise 4: Employees with Projects (EXISTS)

Use EXISTS to find employees who are assigned to at least one project.

Output will appear here...

Exercise 5: Employees Without Projects (NOT EXISTS)

Use NOT EXISTS to find employees with no project assignments.

Output will appear here...

Exercise 6: Highest Earner Per Department

Use a derived table to find the employee with the highest salary in each department.

Output will appear here...

Exercise 7: Salary Comparison Column

Add a column showing each employee's salary compared to the company average. Use a SELECT subquery.

Output will appear here...

Exercise 8: Complex Multi-Subquery Challenge

Find departments where the average salary is above the company average AND have at least 3 employees. Show department, employee count, and average salary.

Output will appear here...

Knowledge Check Quiz

Test your understanding of subqueries:

1. What is a subquery?

2. When does a simple (uncorrelated) subquery execute?

3. What does EXISTS check?

4. What's the main difference between a simple and correlated subquery?

5. Why is NOT EXISTS preferred over NOT IN?

6. What must you do when using a subquery in FROM?

7. A scalar subquery in SELECT must return:

8. Which is generally fastest for large datasets?