Subqueries & Advanced Filtering
Master the art of nested queries! Learn how to use subqueries in WHERE, FROM, and SELECT clauses to solve complex data problems elegantly and efficiently.
What is a Subquery?
A subquery (also called an inner query or nested query) is a query embedded inside another query. It allows you to use the results of one query as input for another.
Real-Life Analogy: Asking Follow-Up Questions
Imagine you're planning a party:
- Main Question: "Who should I invite?"
- Sub-Question: "First, what's the average age of my friends?" (calculate this)
- Answer: "Invite everyone within 5 years of the average age" (use the result)
In SQL, the sub-question is a subquery that runs first, and its result is used by the main query!
Basic Subquery Structure:
SELECT column
FROM table
WHERE column operator (
SELECT column
FROM table
WHERE condition
);
- The inner query (in parentheses) executes first
- Its result is passed to the outer query
- The outer query uses that result to filter or calculate
🔍 Behind the Scenes: How Subqueries Execute
When you run a query with a subquery:
- Step 1: Database executes the innermost subquery first
- Step 2: Result is temporarily stored in memory
- Step 3: Outer query uses this result as if it were a literal value
- Step 4: Final results are returned to you
Think of it like solving math: solve what's in parentheses first, then use that answer!
WHERE Subqueries
Filter rows based on results from another query
Use: "Employees earning above average"
FROM Subqueries
Use query results as a temporary table
Use: "Derived tables for complex logic"
SELECT Subqueries
Calculate values for each row
Use: "Show each employee's salary vs department avg"
Correlated Subqueries
Reference outer query in inner query
Use: "Compare each row to its group"
EXISTS Subqueries
Check if any rows exist
Use: "Employees who have projects"
IN Subqueries
Check if value matches any result
Use: "Products in specific categories"
Simple Subqueries in WHERE
The most common type of subquery appears in the WHERE clause, filtering rows based on a calculated value.
Pattern: Single-Value Subquery
SELECT columns
FROM table
WHERE column > (SELECT AVG(column) FROM table);
The subquery must return a single value when using comparison operators (=, >, <, etc.)
Try It: Employees Above Average Salary
Find all employees who earn more than the average salary:
🔍 What Just Happened?
Let's break down the execution:
-- Step 1: Inner query calculates average
SELECT AVG(salary) FROM employees;
-- Returns: 87000 (example)
-- Step 2: Outer query uses this value
SELECT name, email, salary
FROM employees
WHERE salary > 87000 -- The calculated average
ORDER BY salary DESC;
The subquery runs once, returns one value, and the main query uses it!
Try It: Projects Above Average Budget
Find projects with budgets higher than the average:
Try It: Highest Salary
Find employees who earn the maximum salary:
Real-World Example: E-commerce Analytics
-- Products priced above average
SELECT product_name, price
FROM products
WHERE price > (SELECT AVG(price) FROM products);
-- Orders larger than average order value
SELECT order_id, customer_id, order_total
FROM orders
WHERE order_total > (SELECT AVG(order_total) FROM orders)
ORDER BY order_total DESC;
-- Customers who spent more than average
SELECT customer_id, name, total_spent
FROM customers
WHERE total_spent > (SELECT AVG(total_spent) FROM customers);
Subqueries with IN Operator
The IN operator lets you check if a value matches any value in a list
returned by a subquery. Perfect when the subquery returns multiple values!
IN Subquery Syntax:
SELECT columns
FROM table
WHERE column IN (
SELECT column
FROM another_table
WHERE condition
);
The subquery can return multiple rows, and IN checks if the value matches any of them.
Real-Life Analogy: The Guest List
Imagine checking if someone is on the VIP list:
- Subquery: "Get all VIP names" → [Alice, Bob, Carol]
- Main Query: "Is this person IN the VIP list?"
- Result: True if their name matches any VIP, False otherwise
Try It: Employees in Specific Departments
Find employees in departments with more than 2 people:
Try It: High Earners in Engineering
Find employees in Engineering earning above company average:
Try It: NOT IN - Employees Below Average
Find employees earning below the company average salary:
🔍 Watch Out for NULL Values with NOT IN!
NOT IN behaves unexpectedly with NULL values:
-- If subquery returns: (1, 2, NULL)
WHERE id NOT IN (1, 2, NULL)
-- This returns NO ROWS!
-- Why? NOT IN is equivalent to:
WHERE id != 1 AND id != 2 AND id != NULL
-- Comparing to NULL always returns NULL (not TRUE or FALSE)
-- Solution: Filter out NULLs in the subquery
WHERE id NOT IN (
SELECT employee_id
FROM project_assignments
WHERE employee_id IS NOT NULL -- Important!
)
Real-World Example: Customer Segmentation
-- Customers who made purchases this month
SELECT customer_id, name, email
FROM customers
WHERE customer_id IN (
SELECT DISTINCT customer_id
FROM orders
WHERE order_date >= DATE('now', 'start of month')
);
-- Products that were never ordered
SELECT product_id, product_name, price
FROM products
WHERE product_id NOT IN (
SELECT DISTINCT product_id
FROM order_items
WHERE product_id IS NOT NULL
);
-- High-value customers (lifetime value > $1000)
SELECT name, email
FROM customers
WHERE customer_id IN (
SELECT customer_id
FROM orders
GROUP BY customer_id
HAVING SUM(order_total) > 1000
);
Subqueries with Comparison Operators
You can use ANY and ALL keywords with comparison operators to compare
against multiple values from a subquery.
ANY and ALL Syntax:
-- ANY: True if comparison is true for ANY value
WHERE column > ANY (subquery)
-- ALL: True if comparison is true for ALL values
WHERE column > ALL (subquery)
> ANY: Greater than at least one value (similar to > MIN)> ALL: Greater than every value (same as > MAX)= ANY: Equal to any value (same as IN)
Real-Life Analogy: The Competition
Imagine comparing your test score to others:
- > ANY: "Did I score higher than at least one person?" (beat the minimum)
- > ALL: "Did I score higher than everyone?" (beat the maximum)
Try It: Salary Greater Than ANY in Sales
Find non-Sales employees earning more than ANY Sales employee:
Try It: Salary Greater Than ALL in Sales
Find employees earning more than ALL Sales employees:
Equivalent Queries:
Using ANY/ALL
-- Greater than ANY
WHERE salary > ANY (...)
-- Greater than ALL
WHERE salary > ALL (...)
Using MIN/MAX
-- Same as > ANY
WHERE salary > (SELECT MIN(salary) ...)
-- Same as > ALL
WHERE salary > (SELECT MAX(salary) ...)
MIN/MAX is usually clearer and more commonly used!
Try It: Alternative with MIN/MAX
Same results using MIN and MAX (more readable):
EXISTS Operator
The EXISTS operator checks if a subquery returns any rows. It returns
TRUE if the subquery finds at least one matching row, FALSE otherwise.
EXISTS Syntax:
SELECT columns
FROM table1
WHERE EXISTS (
SELECT 1
FROM table2
WHERE table2.id = table1.id
);
- Returns TRUE if subquery finds any rows
- Returns FALSE if subquery finds no rows
- More efficient than IN for large datasets
- Often used with correlated subqueries
Real-Life Analogy: The Membership Check
Imagine checking gym memberships:
- Question: "Does this person have an active membership?"
- EXISTS: Look for ANY active membership record
- Result: TRUE if found (even one), FALSE if not found
EXISTS doesn't care HOW MANY matches exist, just whether ANY exist!
Try It: Employees with High Salaries
Find employees earning above $85,000:
Try It: Active Projects List
Find projects with status 'Active':
🔍 EXISTS vs IN: Performance Difference
EXISTS is often faster than IN because it stops as soon as it finds ONE match:
-- EXISTS: Stops at first match
WHERE EXISTS (
SELECT 1
FROM orders
WHERE orders.customer_id = customers.id
)
-- Finds first order, stops immediately, returns TRUE
-- IN: Must get all values
WHERE customer_id IN (
SELECT customer_id
FROM orders
)
-- Must get ALL customer_ids, then check membership
Use EXISTS for "does any exist?" and IN for "is in this specific list?"
Why SELECT 1 in EXISTS?
You'll often see SELECT 1 in EXISTS subqueries:
WHERE EXISTS (
SELECT 1 -- Could be *, NULL, or any value
FROM table
WHERE condition
)
EXISTS only checks if rows exist, not what data they contain. SELECT 1 is a
convention meaning "we don't care about the actual values."
Try It: Top Earners Above Average
Find employees earning more than the overall average:
Real-World Example: Customer Activity
-- Active customers (made at least one purchase this year)
SELECT customer_id, name, email
FROM customers c
WHERE EXISTS (
SELECT 1
FROM orders o
WHERE o.customer_id = c.customer_id
AND STRFTIME('%Y', o.order_date) = '2024'
);
-- Products with reviews
SELECT product_id, product_name, price
FROM products p
WHERE EXISTS (
SELECT 1
FROM reviews r
WHERE r.product_id = p.product_id
);
-- Customers who bought premium products
SELECT name, email
FROM customers c
WHERE EXISTS (
SELECT 1
FROM orders o
JOIN order_items oi ON o.order_id = oi.order_id
JOIN products p ON oi.product_id = p.product_id
WHERE o.customer_id = c.customer_id
AND p.category = 'Premium'
);
NOT EXISTS
NOT EXISTS is the opposite of EXISTS. It returns TRUE when the subquery finds no
rows, making it perfect for finding missing relationships.
NOT EXISTS Syntax:
SELECT columns
FROM table1
WHERE NOT EXISTS (
SELECT 1
FROM table2
WHERE table2.id = table1.id
);
Returns rows from table1 that have NO matching rows in table2.
Try It: Employees Earning Below Average
Find employees earning below the company average:
Try It: Projects Without Assignments
Find projects that have no employees assigned:
NOT EXISTS vs NOT IN
NOT EXISTS (Recommended)
WHERE NOT EXISTS (
SELECT 1
FROM table2
WHERE table2.id = table1.id
)
✅ Handles NULL correctly
✅ Often faster
✅ Stops at first match
NOT IN (Watch for NULLs!)
WHERE id NOT IN (
SELECT id
FROM table2
WHERE id IS NOT NULL
)
⚠️ Breaks with NULL
⚠️ Must get all values
⚠️ Needs NULL filter
Recommendation: Prefer NOT EXISTS over NOT IN for better NULL handling and performance!
Try It: Completed Projects Only
Find projects that are completed (not active):
Real-World Example: Finding Gaps
-- Inactive customers (no orders in the last 6 months)
SELECT customer_id, name, email, last_login
FROM customers c
WHERE NOT EXISTS (
SELECT 1
FROM orders o
WHERE o.customer_id = c.customer_id
AND o.order_date >= DATE('now', '-6 months')
);
-- Products never purchased
SELECT product_id, product_name, price, stock
FROM products p
WHERE NOT EXISTS (
SELECT 1
FROM order_items oi
WHERE oi.product_id = p.product_id
);
-- Employees who never attended training
SELECT employee_id, name, hire_date
FROM employees e
WHERE NOT EXISTS (
SELECT 1
FROM training_attendance t
WHERE t.employee_id = e.employee_id
)
AND hire_date < DATE('now', '-90 days');
Subqueries in FROM Clause (Derived Tables)
You can use a subquery in the FROM clause to create a derived table (also called an inline view). This temporary result set can be queried like a regular table.
Derived Table Syntax:
SELECT columns
FROM (
SELECT columns
FROM table
WHERE condition
) AS derived_table_name
WHERE another_condition;
- Subquery creates a temporary table
- Must give it an alias (AS name)
- Can query it like any other table
- Great for multi-step logic
Real-Life Analogy: The Intermediate Step
Think of it like cooking:
- Step 1: Chop vegetables (subquery creates derived table)
- Step 2: Cook the chopped vegetables (main query uses derived table)
You can't cook un-chopped vegetables! The derived table is your prepared ingredients.
Try It: Department Salary Statistics
Get statistics about departments, then filter them:
Try It: Top Earners Per Department
Find the highest earner in each department:
Try It: Ranking and Filtering
Create numbered rows, then filter by that number:
🔍 Why Use Derived Tables?
Derived tables help you:
- Break complex logic into steps - Easier to read and debug
- Filter aggregate results - WHERE can't filter aggregates, but you can filter the derived table
- Join aggregated data - Join summary statistics back to detail rows
- Reuse calculations - Calculate once, use multiple times
Real-World Example: Sales Analysis
-- Monthly sales with year-over-year comparison
SELECT
current.month,
current.revenue AS current_year_revenue,
previous.revenue AS previous_year_revenue,
ROUND(100.0 * (current.revenue - previous.revenue) / previous.revenue, 2) AS growth_pct
FROM (
SELECT
STRFTIME('%m', order_date) AS month,
SUM(order_total) AS revenue
FROM orders
WHERE STRFTIME('%Y', order_date) = '2024'
GROUP BY STRFTIME('%m', order_date)
) AS current
LEFT JOIN (
SELECT
STRFTIME('%m', order_date) AS month,
SUM(order_total) AS revenue
FROM orders
WHERE STRFTIME('%Y', order_date) = '2023'
GROUP BY STRFTIME('%m', order_date)
) AS previous ON current.month = previous.month
ORDER BY current.month;
Subqueries in SELECT Clause (Scalar Subqueries)
A scalar subquery in the SELECT clause returns a single value for each row in the result set. It's like adding a calculated column based on related data.
Scalar Subquery Syntax:
SELECT
column1,
column2,
(SELECT calculation FROM table2 WHERE condition) AS calculated_column
FROM table1;
- Must return exactly ONE value (single row, single column)
- Executes for each row in the outer query
- Useful for adding related calculations
Try It: Employee Salary vs Department Average
Show each employee's salary compared to their department average:
Try It: Project with Team Count
Show each project with its team size:
Try It: Multiple Calculated Columns
Add several subquery calculations:
🔍 Performance Warning: SELECT Subqueries Can Be Slow
SELECT subqueries execute once per row:
-- If you have 1000 employees, this runs 1001 queries!
SELECT
name,
(SELECT AVG(salary) FROM employees WHERE department_id = e.department_id)
FROM employees e;
-- 1 outer query + 1000 subqueries = slow!
-- BETTER: Use JOIN to pre-calculate
SELECT
e.name,
dept_avg.avg_salary
FROM employees e
JOIN (
SELECT department_id, AVG(salary) AS avg_salary
FROM employees
GROUP BY department_id
) dept_avg ON e.department_id = dept_avg.department_id;
Real-World Example: Customer Dashboard
-- Customer summary with related calculations
SELECT
c.customer_id,
c.name,
c.email,
(SELECT COUNT(*)
FROM orders o
WHERE o.customer_id = c.customer_id) AS order_count,
(SELECT COALESCE(SUM(order_total), 0)
FROM orders o
WHERE o.customer_id = c.customer_id) AS lifetime_value,
(SELECT MAX(order_date)
FROM orders o
WHERE o.customer_id = c.customer_id) AS last_order_date,
(SELECT AVG(order_total)
FROM orders o
WHERE o.customer_id = c.customer_id) AS avg_order_value,
CASE
WHEN (SELECT COUNT(*) FROM orders o WHERE o.customer_id = c.customer_id) > 10
THEN 'VIP'
WHEN (SELECT COUNT(*) FROM orders o WHERE o.customer_id = c.customer_id) > 5
THEN 'Regular'
ELSE 'New'
END AS customer_tier
FROM customers c
ORDER BY lifetime_value DESC
LIMIT 20;
Performance Considerations
Subqueries are powerful but can impact performance. Understanding when and how to optimize them is crucial for working with large datasets.
✅ Good: Simple Subqueries
Uncorrelated subqueries that run once
Example: WHERE salary > (SELECT AVG(salary) FROM employees)
✅ Good: EXISTS/NOT EXISTS
Stops at first match, very efficient
Example: WHERE EXISTS (SELECT 1...)
⚠️ Careful: Correlated Subqueries
Runs once per row, can be slow
Tip: Consider JOIN alternative
⚠️ Careful: SELECT Subqueries
Executes for every row returned
Tip: Use derived tables instead
❌ Avoid: NOT IN with NULLs
Can return unexpected results
Use: NOT EXISTS instead
🚀 Optimize: Use Indexes
Index columns used in subquery joins
Benefit: Dramatic speed improvement
Optimization Examples
❌ Slow Correlated Subquery
SELECT name, salary
FROM employees e1
WHERE salary > (
SELECT AVG(salary)
FROM employees e2
WHERE e2.dept = e1.dept
);
Runs subquery for each employee
✅ Fast JOIN Alternative
SELECT e.name, e.salary
FROM employees e
JOIN (
SELECT dept, AVG(salary) AS avg
FROM employees
GROUP BY dept
) d ON e.dept = d.dept
WHERE e.salary > d.avg;
Calculates averages once, then joins
🔍 Query Execution Analysis
To understand performance, consider:
- How many times does the subquery execute? Once or once per row?
- How many rows does it scan? Can indexes help?
- Can it be rewritten as a JOIN? JOINs are often optimized better
- Does it need all the data? LIMIT can help scalar subqueries
When to Use Each Type:
| Scenario | Best Choice | Why |
|---|---|---|
| Single calculated value | Simple subquery | Runs once, very fast |
| Check if relationship exists | EXISTS | Stops at first match |
| Match against list | IN (small lists) | Clear and simple |
| Match against large list | EXISTS or JOIN | Better performance |
| Multi-step calculations | Derived table (FROM) | Break logic into steps |
| Per-row calculations | JOIN to derived table | Calculate once, not per row |
Practice Exercises
Master subqueries with these challenging exercises!
Exercise 1: Above Average Employees
Find all employees earning more than the company-wide average salary. Show name, department, and salary.
Exercise 2: Employees in Large Departments
Find employees who work in departments with more than 4 people. Use IN and a subquery.
Exercise 3: Department Top Earners
Find employees who earn more than their department's average. Use a correlated subquery.
Exercise 4: Employees with Projects (EXISTS)
Use EXISTS to find employees who are assigned to at least one project.
Exercise 5: Employees Without Projects (NOT EXISTS)
Use NOT EXISTS to find employees with no project assignments.
Exercise 6: Highest Earner Per Department
Use a derived table to find the employee with the highest salary in each department.
Exercise 7: Salary Comparison Column
Add a column showing each employee's salary compared to the company average. Use a SELECT subquery.
Exercise 8: Complex Multi-Subquery Challenge
Find departments where the average salary is above the company average AND have at least 3 employees. Show department, employee count, and average salary.
Knowledge Check Quiz
Test your understanding of subqueries:
1. What is a subquery?
2. When does a simple (uncorrelated) subquery execute?
3. What does EXISTS check?
4. What's the main difference between a simple and correlated subquery?
5. Why is NOT EXISTS preferred over NOT IN?
6. What must you do when using a subquery in FROM?
7. A scalar subquery in SELECT must return:
8. Which is generally fastest for large datasets?