SQL Question Categories


SQL Categories

1 Computing Ratios

  • Compute ratio / percentage
  • Given user activities history / system logs
  • Examples
    • Calculate click through rate of an advertisement
    • Calculate retention rate of users
    • Query the percentage of users / product satisfying certain criteria

In this category of problem, we will be given a table of user activities / system logs, and we will need to write queries to aggregate the numerator and the denominator.

LeetCode 1173

Immediate Food Delivery I (Easy) [link]

If we want to find the percentage of immediate orders among all unique orders, considering all orders of each customer:

SELECT ROUND(
    (
        SELECT COUNT(*)
    FROM Delivery
    WHERE order_date = customer_pref_delivery_date
    ) / (
        SELECT COUNT(*)
        FROM Delivery
    ) * 100, 2
) AS immediate_percentage

LeetCode 1174

Immediate Food Delivery II (Medium) [link]

If we want to find the percentage of immediate orders among unique customer, only considering the first order ( the order with the earliest order date) of each customer:

WITH first_order AS(
    SELECT
      customer_id
  FROM Delivery
  GROUP BY customer_id
  HAVING MIN(order_date) = MIN(customer_pref_delivery_date)
);

SELECT
    ROUND(
      CAST(COUNT(customer_id) AS DECIMAL)
    / 
    (SELECT COUNT(DISTINCT customer_id) FROM Delivery) * 100
    , 2) AS immediate_percentage
FROM first_order;

All in one query

SELECT
    CAST(COUNT(customer_id) AS DECIMAL)
    FROM(
    SELECT
      customer_id
    FROM Delivery
    GROUP BY customer_id
    HAVING MIN(order_date) = MIN(customer_pref_delivery_date)
  ) 
  / 
  (SELECT COUNT(DISTINCT customer_id) FROM Delivery) * 100
  , 2) AS immediate_percentage
FROM first_order;

LeetCode 1211

Queries Quality and Percentage (Easy) [link]

First thought

SELECT query_name, 
    ROUND(AVG(ratio), 2) AS quality, 
    ROUND(AVG(poor_query) * 100, 2)  AS poor_query_percentage
FROM(
    SELECT *, 
        rating/position AS ratio,
        IF(rating < 3, 1, 0) AS poor_query
    FROM Queries
) t
GROUP BY query_name;

Simplified version

SELECT query_name,
    ROUND(AVG(rating/position), 2) AS quality,
    ROUND(AVG(IF(rating < 3, 1, 0)) * 100, 2) AS poor_query_percentage
FROM Queries
GROUP BY query_name;

LeetCode 1322

Ads Performance (Easy) [link]

SELECT DISTINCT ad_id, IFNULL(
    ROUND(
      SUM(action = 'Clicked') / (SUM(action = 'Clicked') + SUM(action = 'Viewed')) * 100
    , 2)
  , 0)
) AS ctr
FROM Ads
GROUP BY ad_id
ORDER BY ctr DESC, ad_id;

2 Data Categorization

Organize numerical data into categorical data

  • Generate a summary
  • Transform data into a more intuitive format
  • Examples
    • Numerical values -> Low, Medium, High
    • Monthly / yearly profit -> Net gain / loss

This type of question requires two steps:

  1. Use CASE WHEN to transform numerical data to categorical data
  2. Apply aggregation function to obtain the statistics

LeetCode 1212

Team Scores in Football Tournament (Medium) [link]

Using IF

# Write your MySQL query statement below
SELECT Teams.team_id, Teams.team_name, 
    IFNULL(SUM(tmp.num_points),0) as num_points
FROM Teams
LEFT JOIN (
    SELECT host_team AS team_id, 
        IF(host_goals > guest_goals, 3, 
                IF(host_goals = guest_goals, 1, 0)) AS num_points
    FROM Matches
    UNION ALL
    SELECT guest_team AS team_id,
        IF(host_goals > guest_goals, 0, 
            IF(host_goals = guest_goals, 1, 3)) AS num_points
    FROM Matches
) tmp
ON Teams.team_id = tmp.team_id
GROUP BY team_id
ORDER BY num_points DESC, team_id;

Using CASE_WHEN

SELECT team_id, team_name, num_points FROM (
    SELECT Teams.team_id, Teams.team_name, SUM(
        CASE
            WHEN Teams.team_id = Matches.host_team THEN
                CASE
                    WHEN Matches.host_goals > Matches.guest_goals THEN 3
                    WHEN Matches.host_goals = Matches.guest_goals THEN 1
                    ELSE 0
                END
            WHEN Teams.team_id = Matches.guest_team THEN
                CASE
                    WHEN Matches.host_goals < Matches.guest_goals THEN 3
                    WHEN Matches.host_goals = Matches.guest_goals THEN 1
                    ELSE 0
                END
            ELSE 0
        END
    ) AS num_points
        FROM Teams LEFT JOIN Matches
        ON Teams.team_id = Matches.host_team OR Teams.team_id = Matches.guest_team
        GROUP BY Teams.team_id
) result
    ORDER BY result.num_points DESC, result.team_id;

LeetCode 1393

Capital Gain Loss (Medium) [link]

SELECT stock_name, SUM(IF(operation = 'Buy', -price, price)) AS capital_gain_loss
FROM Stocks
GROUP BY stock_name;

LeetCode 1435

Create a Session Bar Chart (Easy) [link]

The following solution is problematic since we don’t consider edge case, that it is possible that there is no session falls within certain interval.

SELECT
    CASE
        WHEN duration < 300 THEN '[0-5>'
        WHEN duration < 600 THEN '[5-10>'
        WHEN duration < 900 THEN '[10-15>'
        ELSE '15 or more'
    END AS bin,
    COUNT(*) AS total
FROM Sessions
GROUP BY 1;

A better solution is

SELECT t1.bin, IFNULL(t2.total,0) AS total
FROM (
    SELECT '[0-5>' AS bin
    UNION
    SELECT '[5-10>'
    UNION
    SELECT '[10-15>'
    UNION
    SELECT '15 or more'
) as t1
LEFT JOIN (
    SELECT
        CASE 
            WHEN duration < 300 THEN '[0-5>'
            WHEN duration < 600 THEN '[5-10>'
            WHEN duration < 900 THEN '[10-15>'
            ELSE '15 or more'
        END AS bin,
        COUNT(*) AS total
    FROM Sessions
    GROUP BY 1
) t2
ON t1.bin = t2.bin;

Alternative solution

SELECT '[0-5>' AS bin, COUNT(session_id) AS total FROM Sessions WHERE round(duration / 60, 2) >= 0 AND round(duration / 60, 2) < 5
UNION
SELECT '[5-10>' AS bin, COUNT(session_id) AS total FROM Sessions WHERE round(duration / 60, 2) >= 5 AND round(duration / 60, 2) < 10
UNION
SELECT '[10-15>' AS bin, COUNT(session_id) AS total FROM Sessions WHERE round(duration / 60, 2) >= 10 AND round(duration / 60, 2) < 15
UNION
SELECT '15 or more' AS bin, COUNT(session_id) AS total FROM Sessions WHERE round(duration / 60, 2) >= 15;

LeetCode 1468

Calculate Salaries (Medium) [link]

SELECT s.company_id, s.employee_id, s.employee_name, (
    ROUND(CASE
               WHEN(SELECT MAX(t.salary) FROM Salaries t WHERE t.company_id = s.company_id) < 1000 THEN s.salary
               WHEN(SELECT MAX(t.salary) FROM Salaries t WHERE t.company_id = s.company_id) BETWEEN 1000 AND 10000 THEN s.salary * (1-0.24)
               ELSE s.salary * (1-0.49)
       END, 0)
) AS salary
FROM Salaries s;

Using a tmp table

SELECT s1.company_id, 
    s1.employee_id, 
    s1.employee_name,
    ROUND(s1.salary * tmp.without_tax) AS salary
FROM Salaries s1
LEFT JOIN (
    SELECT company_id, 
        IF(MAX(salary) < 1000, 1,
            IF(MAX(salary) < 10000, 0.76, 0.51)) AS without_tax
    FROM Salaries
    GROUP BY company_id
) tmp
ON s1.company_id = tmp.company_id;

Using window function

# Write your MySQL query statement below
SELECT company_id, employee_id, employee_name, 
    ROUND(
        CASE 
            WHEN MAX(salary) OVER(PARTITION BY company_id) < 1000 THEN salary
            WHEN MAX(salary) OVER(PARTITION BY company_id) BETWEEN 1000 and 10000 THEN salary * (1-0.24)
            WHEN MAX(salary) OVER(PARTITION BY company_id) > 10000 THEN salary*(1-0.49)
        END
    ) AS salary
FROM Salaries

LeetCode 1907

Count Salary Categories (Medium) [link]

SELECT 'Low Salary' AS category, COUNT(account_id) AS accounts_count FROM Accounts WHERE income < 20000
UNION
SELECT 'Average Salary' AS category, COUNT(account_id) AS accounts_count FROM Accounts WHERE income BETWEEN 20000 AND 50000
UNION
SELECT 'High Salary' AS category, COUNT(account_id) AS accounts_count FROM Accounts WHERE income > 50000;

Alternative solution

SELECT t.category, IFNULL(a.cnt, 0) AS accounts_count
FROM (
    SELECT 'Low Salary' category
    UNION
    SELECT 'Average Salary'
    UNION
    SELECT 'High Salary'
) AS t
LEFT JOIN (
    SELECT
        CASE WHEN income < 20000 THEN 'Low Salary'
             WHEN income > 50000 THEN 'High Salary'
             ELSE 'Average Salary'
        END
    AS category, COUNT(1) AS cnt
    FROM Accounts
    GROUP BY 1
) AS a
ON t.category = a.category;

Note that we are able to GROUP BY 1 the first newly created column. COUNT(1) is equal to COUNT(*).

3 Cumulative Sums

Often associated with time series data. Cumulative sum of the metric up till a particular date. For example,

  • Cumulative number of sales
  • Cumulative sales in a 7 day time window

LeetCode 534

Game Play Analysis I (Medium) [link]

Using the one-to-many mapping by JOIN:

SELECT a.player_id, a.event_date, SUM(b.games_played) AS games_played_so_far
FROM activity a
JOIN activity b
ON a.player_id = b. player_id
WHERE a.event_date >= b.event_date
GROUP BY a.player_id, b.event_date
ORDER BY a.player_id;

Using window functions:

SELECT player_id, event_date, 
    SUM(games_played) OVER(PARTITION BY player_id ORDER BY event_date) AS games_played_so_far
FROM Activity;

LeetCode 579

Find Cumulative Salary of an Employee (Hard) [link]

SELECT temp2.Id, temp2.Month, (
    SELECT SUM(e2.Salary) FROM Employee e2
        WHERE e2.Id = temp2.Id AND e2.Month BETWEEN temp2.Month - 2 AND temp2.Month
        ORDER BY e2.Month DESC 
        LIMIT 3
) AS Salary 
FROM (
    SELECT e1.Id, e1.Month, e1.Salary
    FROM Employee e1, (
        SELECT e.Id, MAX(e.Month) AS Month
        FROM Employee e
        GROUP BY e.Id
    ) temp1
    WHERE e1.Id = temp1.Id AND e1.Month < temp1.Month
    ORDER BY e1.Id ASC, e1.Month DESC
) temp2;

This is how to exclude the most recent month

SELECT e1.Id, e1.Month, e1.Salary
FROM Employee e1, (
    SELECT e.Id, MAX(e.Month) AS Month
    FROM Employee e
    GROUP BY e.Id
) temp1
WHERE e1.Id = temp1.Id AND e1.Month < temp1.Month
ORDER BY e1.Id ASC, e1.Month DESC;

Using window function rank() and RANGE 2 PRECEDING (faster).

SELECT id, month, Salary
FROM(
    SELECT id, month,
        SUM(salary) OVER(PARTITION BY id ORDER BY month RANGE 2 PRECEDING) AS Salary,
        rank() OVER(PARTITION BY id ORDER BY month DESC) AS rnk
    FROM Employee
) t
WHERE rnk > 1
ORDER BY id, month DESC;

4 Self Join

LeetCode 1270

All People Report to the Given Manager (Medium) [link]

Union the selected employees at each of the three stages.

1] An employee directly reports to manager 1

2] An employee reports to managerA, managerA reports to manager 1.

3] An employee reports to managerB, managerB reports to managerA, managerA reports to manager 1.

One method:

SELECT a.employee_id AS employee_id
FROM
    Employees AS a
    LEFT JOIN
    Employees AS b ON a.manager_id = b.employee_id
    LEFT JOIN
    Employees AS c ON b.manager_id = c.employee_id
    LEFT JOIN
    Employees AS d ON c.manager_id = d.employee_id
WHERE
    a.employee_id != 1 AND d.employee_id = 1;

Explanation:

By joining by mapping the employee id of the next table to the manager id of the previous table, we keep track of the IDs that are able to report to manager 1. By selecting a.employee_id !=1 and d.employee_id = 1, we select all IDs that are able to report to manager 1 within three steps.

SELECT 
    a.employee_id AS AID, 
    a.manager_id AS Ama,
    b.employee_id AS BID,
    b.manager_id AS Bma,
    c.employee_id AS CID,
    c.manager_id AS Cma,
    d.employee_id AS DID,
    d.manager_id AS Dma
FROM
Employees as a
left join
Employees as b on a.manager_id = b.employee_id
left join
Employees as c on b.manager_id = c.employee_id
left join
Employees as d on c.manager_id = d.employee_id

# +-----+-----+------+------+------+------+------+------+
# | AID | Ama | BID  | Bma  | CID  | Cma  | DID  | Dma  |
# +-----+-----+------+------+------+------+------+------+
# |   1 |   1 |    1 |    1 |    1 |    1 |    1 |    1 |
# |   2 |   1 |    1 |    1 |    1 |    1 |    1 |    1 |
# |  77 |   1 |    1 |    1 |    1 |    1 |    1 |    1 |
# |   4 |   2 |    2 |    1 |    1 |    1 |    1 |    1 |
# |   7 |   4 |    4 |    2 |    2 |    1 |    1 |    1 |
# |   3 |   3 |    3 |    3 |    3 |    3 |    3 |    3 |
# |   8 |   3 |    3 |    3 |    3 |    3 |    3 |    3 |
# |   9 |   8 |    8 |    3 |    3 |    3 |    3 |    3 |
# +-----+-----+------+------+------+------+------+------+

Alternative method (same idea):

SELECT employee_id AS EMPLOYEE_ID 
FROM Employees 
WHERE manager_id IN (
    SELECT employee_id 
    FROM Employees 
    WHERE manager_id IN (
        SELECT employee_id 
        FROM Employees 
        WHERE manager_id = 1
    )
) AND employee_id != 1

LeetCode 1811

Find Interview Candidates (Medium) [link]

WHERE Condition1 OR Condition2; Advanced case of using WHERE to filter table considering multiple tables.

SELECT u.name, u.mail 
FROM Users u 
WHERE(
    SELECT COUNT(c1.contest_id)
    FROM Contests c1, Contests c2, Contests c3
    WHERE c2.contest_id - c1.contest_id = 1 
        AND c3.contest_id - c2.contest_id = 1
        AND (c1.gold_medal = u.user_id OR c1.silver_medal = u.user_id OR c1.bronze_medal = u.user_id)
        AND (c2.gold_medal = u.user_id OR c2.silver_medal = u.user_id OR c2.bronze_medal = u.user_id)
        AND (c3.gold_medal = u.user_id OR c3.silver_medal = u.user_id OR c3.bronze_medal = u.user_id)
    ) >= 1 
OR(
    SELECT COUNT(*) 
    FROM Contests c 
    WHERE c.gold_medal = u.user_id
) >= 3;

Note that this is how to find consecutive medals

SELECT COUNT(c1.contest_id)
FROM Contests c1, Contests c2, Contests c3
WHERE c2.contest_id - c1.contest_id = 1 
    AND c3.contest_id - c2.contest_id = 1
    AND (c1.gold_medal = u.user_id OR c1.silver_medal = u.user_id OR c1.bronze_medal = u.user_id)
    AND (c2.gold_medal = u.user_id OR c2.silver_medal = u.user_id OR c2.bronze_medal = u.user_id)
    AND (c3.gold_medal = u.user_id OR c3.silver_medal = u.user_id OR c3.bronze_medal = u.user_id)

LeetCode 1285

Find the Start and End Number of Continuous Ranges (Medium) [link]

First idea: use dense_rank() to determine the range and then determine the start and end id in each range.

with startid as (
    SELECT log_id AS start_id
    FROM (
        SELECT log_id, 
            dense_rank() OVER(PARTITION BY rnk_diff ORDER BY log_id DESC) AS desc_rnk,
            dense_rank() OVER(PARTITION BY rnk_diff ORDER BY log_id) AS asc_rnk
        FROM(
            SELECT log_id,
                log_id - dense_rank() OVER(ORDER BY log_id) AS rnk_diff
            FROM Logs
        ) t
    ) tt
    WHERE asc_rnk = 1
),

endid as (
    SELECT log_id AS end_id
    FROM (
        SELECT log_id, 
            dense_rank() OVER(PARTITION BY rnk_diff ORDER BY log_id DESC) AS desc_rnk,
            dense_rank() OVER(PARTITION BY rnk_diff ORDER BY log_id) AS asc_rnk
        FROM(
            SELECT log_id,
                log_id - dense_rank() OVER(ORDER BY log_id) AS rnk_diff
            FROM Logs
        ) t
    ) tt
    WHERE desc_rnk = 1
)

SELECT startid.start_id as start_id, MIN(endid.end_id) as end_id
FROM startid, endid
WHERE startid.start_id <= endid.end_id
GROUP BY startid.start_id;

For each value log in table Logs, if log - 1 does not exist in Logs, then log is the start id of an interval. If log + 1 does not exist in Logs, then log is the end id of an interval. Join the start ids and the end ids to obtain the result.

Note that we need to group by start id and select the min end id, since after (one-to-many) joining start id and end id, for each start id, there are multiple end ids. We need the smallest one that’s larger than the start id.

SELECT log_start.log_id AS START_ID, MIN(log_end.log_id) AS END_ID 
FROM 
    (SELECT log_id FROM Log WHERE log_id - 1 NOT IN (SELECT * FROM Log)) log_start,
    (SELECT log_id FROM Log WHERE log_id + 1 NOT IN (SELECT * FROM Log)) log_end
WHERE log_start.log_id <= log_end.log_id
GROUP BY log_start.log_id;

  TOC