SQL Categories
1 Computing Ratios
- Compute ratio / percentage
- Given user activities history / system logs
- Examples
- Calculate click through rate of an advertisement
- Calculate retention rate of users
- Query the percentage of users / product satisfying certain criteria
In this category of problem, we will be given a table of user activities / system logs, and we will need to write queries to aggregate the numerator and the denominator.
LeetCode 1173
Immediate Food Delivery I (Easy) [link]
If we want to find the percentage of immediate orders among all unique orders, considering all orders of each customer:
SELECT ROUND(
(
SELECT COUNT(*)
FROM Delivery
WHERE order_date = customer_pref_delivery_date
) / (
SELECT COUNT(*)
FROM Delivery
) * 100, 2
) AS immediate_percentage
LeetCode 1174
Immediate Food Delivery II (Medium) [link]
If we want to find the percentage of immediate orders among unique customer, only considering the first order ( the order with the earliest order date) of each customer:
WITH first_order AS(
SELECT
customer_id
FROM Delivery
GROUP BY customer_id
HAVING MIN(order_date) = MIN(customer_pref_delivery_date)
);
SELECT
ROUND(
CAST(COUNT(customer_id) AS DECIMAL)
/
(SELECT COUNT(DISTINCT customer_id) FROM Delivery) * 100
, 2) AS immediate_percentage
FROM first_order;
All in one query
SELECT
CAST(COUNT(customer_id) AS DECIMAL)
FROM(
SELECT
customer_id
FROM Delivery
GROUP BY customer_id
HAVING MIN(order_date) = MIN(customer_pref_delivery_date)
)
/
(SELECT COUNT(DISTINCT customer_id) FROM Delivery) * 100
, 2) AS immediate_percentage
FROM first_order;
LeetCode 1211
Queries Quality and Percentage (Easy) [link]
First thought
SELECT query_name,
ROUND(AVG(ratio), 2) AS quality,
ROUND(AVG(poor_query) * 100, 2) AS poor_query_percentage
FROM(
SELECT *,
rating/position AS ratio,
IF(rating < 3, 1, 0) AS poor_query
FROM Queries
) t
GROUP BY query_name;
Simplified version
SELECT query_name,
ROUND(AVG(rating/position), 2) AS quality,
ROUND(AVG(IF(rating < 3, 1, 0)) * 100, 2) AS poor_query_percentage
FROM Queries
GROUP BY query_name;
LeetCode 1322
Ads Performance (Easy) [link]
SELECT DISTINCT ad_id, IFNULL(
ROUND(
SUM(action = 'Clicked') / (SUM(action = 'Clicked') + SUM(action = 'Viewed')) * 100
, 2)
, 0)
) AS ctr
FROM Ads
GROUP BY ad_id
ORDER BY ctr DESC, ad_id;
2 Data Categorization
Organize numerical data into categorical data
- Generate a summary
- Transform data into a more intuitive format
- Examples
- Numerical values -> Low, Medium, High
- Monthly / yearly profit -> Net gain / loss
This type of question requires two steps:
- Use CASE WHEN to transform numerical data to categorical data
- Apply aggregation function to obtain the statistics
LeetCode 1212
Team Scores in Football Tournament (Medium) [link]
Using IF
# Write your MySQL query statement below
SELECT Teams.team_id, Teams.team_name,
IFNULL(SUM(tmp.num_points),0) as num_points
FROM Teams
LEFT JOIN (
SELECT host_team AS team_id,
IF(host_goals > guest_goals, 3,
IF(host_goals = guest_goals, 1, 0)) AS num_points
FROM Matches
UNION ALL
SELECT guest_team AS team_id,
IF(host_goals > guest_goals, 0,
IF(host_goals = guest_goals, 1, 3)) AS num_points
FROM Matches
) tmp
ON Teams.team_id = tmp.team_id
GROUP BY team_id
ORDER BY num_points DESC, team_id;
Using CASE_WHEN
SELECT team_id, team_name, num_points FROM (
SELECT Teams.team_id, Teams.team_name, SUM(
CASE
WHEN Teams.team_id = Matches.host_team THEN
CASE
WHEN Matches.host_goals > Matches.guest_goals THEN 3
WHEN Matches.host_goals = Matches.guest_goals THEN 1
ELSE 0
END
WHEN Teams.team_id = Matches.guest_team THEN
CASE
WHEN Matches.host_goals < Matches.guest_goals THEN 3
WHEN Matches.host_goals = Matches.guest_goals THEN 1
ELSE 0
END
ELSE 0
END
) AS num_points
FROM Teams LEFT JOIN Matches
ON Teams.team_id = Matches.host_team OR Teams.team_id = Matches.guest_team
GROUP BY Teams.team_id
) result
ORDER BY result.num_points DESC, result.team_id;
LeetCode 1393
Capital Gain Loss (Medium) [link]
SELECT stock_name, SUM(IF(operation = 'Buy', -price, price)) AS capital_gain_loss
FROM Stocks
GROUP BY stock_name;
LeetCode 1435
Create a Session Bar Chart (Easy) [link]
The following solution is problematic since we don’t consider edge case, that it is possible that there is no session falls within certain interval.
SELECT
CASE
WHEN duration < 300 THEN '[0-5>'
WHEN duration < 600 THEN '[5-10>'
WHEN duration < 900 THEN '[10-15>'
ELSE '15 or more'
END AS bin,
COUNT(*) AS total
FROM Sessions
GROUP BY 1;
A better solution is
SELECT t1.bin, IFNULL(t2.total,0) AS total
FROM (
SELECT '[0-5>' AS bin
UNION
SELECT '[5-10>'
UNION
SELECT '[10-15>'
UNION
SELECT '15 or more'
) as t1
LEFT JOIN (
SELECT
CASE
WHEN duration < 300 THEN '[0-5>'
WHEN duration < 600 THEN '[5-10>'
WHEN duration < 900 THEN '[10-15>'
ELSE '15 or more'
END AS bin,
COUNT(*) AS total
FROM Sessions
GROUP BY 1
) t2
ON t1.bin = t2.bin;
Alternative solution
SELECT '[0-5>' AS bin, COUNT(session_id) AS total FROM Sessions WHERE round(duration / 60, 2) >= 0 AND round(duration / 60, 2) < 5
UNION
SELECT '[5-10>' AS bin, COUNT(session_id) AS total FROM Sessions WHERE round(duration / 60, 2) >= 5 AND round(duration / 60, 2) < 10
UNION
SELECT '[10-15>' AS bin, COUNT(session_id) AS total FROM Sessions WHERE round(duration / 60, 2) >= 10 AND round(duration / 60, 2) < 15
UNION
SELECT '15 or more' AS bin, COUNT(session_id) AS total FROM Sessions WHERE round(duration / 60, 2) >= 15;
LeetCode 1468
Calculate Salaries (Medium) [link]
SELECT s.company_id, s.employee_id, s.employee_name, (
ROUND(CASE
WHEN(SELECT MAX(t.salary) FROM Salaries t WHERE t.company_id = s.company_id) < 1000 THEN s.salary
WHEN(SELECT MAX(t.salary) FROM Salaries t WHERE t.company_id = s.company_id) BETWEEN 1000 AND 10000 THEN s.salary * (1-0.24)
ELSE s.salary * (1-0.49)
END, 0)
) AS salary
FROM Salaries s;
Using a tmp table
SELECT s1.company_id,
s1.employee_id,
s1.employee_name,
ROUND(s1.salary * tmp.without_tax) AS salary
FROM Salaries s1
LEFT JOIN (
SELECT company_id,
IF(MAX(salary) < 1000, 1,
IF(MAX(salary) < 10000, 0.76, 0.51)) AS without_tax
FROM Salaries
GROUP BY company_id
) tmp
ON s1.company_id = tmp.company_id;
Using window function
# Write your MySQL query statement below
SELECT company_id, employee_id, employee_name,
ROUND(
CASE
WHEN MAX(salary) OVER(PARTITION BY company_id) < 1000 THEN salary
WHEN MAX(salary) OVER(PARTITION BY company_id) BETWEEN 1000 and 10000 THEN salary * (1-0.24)
WHEN MAX(salary) OVER(PARTITION BY company_id) > 10000 THEN salary*(1-0.49)
END
) AS salary
FROM Salaries
LeetCode 1907
Count Salary Categories (Medium) [link]
SELECT 'Low Salary' AS category, COUNT(account_id) AS accounts_count FROM Accounts WHERE income < 20000
UNION
SELECT 'Average Salary' AS category, COUNT(account_id) AS accounts_count FROM Accounts WHERE income BETWEEN 20000 AND 50000
UNION
SELECT 'High Salary' AS category, COUNT(account_id) AS accounts_count FROM Accounts WHERE income > 50000;
Alternative solution
SELECT t.category, IFNULL(a.cnt, 0) AS accounts_count
FROM (
SELECT 'Low Salary' category
UNION
SELECT 'Average Salary'
UNION
SELECT 'High Salary'
) AS t
LEFT JOIN (
SELECT
CASE WHEN income < 20000 THEN 'Low Salary'
WHEN income > 50000 THEN 'High Salary'
ELSE 'Average Salary'
END
AS category, COUNT(1) AS cnt
FROM Accounts
GROUP BY 1
) AS a
ON t.category = a.category;
Note that we are able to GROUP BY 1
the first newly created column. COUNT(1)
is equal to COUNT(*)
.
3 Cumulative Sums
Often associated with time series data. Cumulative sum of the metric up till a particular date. For example,
- Cumulative number of sales
- Cumulative sales in a 7 day time window
LeetCode 534
Game Play Analysis I (Medium) [link]
Using the one-to-many mapping by JOIN
:
SELECT a.player_id, a.event_date, SUM(b.games_played) AS games_played_so_far
FROM activity a
JOIN activity b
ON a.player_id = b. player_id
WHERE a.event_date >= b.event_date
GROUP BY a.player_id, b.event_date
ORDER BY a.player_id;
Using window functions:
SELECT player_id, event_date,
SUM(games_played) OVER(PARTITION BY player_id ORDER BY event_date) AS games_played_so_far
FROM Activity;
LeetCode 579
Find Cumulative Salary of an Employee (Hard) [link]
SELECT temp2.Id, temp2.Month, (
SELECT SUM(e2.Salary) FROM Employee e2
WHERE e2.Id = temp2.Id AND e2.Month BETWEEN temp2.Month - 2 AND temp2.Month
ORDER BY e2.Month DESC
LIMIT 3
) AS Salary
FROM (
SELECT e1.Id, e1.Month, e1.Salary
FROM Employee e1, (
SELECT e.Id, MAX(e.Month) AS Month
FROM Employee e
GROUP BY e.Id
) temp1
WHERE e1.Id = temp1.Id AND e1.Month < temp1.Month
ORDER BY e1.Id ASC, e1.Month DESC
) temp2;
This is how to exclude the most recent month
SELECT e1.Id, e1.Month, e1.Salary
FROM Employee e1, (
SELECT e.Id, MAX(e.Month) AS Month
FROM Employee e
GROUP BY e.Id
) temp1
WHERE e1.Id = temp1.Id AND e1.Month < temp1.Month
ORDER BY e1.Id ASC, e1.Month DESC;
Using window function rank()
and RANGE 2 PRECEDING
(faster).
SELECT id, month, Salary
FROM(
SELECT id, month,
SUM(salary) OVER(PARTITION BY id ORDER BY month RANGE 2 PRECEDING) AS Salary,
rank() OVER(PARTITION BY id ORDER BY month DESC) AS rnk
FROM Employee
) t
WHERE rnk > 1
ORDER BY id, month DESC;
4 Self Join
LeetCode 1270
All People Report to the Given Manager (Medium) [link]
Union the selected employees at each of the three stages.
1] An employee directly reports to manager 1
2] An employee reports to managerA, managerA reports to manager 1.
3] An employee reports to managerB, managerB reports to managerA, managerA reports to manager 1.
One method:
SELECT a.employee_id AS employee_id
FROM
Employees AS a
LEFT JOIN
Employees AS b ON a.manager_id = b.employee_id
LEFT JOIN
Employees AS c ON b.manager_id = c.employee_id
LEFT JOIN
Employees AS d ON c.manager_id = d.employee_id
WHERE
a.employee_id != 1 AND d.employee_id = 1;
Explanation:
By joining by mapping the employee id of the next table to the manager id of the previous table, we keep track of the IDs that are able to report to manager 1. By selecting a.employee_id !=1
and d.employee_id = 1
, we select all IDs that are able to report to manager 1 within three steps.
SELECT
a.employee_id AS AID,
a.manager_id AS Ama,
b.employee_id AS BID,
b.manager_id AS Bma,
c.employee_id AS CID,
c.manager_id AS Cma,
d.employee_id AS DID,
d.manager_id AS Dma
FROM
Employees as a
left join
Employees as b on a.manager_id = b.employee_id
left join
Employees as c on b.manager_id = c.employee_id
left join
Employees as d on c.manager_id = d.employee_id
# +-----+-----+------+------+------+------+------+------+
# | AID | Ama | BID | Bma | CID | Cma | DID | Dma |
# +-----+-----+------+------+------+------+------+------+
# | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
# | 2 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
# | 77 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
# | 4 | 2 | 2 | 1 | 1 | 1 | 1 | 1 |
# | 7 | 4 | 4 | 2 | 2 | 1 | 1 | 1 |
# | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 |
# | 8 | 3 | 3 | 3 | 3 | 3 | 3 | 3 |
# | 9 | 8 | 8 | 3 | 3 | 3 | 3 | 3 |
# +-----+-----+------+------+------+------+------+------+
Alternative method (same idea):
SELECT employee_id AS EMPLOYEE_ID
FROM Employees
WHERE manager_id IN (
SELECT employee_id
FROM Employees
WHERE manager_id IN (
SELECT employee_id
FROM Employees
WHERE manager_id = 1
)
) AND employee_id != 1
LeetCode 1811
Find Interview Candidates (Medium) [link]
WHERE Condition1 OR Condition2;
Advanced case of using WHERE
to filter table considering multiple tables.
SELECT u.name, u.mail
FROM Users u
WHERE(
SELECT COUNT(c1.contest_id)
FROM Contests c1, Contests c2, Contests c3
WHERE c2.contest_id - c1.contest_id = 1
AND c3.contest_id - c2.contest_id = 1
AND (c1.gold_medal = u.user_id OR c1.silver_medal = u.user_id OR c1.bronze_medal = u.user_id)
AND (c2.gold_medal = u.user_id OR c2.silver_medal = u.user_id OR c2.bronze_medal = u.user_id)
AND (c3.gold_medal = u.user_id OR c3.silver_medal = u.user_id OR c3.bronze_medal = u.user_id)
) >= 1
OR(
SELECT COUNT(*)
FROM Contests c
WHERE c.gold_medal = u.user_id
) >= 3;
Note that this is how to find consecutive medals
SELECT COUNT(c1.contest_id)
FROM Contests c1, Contests c2, Contests c3
WHERE c2.contest_id - c1.contest_id = 1
AND c3.contest_id - c2.contest_id = 1
AND (c1.gold_medal = u.user_id OR c1.silver_medal = u.user_id OR c1.bronze_medal = u.user_id)
AND (c2.gold_medal = u.user_id OR c2.silver_medal = u.user_id OR c2.bronze_medal = u.user_id)
AND (c3.gold_medal = u.user_id OR c3.silver_medal = u.user_id OR c3.bronze_medal = u.user_id)
LeetCode 1285
Find the Start and End Number of Continuous Ranges (Medium) [link]
First idea: use dense_rank()
to determine the range and then determine the start and end id in each range.
with startid as (
SELECT log_id AS start_id
FROM (
SELECT log_id,
dense_rank() OVER(PARTITION BY rnk_diff ORDER BY log_id DESC) AS desc_rnk,
dense_rank() OVER(PARTITION BY rnk_diff ORDER BY log_id) AS asc_rnk
FROM(
SELECT log_id,
log_id - dense_rank() OVER(ORDER BY log_id) AS rnk_diff
FROM Logs
) t
) tt
WHERE asc_rnk = 1
),
endid as (
SELECT log_id AS end_id
FROM (
SELECT log_id,
dense_rank() OVER(PARTITION BY rnk_diff ORDER BY log_id DESC) AS desc_rnk,
dense_rank() OVER(PARTITION BY rnk_diff ORDER BY log_id) AS asc_rnk
FROM(
SELECT log_id,
log_id - dense_rank() OVER(ORDER BY log_id) AS rnk_diff
FROM Logs
) t
) tt
WHERE desc_rnk = 1
)
SELECT startid.start_id as start_id, MIN(endid.end_id) as end_id
FROM startid, endid
WHERE startid.start_id <= endid.end_id
GROUP BY startid.start_id;
For each value log
in table Logs
, if log - 1
does not exist in Logs
, then log
is the start id of an interval. If log + 1
does not exist in Logs
, then log
is the end id of an interval. Join the start ids and the end ids to obtain the result.
Note that we need to group by start id and select the min end id, since after (one-to-many) joining start id and end id, for each start id, there are multiple end ids. We need the smallest one that’s larger than the start id.
SELECT log_start.log_id AS START_ID, MIN(log_end.log_id) AS END_ID
FROM
(SELECT log_id FROM Log WHERE log_id - 1 NOT IN (SELECT * FROM Log)) log_start,
(SELECT log_id FROM Log WHERE log_id + 1 NOT IN (SELECT * FROM Log)) log_end
WHERE log_start.log_id <= log_end.log_id
GROUP BY log_start.log_id;