1. Knowledge Hub
  2. Data Manipulation
  3. SQL basics for Data Manipulation

Explain SELECT Statement with GROUP BY

In this article we will walk you through the GROUP BY clause that operates on groups of rows. You will also learn to compute aggregates that apply to these groups.

Problem : 

You need to perform an analysis that requires data aggregation for understanding of outcomes.

Solution

Pre-requisites

Before going through this article we strongly recommend that you go through the following articles

1. Getting started with SQL for Impact Measurement

2. Explaining SELECT Statement

3. Explaining SELECT Statement with WHERE

Purpose

In order to group similar data into sets and return a single value for each group, the GROUP BY clause is used. This clause is used in conjunction with the SELECT statement to group together dataset rows that contain identical data.

The columns to be retrieved are specified in the SELECT statement and separated by commas. Any of the aggregate functions can be used on one or more than one of the columns being retrieved.

To group the result set by one or more columns, the GROUP BY clause is frequently used in conjunction with the aggregate functions COUNT(), MAX(), MIN(), SUM(), and AVG().

Syntax of GROUP BY CLAUSE IS AS FOLLOWS:

SELECT column1, column2
FROM table_name
WHERE [ conditions ]
GROUP BY column1, column2
HAVING [conditions]
ORDER BY column1, column2


NOTE :

Important Points to remember 

  • GROUP BY clause is used with the SELECT statement.
  • In the query, GROUP BY clause is placed after the WHERE clause.
  • In the query, GROUP BY clause is placed before ORDER BY clause if used any.
  • In the query , Group BY clause is placed before Having clause .
  • Place condition in the HAVING clause 

Aggregate Functions

The five aggregate functions that we can use with the SELECT  statement are:

  • AVG(): Calculates the average of the set of values.
  • COUNT(): Returns the count of rows.
  • SUM(): Calculates the arithmetic sum of the set of numeric values.
  • MAX(): From a group of values, returns the maximum value.
  • MIN(): From a group of values, returns the minimum value.

Let's understand how to use GROUP BY Clause

Let us consider the Survey table that is shown below to understand GROUP BY clause.

id name Did you get a job after completing the course? position What is your annual compensation? How satisfied are you with the following - Fell supported at work How satisfied are you with the following- Facilities at work How satisfied are you with the following-Feel safe at work Are you actively interviewing right now?
1 Madhukar Prabhakara Yes Data Analyst 70000 Somewhat satisfied Not Satisfied Very Satisfied No
2 Lore Berron Cadenas Yes Software Engineer 60000 Not Satisfied Not Satisfied Very Satisfied No
3 Latika Bahadur (Sopact) Yes Software Engineer 90000 Very Satisfied Not Satisfied Very Satisfied No
4 Vaishnavi Yes Software Engineer 83000 Very Satisfied Not Satisfied Neutral No
5 HETAL SHETH Yes Software Engineer 72000 Neutral Not Satisfied Neutral No
6 Ritvik No   0       No
7 Unmesh Sheth No   0       No
8 Shiji No   0       Yes
9 Ricardo Pinho No   0       Yes
10 Lorena No   0       No
11 Farhan No   0       Yes
12 Richa No   0       No
13 Phoram No   0       Yes
14 Arpitha  No   0       Yes
15 Adriano No   0       No
16 Caroline  No   0       No
17 Samantha No   0       Yes
18 Lily No   0       No

Examples

1. To know total number of people in each position,  then GROUP BY query would be as follows − 

SELECT position, count("position") FROM survey GROUP BY position

OR

SELECT position, count(*) FROM survey GROUP BY position

TIP : To adapt to your dataset, replace "position" column with a column from your dataset. Replace "survey" dataset with your dataset as well. Don't forget to add the column name to the GROUP BY clause.Include the aggregate function on the correct column.

* (asterisk) in count indicates the all or everything in the grouped column.

The output should be as follows :

position count
Software Engineer 4
Data Analyst 1
NULL 13

NOTE : This statement will group together the identical data in the  position column(grouped as 1, 2, & 3 based on Null, Data Analyst and Software Engineer) in the dataset as shown below. So, this statement will  first group them together then count the number of identical data in each group which gives the above result.

2.  To know total annual compensation in each position, then the GROUP BY query would be as follows -

SELECT position,SUM("What is your annual compensation?") FROM survey
group by position

TIP : Replace "position" column with a column from your dataset. Replace "survey" dataset with your dataset as well. Don't forget to add the column name to the GROUP BY clause.Include the aggregate function on the correct column. 

The output would be as follows :

position sum
NULL 0
Software Engineer 305000
Data Analyst 70000

NOTE : It will group together the identital data in the  position column(grouped as 1, 2, & 3 based on Null, Data Analyst and Software Engineer)) in the dataset as shown above in the fig.   So, this statement will  first group them together then perform sum on corresponding "What is  your annual compensation? column  - adding all the value of foresaid column in each group which gives the above result.

3. To know Maximum annual compensation in each position, then the GROUP BY query would be as follows -

 SELECT position,MAX("What is your annual compensation?") FROM survey
group by position

The output would be as follows :

position max
Software Engineer 90000
Data Analyst 70000
NULL 0

NOTE : The identical data in the position column of the dataset will be grouped together as group (1,2 & 3 based on Null, Data Analyst and Software Engineer)).  So, this statement will  first group them together then compute the Maximum on corresponding "What is  your annual compensation? column  - display maximum value of the foresaid column in each group which gives the above result.

4. To know Minimum and the Maximum annual compensation in each position, then the GROUP BY query would be as follows -

SELECT position,MAX("What is your annual compensation?"),
MIN("What is your annual compensation?") FROM survey
group by position

TIP :  You can add multiple aggregate function on GROUP BY.  

The output would be as follows :

position max min
NULL 0 0
Software Engineer 90000 60000
Data Analyst 70000 70000

NOTE :  The identical data in the position column of the dataset will be grouped together into 3 groups ( 1, 2 and 3 based on Null, Data Analyst and Software Enginerr)  So, this statement will  first group position column together then compute the Maximum and Minimum on corresponding "What is  your annual compensation? column  - display maximum and minimum value of the foresaid column in each group which gives the above result.

5. To know total respondent to the survey , as well as learn highest  and lowest compensation, so the SELECT statement should be like this :

SELECT count(*), max("What is your annual compensation?"),
max("What is your annual compensation?")   from survey

TIP : When a aggregate function is used without a GROUP BY clause, then aggregation function computes on the entire dataset irrespective of any column. The SELECT statement can list ONLY columns that are used in GROUP BY along with these funcitons.  

Count(*) - * in asterisk indicates all or everything in the group that is considered.

 The output would be as follows :

count max min
18 90000 0

NOTE : This SELECT statement first considers the entire dataset as one group as shown below and computes the COUNT, MAX and MIN on the entire dataset which displays the above result.

6. To learn how many of the respondents to the survey have found employment, and also know position wise average compensation, so the GROUP BY statement would like this

SELECT 
"Did you get a job after completing the course?", 
position,
AVG ("What is your annual compensation?")
FROM survey
group by "Did you get a job after completing the course?",position

TIP : You can GROUP BY multiple columns

The output would be as follows :

Did you get a job after completing the course? position avg
No NULL 0
Yes Software Engineer 76250
Yes Data Analyst 70000

A small tip : If there are three groups, then 3 records. Two groups then 2 records  .....

In this case, we have 3 groups - A.1, B.1, B.2 so three records.

NOTE : It will group together the identital data in the "Did you get a job after completing the course?" column( grouped as A and B based on Yes or No) and then group the position column (A.1, B.1 & B.2 based on Null, Data Analyst and Software Engineer) in the dataset as shown below. Based on identical values in each group (A.1, B.1 & B.2),  the average on compensation column is computed which gives the above result.  It looks as shown below:

 

NOTE : To adapt the GROUP BY clause to your use case. Replace column name and dataset name according to your dataset. Kindly adhere to the syntax to avoid errors.Don't forget to add the column name to the GROUP BY clause if needed. Include the aggregate function on the correct column for computations.

Let's practice an example of GROUP BY on Impact Cloud Dashboard

Login to Impact Cloud instance using the credentials. For this example, we will be using Impact Cloud Demo instance. 

Go to SQL Lab, Select SQL Editor as shown below:

  • Let's learn how many of the respondents to the survey have found employment, and also know position wise average compensation on Impact Cloud Dashboard.

To do this, write / copy paste the below query on the SQL Editor as shown below. Click RUN to execute the query to display the results.

SELECT 
"Did you get a job after completing the course?", 
position,AVG ("What is your annual compensation?")
FROM gc_demo.survey
group by "Did you get a job after completing the course?",position

NOTE: Impact Cloud's dataset is identified by its schema name. dataset, therefore gc_demo.survey is used instead of survey.   gc_demo is the schema where the courses dataset resides.

TO DO :

1. Execute all the above examples of GROUP BY clause with SELECT Statement to get familiar with Impact Cloud.

2. Adapt the SELECT Statement with GROUP BY clause to suit your requirement.

Congratulations !! You have learnt to use GROUP BY clause.