1. Knowledge Hub
  2. Data Manipulation
  3. Multiple columns/Full table manipulations

Combine data from two or more sources or datasets - Union

In this article we will walk you through the techniques needed to combine data from two or more datasets. We will explicitly look at the UNION available in SQL to achieve the purpose.

Problem

You have data in two or more sources/spreadsheets, and the analysis you're attempting requires data from the two sources to be combined in some way.

Before going through this article we strongly recommend that you go through the following articles

1. Explaining SELECT Statement

2. Topics on Data Collection

Consider an example to better understand the problem statement:

To create a visualization as shown below :

This will help us to compare the Results achieved against the Targets for Number of people trained, and Number of people who got jobs  

TR1-1

For this scenario: Data residing in two datasets from different sources  are considered. The data files are listed below.

Targets_Baseline   Source - Excel File

This file stores all the information related to Targets and Baseline for each project

Results_Survey - Source - Sopact Survey

This file stores the information collected at specific intervals from the portfolios against specific metrics. 

A sneak peeks at the  columns, and the data of these datasets are referred to in the forthcoming section.

Targets_Baseline

Project Rep_Type Period Funding_amt Number_people_trained Number_people_got_jobs Income_generated Hours_training
Girls Code Target 3/31/2023 70000 1000 700 175000 200
Girls Code Baseline 4/1/2022 0 0 0 0 0

Results_Survey

Project Name Reporting Date Funding Executed Funding executed - Currency Number of people trained Number of people placed Hours of training Income Generated Income Generated - Currency Learnings Rep_Type
Girls Code 6/30/2022 2000 GBP 100 34 20 4500 GBP This the first quarter so low training Result
Girls Code 9/30/2022 15000 GBP 200 120 50 30000 GBP Second quarter learnings Result
Girls Code 12/31/2022 17000 GBP 240 170 50 42500 GBP Third quarter learnings Result

Problem statement for this use case :   We need to compare the Total Number of people trained and got jobs stored in dataset "Result_survey" against the Targets which resides in different dataset "Target_Baseline"?

Solution

To accomplish this, we need to combine data that reside in different datasets into one dataset and then compare the column ( These datasets can be from different sources: surveys, excel file or google sheets too).

 The solution can be achieved using two steps, as shown below :

Step 1. Combine Targets_Baseline and Results_Survey datasets as one dataset

Step 2.  Compare the  Number of people trained and Number of people who got jobs columns based on the Type.

The result of the output of Step 1 is as shown below : ( After combining both the datasets)

Project Period Number_people_trained Number_people_got_jobs Rep_Type
Girls Code 3/31/2023 1000 700 Target
Girls Code 4/1/2022 0 0 Baseline
Girls Code 6/30/2022 100 34 Result
Girls Code 9/30/2022 200 120 Result
Girls Code 12/31/2022 240 170 Result

Important Question: “How do we combine the data that resides in different datasets, "Targets_Baseline" and "Results_Survey" to  get the desired output as shown above”?

Purpose

This is precisely what the UNION ALL operator does. It provides instructions to combine results of both the datasets/tables ( Targets_Baseline and Results_Survey) and get desired columns into one dataset. In simple terms, it combines the two or more datasets into one dataset. Such datasets are called Virtual datasets.  Diagrammatically, it can be represented as follows :

Step 1 : Fetch specific column information from first dataset i.e.Targets_Baseline

The SELECT statement should look like as shown below

 SELECT "Project","Period",  "Number_people_trained",
"Number_people_got_jobs", "Rep_Type"
FROM "Targets_Baseline"

Note : Include column names "Project", "Period", "Number_people_trained", "Number_people_got_jobs", "Rep_Type"

Step 2 : Fetch specific column  information from second dataset i.e.Results_Survey 

The SELECT statement should look like as shown below

SELECT "Project Name","Reporting Date" , "Number of people trained",
"Number of people placed", "Rep_Type"
FROM "Results_survey"

Note : Include column names "Project Name", "Reporting Date", "Number of people trained", "Number of people placed", "Rep_Type"

The Final query to combine both the datasets using UNION ALL should be as follows :

-- Query from Step1

SELECT "Project","Period",  "Number_people_trained",
"Number_people_got_jobs", "Rep_Type"
FROM "Targets_Baseline" 

UNION ALL

-- Query from Step2

SELECT "Project Name","Reporting Date","Number of people trained",
"Number of people placed", "Rep_Type"
FROM "Results_survey"

 

TIP : Only select relevant columns from the dataset 

 

THE Syntax for UNION ALL SQL Query 

SELECT column1, column2, ... column_n

FROM schema_name.table_name

UNION ALL

SELECT column1, column2, ... column_n

FROM schema_name.table_name

UNION ALL

SELECT column1, column2, ... column_n

FROM schema_name.table_name

And so on . . .


NOTE : In scenarios similar to above, Just replace column name1, column name2, column name3,.... with column names  and with table names according to your datasets keeping an eye on the syntax . At the same time be mindful of points to remember stated below related to number of columns, datatype and order of column.


POINTS TO REMEMBER WHILE USING UNION:

  1. Each table used within UNION must have the same number of columns.

    1. Example : The first table "Targets_Baseline" has 5 columns in the
      SELECT statment. The second table "Results_survey" also
      should have 5 columns in the SELECT Statment

  2. The columns must have same data types. 

  3. The columns in each table must be in the same order.

Look at the example

Column number

& order

 

 

Targets_Baseline
Column names

 

 

Data type

 

Result_survey

Column names

1

Project

Text

Project Name

2

Period

Date

Reporting Date

3

Number_people_trained

Number

Number of people trained

4

Number_people_got_jobs

Number

"Number of people placed

5

Rep_Type

Text

Rep_Type

SO LETS GET STARTED to combine the datasets as one dataset Using UNION :

Objective: Combine Targets_Baseline and Results_Survey datasets as one dataset

To accomplish this, follow STEPS A to E

STEP A:

Go to SQL Lab as shown below : ( We will be using the two datasets described above that are highlighted. You can see all the datasets from Datasets menu )

r1-1

 

Step B:   Navigate to SQL Editor from the Menu bar, Select the appropriate data and the schema as shown below.  This information can be obtained and is highlighted in Red as shown above.

r2

 

Step C:  Now copy and paste the following query in the Right side space as shown below :

SELECT "Project","Period",  "Number_people_trained", "Number_people_got_jobs", "Rep_Type"
FROM target_result_test."Targets_Baseline" 
UNION ALL
SELECT "Project Name","Reporting Date" , "Number of people trained", "Number of people placed", "Rep_Type"
FROM target_result_test."Results_survey"

TIP: Please replace column names and table names according to your datasets, adhering to the syntax but being mindful of the datatype and order of the columns added.

 

r7

 

Step D : Click on the RUN button and execute the query and get the desired output. The results circled below is the combined results from both the tables. 

r4-1

NOTE:  Above circled results must be saved to create visualization, which is a combination of data from Targets_Baseline  and Result_Survey for selected columns.

 

STEP E: To save these results into a new dataset, click on CREATE CHART. It will open up a dialog box, as shown below. Give it an appropriate name. Let's name it "Combined_datasets"

r5

 

STEP F: Finally, "Combined_datasets" is created, shown below, a virtual dataset. This dataset has a combination of data from both the above-mentioned datasets.

r8

Congratulations! Using the UNION ALL operator, you learnt how to join two tables to generate a virtual dataset.

Now, this dataset is ready to create the bar chart visualization for comparing Results against the target for Number of people trained, and a number of people who got jobs.

Follow the steps mentioned in the article Creating a visualization - Bar Chart.

Snapshot of steps required for creating a Bar Chart for comparing Results / Targets on metrics: Number of People trained, Number of people got jobs. 

1. Select the dataset "Combined_datasets"

2. Select Bar Chart from the list of visualization

3. Metrics - Select Number of People trained and Number of people got jobs. Use SUM function

4. Dimensions - Select Rep_Type 

5. Create Chart

r6