This lesson is in the early stages of development (Alpha version)

06 Aggregations

Overview

Teaching: 10 min
Exercises: 10 min
Questions
  • How can I summarise the data in my tables

  • How can I make sure my column names make sense?

Objectives
  • Apply aggregate functions (MIN, MAX, AVG) to group records together

  • Use the ‘group by’ clause to summarise data

  • Use built-in statistical functions to provide column summaries

  • Use the ‘having’ clause to provide selection criteria to the summary values

  • Understand the difference between the ‘where’ and the ‘having’ clauses

  • Saving a query to make a new table

Using built-in statistical functions

Aggregation allows us to combine results by greouping records based on certain values such as an average across a group of rows.

Let’s go the Farms table and find out how many farms there are. Using the wildcard * and the COUNT function to count the number of records (rows):

SELECT COUNT (*)
FROM Farms; 

We can also find out the minimum, maximum and average number of years farmed using the ‘A11_years_farm’, by writing a query such as this;

SELECT 
       MIN(A11_years_farm),
       MAX(A11_years_farm),
       AVG(A11_years_farm)
FROM Farms; 

This sort of query provides us with a general view of the values for a particular column or field across the whole table.

MIN , MAX and AVG are builtin aggregate functions in SQLite (and any other SQL database system). There are many other aggregate functions included in SQL, for example SUM .

The group by clause to summarise data

Now lets COUNT how many farms are in each different ward (in the ‘A08_ward’ column). We do this using the GROUP BY clause.

SELECT A08_ward, COUNT(*) AS How_many
FROM Farms
GROUP BY A08_ward;

GROUP BY tells SQL what field or fields we want to use to aggregate the data. If we want to group by multiple fields, we give GROUP BY a comma separated list.

For example if we want to count how many farms there are in each province, district, ward and village we can write the following SQL command.

The grouping will take place based on the order of the columns listed in the GROUP BY clause. There will be one row returned for each unique combination of the columns mentioned in the GROUP BY clause

SELECT A06_province, 
       A07_district,
       A08_ward,
       A09_village,
       COUNT(*) AS How_many
FROM Farms
GROUP BY A06_province, A07_district, A08_ward, A09_village
;

Exercise

Write a query which returns the min, max and avg number of years farmed (A11_years_farm) as well as a count of the number of records involved for the ‘A11_years_farm’ column for each village (A09_village) in the ‘Nhamatanda’ district (0A7_district) using the Farms table.

Solution

SELECT A09_village,
       MIN(A11_years_farm) AS min,
       MAX(A11_years_farm) AS max,
       AVG(A11_years_farm) AS avg,
       COUNT(*) AS how_many
FROM Farms
WHERE A07_district = 'Nhamatanda'
GROUP BY A09_village;

Notice that you can use the ‘A07_district’ column in the where clause but it doesn’t have to appear in the select clause.

The having keyword

In the previous episode, we used the keyword WHERE, allowing us to filter the results according to some criteria.

SQL offers a mechanism to filter the results based on aggregate functions, through the HAVING keyword. In a HAVING clause you can use the column alias to refer to the aggregated column.

For example, we can write a query to return the min, max and count of farms in each ward and to only return information about wards with more than 2 farms.

SELECT A08_ward,
       MIN(A11_years_farm) AS min_years,
       MAX(A11_years_farm) AS max_years,
       COUNT(*) As how_many_farms
FROM Farms
GROUP BY A08_ward
HAVING how_many_farms > 2;

You can use the AS keyword to assign an alias to a column or table, and refer to that alias in the HAVING clause. For example, in the above query, we can call the COUNT(*) by another name, like how_many_farms.

Exercise

Using the Crops table write a query which will list all of the crops (D_curr_crop) which are grown in over 100 plots.

Solution

SELECT D_curr_crop, COUNT(*) AS how_many
FROM Crops
GROUP BY D_curr_crop
HAVING how_many > 100
;

Key Points

  • Builtin functions can be used to produce a variety of summary statistics

  • The DISTINCT keyword can be used to find the unique set of values in a column or columns

  • Data in columns can be summarised by values using the GROUP BY clause

  • Summarised data can be filtered using the HAVING clause