Usage

Step 1: Configure the classification

Open the file config_classification.py. Configure the classification by setting a classification version, a description (the version comment) and a region, e.g.:

CLASSIFICATION_VERSION = 1
VERSION_COMMENT = "set sample with inspected building quality"
CLASSIFICATION_REGION = 'Bayern'

A classification version is a unique identifier for your classification and can only be used once. Use the version comment to describe the settings of that classification version. For the region, either choose one of the federal states of Germany listed as values in the following dictionary, or use 'Germany' for the entire country:

REGION_DICT = {1: 'Schleswig-Hohlstein',
               2: 'Hamburg',
               3: 'Niedersachsen',
               4: 'Bremen',
               5: 'Nordrhein-Westfalen',
               6: 'Hessen',
               7: 'Rheinland-Pfalz',
               8: 'Baden-Württemberg',
               9: 'Bayern',
               10: 'Saarland',
               11: 'Berlin',
               12: 'Brandenburg',
               13: 'Mecklenburg-Vorpommern',
               14: 'Sachsen',
               15: 'Sachsen-Anhalt',
               16: 'Thüringen',
               }

Make sure that you have the building data for the chosen region.

Once you run the classification, the constants that you have set are saved in the database table classification_version.
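Before running the classification, it can help to check that the chosen region is one the configuration accepts. The following is a minimal sketch, not part of the package: the helper is_valid_region is a hypothetical name, and the dictionary is copied from the listing above.

```python
# Region names copied from the REGION_DICT listing in this section.
REGION_DICT = {1: 'Schleswig-Hohlstein', 2: 'Hamburg', 3: 'Niedersachsen',
               4: 'Bremen', 5: 'Nordrhein-Westfalen', 6: 'Hessen',
               7: 'Rheinland-Pfalz', 8: 'Baden-Württemberg', 9: 'Bayern',
               10: 'Saarland', 11: 'Berlin', 12: 'Brandenburg',
               13: 'Mecklenburg-Vorpommern', 14: 'Sachsen',
               15: 'Sachsen-Anhalt', 16: 'Thüringen'}


def is_valid_region(region):
    """Hypothetical helper: accept 'Germany' or any federal state in REGION_DICT."""
    return region == 'Germany' or region in REGION_DICT.values()


CLASSIFICATION_REGION = 'Bayern'

if not is_valid_region(CLASSIFICATION_REGION):
    raise ValueError(f"Unknown classification region: {CLASSIFICATION_REGION!r}")
```

The check compares against the dictionary values, not the numeric keys, because the configuration expects the region name as a string.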

Step 2: Calculate the data for clustering

To generate the grids, calculate the grid parameters and filter out erroneous grids, run the file prepare_data_for_clustering.py.

Threshold values for filtering can be set in clustering.config_clustering.

The processes run here are explained in Create a Sample Set of PLZ for Clustering, Grid Generation for Classification, Calculate Parameters for Grids and Filter Grids.
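The threshold-based filtering mentioned above can be pictured roughly as follows. This is an illustrative sketch only: the THRESHOLDS dictionary, its values and the function filter_grids are assumptions made for this example, and the actual threshold names and values are those defined in clustering.config_clustering.

```python
# Hypothetical (min, max) thresholds per grid parameter; values are illustrative only.
THRESHOLDS = {
    'no_branches': (1, 50),
    'avg_trafo_dis': (0.0, 2000.0),
}


def filter_grids(grids, thresholds):
    """Keep only grids whose parameters all lie within their [min, max] range."""
    kept = []
    for grid in grids:
        if all(low <= grid[name] <= high
               for name, (low, high) in thresholds.items()):
            kept.append(grid)
    return kept


# Made-up example grids, each described by a dictionary of parameters.
grids = [
    {'grid_id': 1, 'no_branches': 4, 'avg_trafo_dis': 120.5},
    {'grid_id': 2, 'no_branches': 0, 'avg_trafo_dis': 80.0},    # no branches
    {'grid_id': 3, 'no_branches': 7, 'avg_trafo_dis': 9999.0},  # implausible distance
]

valid_grids = filter_grids(grids, THRESHOLDS)
print([grid['grid_id'] for grid in valid_grids])
```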

Step 3: Inspect sampling results (optional)

In the notebook examples_sampling.analyse_sampling_results, the samples drawn in the classification region are visualised on a map.

The notebook also shows the distribution of the samples across the RegioStaR 7 classes.

Step 4: Inspect grid generation results (optional)

The generated grids can be visualised in QGIS (all grids) or with plotting functions (individual grids). For more details, see Visualisation.

Step 5: Inspect grid parameters (optional)

In the package examples_grid_parameters, the parameters of the grids can be analysed. The notebook analyse_clustering_parameters shows a matrix of scatter plots (a pairplot) to give an overview of the data.

The grids can be sorted by a parameter to show grids with specific characteristics.
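Sorting by a parameter might look like the sketch below. It assumes the grid parameters are available as a list of dictionaries; the sample values and the helper sort_grids are hypothetical, and the notebook may work on a different data structure.

```python
# Made-up example grids; only the parameter names come from this guide.
grids = [
    {'grid_id': 1, 'no_branches': 4, 'avg_trafo_dis': 120.5},
    {'grid_id': 2, 'no_branches': 12, 'avg_trafo_dis': 310.0},
    {'grid_id': 3, 'no_branches': 7, 'avg_trafo_dis': 95.2},
]


def sort_grids(grids, parameter, descending=True):
    """Return a new list of grids ordered by the given parameter."""
    return sorted(grids, key=lambda grid: grid[parameter], reverse=descending)


# The two grids with the most branches come first.
most_branched = sort_grids(grids, 'no_branches')[:2]
print([grid['grid_id'] for grid in most_branched])
```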

The notebook vsw_analysis focuses on the ‘Verbrauchersummenwiderstand’ (resistance in the network), which can be an indicator of the voltage drop of branches.

Step 6: Choose parameters for clustering

To fast-track step 6, you can call get_parameters_for_clustering. It calculates the optimal parameters for clustering and prints them to the console. Insert them in clustering.config_clustering as explained below.

The package examples_correlation_and_factor_analysis has the tools to choose the parameters for clustering. In this work, it is proposed to choose the clustering parameters according to the factor analysis.

The notebook 1_0_factor_analysis guides you through the process of finding how many parameters, and which ones, are mathematically optimal for clustering. The resulting parameters are listed at the end of the notebook.

You do not have to use the proposed parameters for clustering; other preferences and considerations can be taken into account. Additional information, such as the explained variance of the factors or components, can be found in the notebook 1_1_explained variance_eigen_decomposition.

The correlation matrix and clustermap are plotted in the notebook 1_2_correlation_matrix.
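To illustrate what a correlation matrix of grid parameters contains, here is a small standard-library sketch that computes Pearson correlations between parameter columns. The sample values are invented and the functions pearson and correlation_matrix are hypothetical helpers; the notebook itself may rely on other tooling.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long value lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Nested dict: matrix[a][b] is the correlation of parameters a and b."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names}
            for a in names}


# Invented values for three grid parameters over four grids.
columns = {
    'no_branches': [2, 4, 6, 8],
    'max_no_of_households_of_a_branch': [1, 2, 3, 4],
    'avg_trafo_dis': [8, 6, 4, 2],
}

matrix = correlation_matrix(columns)
for name, row in matrix.items():
    print(name, [round(value, 2) for value in row.values()])
```

Parameters with a correlation close to 1 or -1 carry largely redundant information, which is one consideration when deciding which of them to keep for clustering.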

After you have chosen the parameters, set them in clustering.config_clustering, for example:

# set clustering parameters
param1 = 'no_branches'
param2 = 'avg_trafo_dis'
param3 = 'max_no_of_households_of_a_branch'
param4 = 'no_house_connections_per_branch'
LIST_OF_CLUSTERING_PARAMETERS = [param1, param2, param3, param4]
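A quick sanity check of the parameter list can catch typos before clustering. The sketch below is an assumption-laden example: AVAILABLE_PARAMETERS is a hypothetical set standing in for the parameter names actually calculated for your grids, and check_clustering_parameters is not a function of the package.

```python
# Hypothetical set of parameter names that were calculated for the grids.
AVAILABLE_PARAMETERS = {
    'no_branches',
    'avg_trafo_dis',
    'max_no_of_households_of_a_branch',
    'no_house_connections_per_branch',
}

LIST_OF_CLUSTERING_PARAMETERS = [
    'no_branches',
    'avg_trafo_dis',
    'max_no_of_households_of_a_branch',
    'no_house_connections_per_branch',
]


def check_clustering_parameters(parameters, available):
    """Return (duplicates, unknown): names listed twice and names not available."""
    duplicates = sorted({p for p in parameters if parameters.count(p) > 1})
    unknown = sorted(p for p in set(parameters) if p not in available)
    return duplicates, unknown


duplicates, unknown = check_clustering_parameters(
    LIST_OF_CLUSTERING_PARAMETERS, AVAILABLE_PARAMETERS)
if duplicates or unknown:
    raise ValueError(f"duplicates: {duplicates}, unknown: {unknown}")
```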

Step 7: Choose number of clusters

To fast-track step 7, you can call get_no_clusters_for_clustering. It calculates the optimal number of clusters and prints the result to the console. Insert the values in clustering.config_clustering as explained below.

In the package examples_indices you will find two indices for finding the optimal number of clusters:

  • Calinski-Harabasz Index or CH Index and

  • Davies-Bouldin Index or DB Index

It is recommended to choose the number of clusters with the CH index from the notebook 1_CH_index. The DB Index can be used for reference.
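For orientation, the CH index relates the dispersion between clusters to the dispersion within clusters; higher values indicate better separated clusters. The following pure-Python sketch implements the standard definition on a toy data set. It is an illustration, not the code used in the notebook; scikit-learn offers an equivalent function, calinski_harabasz_score.

```python
def calinski_harabasz(points, labels):
    """CH index: (between-cluster dispersion / (k - 1)) / (within-cluster dispersion / (n - k))."""
    n = len(points)
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    k = len(clusters)
    if k < 2 or k >= n:
        raise ValueError("need at least 2 clusters and fewer clusters than points")

    def centroid(pts):
        return [sum(coords) / len(pts) for coords in zip(*pts)]

    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    overall = centroid(points)
    between = 0.0
    within = 0.0
    for pts in clusters.values():
        centre = centroid(pts)
        between += len(pts) * squared_distance(centre, overall)
        within += sum(squared_distance(p, centre) for p in pts)
    # Note: within is zero if every point coincides with its cluster centre.
    return (between / (k - 1)) / (within / (n - k))


# Toy data: two tight groups far apart on the x axis.
points = [(0, 0), (0, 2), (10, 0), (10, 2)]
good_split = calinski_harabasz(points, [0, 0, 1, 1])
poor_split = calinski_harabasz(points, [0, 1, 0, 1])
print(good_split, poor_split)
```

In practice the index is evaluated for a range of cluster numbers, and the number with the highest value is a candidate for the configuration.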

Again, depending on the goals of your clustering and guided by the index results, set the number of clusters for each clustering algorithm in clustering.config_clustering:

# set number of clusters
N_CLUSTERS_KMEDOID = 5
N_CLUSTERS_KMEANS = 5
N_CLUSTERS_GMM = 4  # refers to gmm tied

Step 8: Clustering results

You now have the option to investigate the results in examples_clustering. For each of the clustering algorithms

  • kmeans,

  • kmedoids and

  • gmm tied

there are two notebooks. In the first one, the representative grids are presented. Their clustering parameters can be compared with the radar plot. The representative grids are plotted individually.

The second notebook is more concerned with the overall clusters and plots the distribution of the clusters over the RegioStaR classes.
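The distribution of clusters over RegioStaR classes is essentially a cross-tabulation. A minimal sketch with the standard library is shown below; the (class, cluster) pairs are invented, the class codes are illustrative, and cluster_distribution is a hypothetical helper.

```python
from collections import Counter

# Invented (regiostar_class, cluster_id) pairs, one per grid.
assignments = [(71, 0), (71, 0), (73, 1), (77, 2), (77, 2), (73, 0)]


def cluster_distribution(assignments):
    """Count how many grids of each RegioStaR class fall into each cluster."""
    counts = Counter(assignments)
    table = {}
    for (regiostar_class, cluster_id), count in counts.items():
        table.setdefault(regiostar_class, {})[cluster_id] = count
    return table


table = cluster_distribution(assignments)
for regiostar_class in sorted(table):
    print(regiostar_class, table[regiostar_class])
```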

To view the clustering results in QGIS, run apply_clustering_for_QGIS_visualisation and open QGIS. There you can identify the clusters of the grids by colour.

More details about the clustering functions can be found in Cluster the Grids.