BIRCH_CLUSTER

BIRCH incrementally compresses samples into subclusters using a tree structure and then performs a final clustering step on those subcluster summaries.

It efficiently summarizes clusters using a Clustering Feature (CF), defined as a tuple:

CF = (N, \vec{LS}, SS)

where N is the number of data points in the cluster, \vec{LS} = \sum_{i=1}^{N} \vec{x}_i is the linear sum of the points, and SS = \sum_{i=1}^{N} ||\vec{x}_i||^2 is the sum of their squared norms.
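The CF definition above can be illustrated with a short sketch in plain Python. The helper names (`clustering_feature`, `merge`, `centroid`) are hypothetical and are not part of the wrapper or of scikit-learn. The sketch shows two properties that make the summary useful: CFs of disjoint groups add component-wise, and the centroid can be recovered from the summary as \vec{LS} / N without revisiting the points.

```python
from typing import List, Sequence, Tuple

# Hypothetical representation: a CF is a plain (N, LS, SS) tuple.
CF = Tuple[int, List[float], float]


def clustering_feature(points: Sequence[Sequence[float]]) -> CF:
    """Build CF = (N, LS, SS) from a non-empty list of points."""
    n = len(points)
    dims = len(points[0])
    linear_sum = [sum(p[d] for p in points) for d in range(dims)]
    squared_sum = sum(x * x for p in points for x in p)
    return n, linear_sum, squared_sum


def merge(a: CF, b: CF) -> CF:
    """CFs are additive: merging two groups adds their components."""
    return a[0] + b[0], [x + y for x, y in zip(a[1], b[1])], a[2] + b[2]


def centroid(cf: CF) -> List[float]:
    """Recover the centroid from the summary alone as LS / N."""
    n, linear_sum, _ = cf
    return [x / n for x in linear_sum]


left = clustering_feature([[0.0, 0.0], [0.0, 1.0]])
right = clustering_feature([[1.0, 0.0]])
combined = merge(left, right)

print(combined)            # same CF as summarizing all three points at once
print(centroid(combined))
```

Because merging only adds three quantities, a tree node can absorb new samples or combine subclusters without storing the original points.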

This wrapper keeps label computation enabled and uses an integer final cluster count so the fitted result reliably includes labels, label counts, subcluster centers, and subcluster labels.
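To see what those fitted attributes look like outside the wrapper, here is a minimal sketch that calls scikit-learn's `Birch` directly with the same two choices: `compute_labels=True` and an integer `n_clusters`. The sample data is illustrative. In scikit-learn, `n_clusters` may also be `None` or another clustering estimator, which this wrapper does not expose.

```python
import numpy as np
from sklearn.cluster import Birch

# Three well-separated groups of three points each (illustrative data).
X = np.array([
    [1.0, 1.0], [1.2, 0.8], [0.8, 1.1],
    [8.0, 8.0], [8.2, 7.9], [7.8, 8.1],
    [15.0, 1.0], [15.2, 0.8], [14.8, 1.1],
])

model = Birch(
    threshold=0.4,
    branching_factor=20,
    n_clusters=3,
    compute_labels=True,
).fit(X)

print(model.labels_)              # one final cluster label per input row
print(model.subcluster_centers_)  # centroid of each leaf subcluster
print(model.subcluster_labels_)   # final cluster label of each subcluster
```

The label numbers themselves are arbitrary identifiers; only the grouping they induce is meaningful.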

Excel Usage

=BIRCH_CLUSTER(data, threshold, branching_factor, n_clusters)
  • data (list[list], required): 2D array of input data with rows as samples and columns as features.
  • threshold (float, optional, default: 0.5): Maximum radius a subcluster may have after absorbing a new sample; a sample that would exceed it starts a new subcluster. Must be greater than 0.
  • branching_factor (int, optional, default: 50): Maximum number of subclusters per node in the clustering tree. Must be at least 2.
  • n_clusters (int, optional, default: 3): Number of final clusters formed from the learned subclusters. Must be between 1 and the number of samples.
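The role of threshold can be sketched with the textbook BIRCH radius, computed from the (N, LS, SS) summary as the root-mean-square distance of the points from their centroid. The helper names below are hypothetical, and the sketch tests a single candidate subcluster only; it does not reproduce scikit-learn's tree traversal. It assumes a merged radius exactly equal to the threshold still counts as a fit, which is consistent with the expected output of Example 1.

```python
import math
from typing import List, Sequence, Tuple

CF = Tuple[int, List[float], float]  # (N, LS, SS)


def add_point(cf: CF, point: Sequence[float]) -> CF:
    """Return the CF that results from adding one point to a subcluster."""
    n, linear_sum, squared_sum = cf
    return (
        n + 1,
        [s + x for s, x in zip(linear_sum, point)],
        squared_sum + sum(x * x for x in point),
    )


def radius(cf: CF) -> float:
    """Root-mean-square distance of the summarized points from their centroid."""
    n, linear_sum, squared_sum = cf
    centroid_sq_norm = sum((s / n) ** 2 for s in linear_sum)
    # max() guards against tiny negative values caused by rounding.
    return math.sqrt(max(squared_sum / n - centroid_sq_norm, 0.0))


def can_absorb(cf: CF, point: Sequence[float], threshold: float) -> bool:
    """Assumption: a merged radius equal to the threshold still counts as a fit."""
    return radius(add_point(cf, point)) <= threshold


single = (1, [0.0, 0.0], 0.0)               # subcluster holding only (0, 0)
print(can_absorb(single, [0.0, 1.0], 0.5))  # merged radius is 0.5
pair = add_point(single, [0.0, 1.0])
print(can_absorb(pair, [1.0, 0.0], 0.5))    # merged radius is about 0.667
```

A smaller threshold therefore produces more, tighter subclusters, while a larger one lets each subcluster absorb more samples.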

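Returns (dict): Excel data type containing the final cluster count, per-sample labels, label counts, and subcluster centers and labels. If the input is invalid or fitting fails, a string starting with "Error:" is returned instead.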
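When the result is consumed from Python rather than Excel, the nested structure can be flattened into plain values. The following is a minimal sketch; `unwrap` is a hypothetical helper, not part of the wrapper, and the sample dictionary is hand-written to mirror the shape of a single-sample result.

```python
from typing import Any


def unwrap(entity: Any) -> Any:
    """Recursively convert the nested result into plain Python values."""
    if isinstance(entity, list):
        return [unwrap(item) for item in entity]
    if not isinstance(entity, dict):
        return entity
    if "properties" in entity:
        return {name: unwrap(value) for name, value in entity["properties"].items()}
    if entity.get("type") == "Array":
        return unwrap(entity["elements"])
    return entity.get("basicValue")


# Hand-written result for a single sample, shaped like the wrapper's output.
result = {
    "type": "Double",
    "basicValue": 1.0,
    "properties": {
        "cluster_count": {"type": "Double", "basicValue": 1.0},
        "subcluster_count": {"type": "Double", "basicValue": 1.0},
        "labels": {"type": "Array", "elements": [[{"type": "Double", "basicValue": 0.0}]]},
        "label_counts": {"type": "Array", "elements": [
            [{"type": "String", "basicValue": "label"}, {"type": "String", "basicValue": "count"}],
            [{"type": "Double", "basicValue": 0.0}, {"type": "Double", "basicValue": 1.0}],
        ]},
        "subcluster_centers": {"type": "Array", "elements": [[{"type": "Double", "basicValue": 5.0}]]},
        "subcluster_labels": {"type": "Array", "elements": [[{"type": "Double", "basicValue": 0.0}]]},
    },
}

plain = unwrap(result)
print(plain["cluster_count"])
print(plain["label_counts"])
```

Since errors come back as plain strings, a caller would typically check `isinstance(result, str)` before unwrapping.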

Example 1: Cluster two compact groups with the default threshold and branching factor

Inputs:

data      threshold  branching_factor  n_clusters
0    0    0.5        50                2
0    1
1    0
5    5
5    6
6    5

Excel formula:

=BIRCH_CLUSTER({0,0;0,1;1,0;5,5;5,6;6,5}, 0.5, 50, 2)

Expected output:

{"type":"Double","basicValue":2,"properties":{"cluster_count":{"type":"Double","basicValue":2},"subcluster_count":{"type":"Double","basicValue":4},"labels":{"type":"Array","elements":[[{"type":"Double","basicValue":1}],[{"type":"Double","basicValue":1}],[{"type":"Double","basicValue":1}],[{"type":"Double","basicValue":0}],[{"type":"Double","basicValue":0}],[{"type":"Double","basicValue":0}]]},"label_counts":{"type":"Array","elements":[[{"type":"String","basicValue":"label"},{"type":"String","basicValue":"count"}],[{"type":"Double","basicValue":0},{"type":"Double","basicValue":3}],[{"type":"Double","basicValue":1},{"type":"Double","basicValue":3}]]},"subcluster_centers":{"type":"Array","elements":[[{"type":"Double","basicValue":0},{"type":"Double","basicValue":0.5}],[{"type":"Double","basicValue":1},{"type":"Double","basicValue":0}],[{"type":"Double","basicValue":5},{"type":"Double","basicValue":5.5}],[{"type":"Double","basicValue":6},{"type":"Double","basicValue":5}]]},"subcluster_labels":{"type":"Array","elements":[[{"type":"Double","basicValue":1}],[{"type":"Double","basicValue":1}],[{"type":"Double","basicValue":0}],[{"type":"Double","basicValue":0}]]}}}

Example 2: Summarize three compact groups into final clusters

Inputs:

data          threshold  branching_factor  n_clusters
1     1       0.4        20                3
1.2   0.8
0.8   1.1
8     8
8.2   7.9
7.8   8.1
15    1
15.2  0.8
14.8  1.1

Excel formula:

=BIRCH_CLUSTER({1,1;1.2,0.8;0.8,1.1;8,8;8.2,7.9;7.8,8.1;15,1;15.2,0.8;14.8,1.1}, 0.4, 20, 3)

Expected output:

{"type":"Double","basicValue":3,"properties":{"cluster_count":{"type":"Double","basicValue":3},"subcluster_count":{"type":"Double","basicValue":3},"labels":{"type":"Array","elements":[[{"type":"Double","basicValue":2}],[{"type":"Double","basicValue":2}],[{"type":"Double","basicValue":2}],[{"type":"Double","basicValue":1}],[{"type":"Double","basicValue":1}],[{"type":"Double","basicValue":1}],[{"type":"Double","basicValue":0}],[{"type":"Double","basicValue":0}],[{"type":"Double","basicValue":0}]]},"label_counts":{"type":"Array","elements":[[{"type":"String","basicValue":"label"},{"type":"String","basicValue":"count"}],[{"type":"Double","basicValue":0},{"type":"Double","basicValue":3}],[{"type":"Double","basicValue":1},{"type":"Double","basicValue":3}],[{"type":"Double","basicValue":2},{"type":"Double","basicValue":3}]]},"subcluster_centers":{"type":"Array","elements":[[{"type":"Double","basicValue":1},{"type":"Double","basicValue":0.966667}],[{"type":"Double","basicValue":8},{"type":"Double","basicValue":8}],[{"type":"Double","basicValue":15},{"type":"Double","basicValue":0.966667}]]},"subcluster_labels":{"type":"Array","elements":[[{"type":"Double","basicValue":2}],[{"type":"Double","basicValue":1}],[{"type":"Double","basicValue":0}]]}}}

Example 3: Cluster a single Excel cell into one BIRCH cluster

Inputs:

data  threshold  branching_factor  n_clusters
5     0.5        10                1

Excel formula:

=BIRCH_CLUSTER(5, 0.5, 10, 1)

Expected output:

{"type":"Double","basicValue":1,"properties":{"cluster_count":{"type":"Double","basicValue":1},"subcluster_count":{"type":"Double","basicValue":1},"labels":{"type":"Array","elements":[[{"type":"Double","basicValue":0}]]},"label_counts":{"type":"Array","elements":[[{"type":"String","basicValue":"label"},{"type":"String","basicValue":"count"}],[{"type":"Double","basicValue":0},{"type":"Double","basicValue":1}]]},"subcluster_centers":{"type":"Array","elements":[[{"type":"Double","basicValue":5}]]},"subcluster_labels":{"type":"Array","elements":[[{"type":"Double","basicValue":0}]]}}}

Example 4: Build BIRCH clusters for one-dimensional samples

Inputs:

data  threshold  branching_factor  n_clusters
0     0.2        10                2
0.1
0.2
4.8
5
5.2

Excel formula:

=BIRCH_CLUSTER({0;0.1;0.2;4.8;5;5.2}, 0.2, 10, 2)

Expected output:

{"type":"Double","basicValue":2,"properties":{"cluster_count":{"type":"Double","basicValue":2},"subcluster_count":{"type":"Double","basicValue":2},"labels":{"type":"Array","elements":[[{"type":"Double","basicValue":1}],[{"type":"Double","basicValue":1}],[{"type":"Double","basicValue":1}],[{"type":"Double","basicValue":0}],[{"type":"Double","basicValue":0}],[{"type":"Double","basicValue":0}]]},"label_counts":{"type":"Array","elements":[[{"type":"String","basicValue":"label"},{"type":"String","basicValue":"count"}],[{"type":"Double","basicValue":0},{"type":"Double","basicValue":3}],[{"type":"Double","basicValue":1},{"type":"Double","basicValue":3}]]},"subcluster_centers":{"type":"Array","elements":[[{"type":"Double","basicValue":0.1}],[{"type":"Double","basicValue":5}]]},"subcluster_labels":{"type":"Array","elements":[[{"type":"Double","basicValue":1}],[{"type":"Double","basicValue":0}]]}}}

Python Code

import numpy as np
from sklearn.cluster import Birch as SklearnBirch

def birch_cluster(data, threshold=0.5, branching_factor=50, n_clusters=3):
    """
    Cluster samples with the BIRCH incremental clustering algorithm.

    See: https://scikit-learn.org/stable/modules/generated/sklearn.cluster.Birch.html

    This example function is provided as-is without any representation of accuracy.

    Args:
        data (list[list]): 2D array of input data with rows as samples and columns as features.
        threshold (float, optional): Maximum radius allowed when absorbing a sample into an existing subcluster. Default is 0.5.
        branching_factor (int, optional): Maximum number of subclusters per node in the clustering tree. Default is 50.
        n_clusters (int, optional): Number of final clusters formed from the learned subclusters. Default is 3.

    Returns:
        dict: Excel data type containing final cluster counts, labels, label counts, and subcluster summaries.
    """
    def to2d(value):
        return [[value]] if not isinstance(value, list) else value

    def parse_matrix(value):
        value = to2d(value)
        if not isinstance(value, list) or not value or not all(isinstance(row, list) and row for row in value):
            return None, "Error: data must be a non-empty 2D list"
        if len({len(row) for row in value}) != 1:
            return None, "Error: data must be a rectangular 2D list"
        matrix = np.array(value, dtype=float)
        if matrix.ndim != 2 or matrix.size == 0:
            return None, "Error: data must be a non-empty 2D list"
        if not np.isfinite(matrix).all():
            return None, "Error: data must contain only finite numeric values"
        return matrix, None

    def as_column(values):
        return [[{"type": "Double", "basicValue": float(item)}] for item in values]

    def as_matrix(values):
        return [[{"type": "Double", "basicValue": float(item)} for item in row] for row in values]

    def label_count_table(labels):
        unique_labels, counts = np.unique(labels, return_counts=True)
        rows = [[{"type": "String", "basicValue": "label"}, {"type": "String", "basicValue": "count"}]]
        rows.extend(
            [[{"type": "Double", "basicValue": float(label)}, {"type": "Double", "basicValue": float(count)}]
             for label, count in zip(unique_labels.tolist(), counts.tolist())]
        )
        return rows

    try:
        data_np, error = parse_matrix(data)
        if error:
            return error

        if float(threshold) <= 0:
            return "Error: threshold must be greater than 0"
        if int(branching_factor) < 2:
            return "Error: branching_factor must be at least 2"

        cluster_total = int(n_clusters)
        if cluster_total < 1:
            return "Error: n_clusters must be at least 1"
        if cluster_total > data_np.shape[0]:
            return "Error: n_clusters cannot exceed the number of samples"

        # A single sample is its own cluster, so build the result directly instead of fitting a model.
        if data_np.shape[0] == 1 and cluster_total == 1:
            labels = np.array([0])
            return {
                "type": "Double",
                "basicValue": 1.0,
                "properties": {
                    "cluster_count": {"type": "Double", "basicValue": 1.0},
                    "subcluster_count": {"type": "Double", "basicValue": 1.0},
                    "labels": {"type": "Array", "elements": as_column(labels.tolist())},
                    "label_counts": {"type": "Array", "elements": label_count_table(labels)},
                    "subcluster_centers": {"type": "Array", "elements": as_matrix(data_np.tolist())},
                    "subcluster_labels": {"type": "Array", "elements": as_column(labels.tolist())}
                }
            }

        fitted = SklearnBirch(
            threshold=float(threshold),
            branching_factor=int(branching_factor),
            n_clusters=cluster_total,
            compute_labels=True
        ).fit(data_np)

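        labels = fitted.labels_
        # Count the labels actually assigned; this can be lower than n_clusters
        # when BIRCH finds fewer subclusters than the requested cluster count.
        cluster_count = int(np.unique(labels).size)
        subcluster_total = int(len(fitted.subcluster_centers_))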

        return {
            "type": "Double",
            "basicValue": float(cluster_count),
            "properties": {
                "cluster_count": {"type": "Double", "basicValue": float(cluster_count)},
                "subcluster_count": {"type": "Double", "basicValue": float(subcluster_total)},
                "labels": {"type": "Array", "elements": as_column(labels.tolist())},
                "label_counts": {"type": "Array", "elements": label_count_table(labels)},
                "subcluster_centers": {"type": "Array", "elements": as_matrix(fitted.subcluster_centers_.tolist())},
                "subcluster_labels": {"type": "Array", "elements": as_column(fitted.subcluster_labels_.tolist())}
            }
        }
    except Exception as e:
        return f"Error: {str(e)}"
