MINIBATCH_KMEANS

Mini-batch k-means is a faster approximation of standard k-means that updates cluster centers using small random batches instead of the full dataset on every iteration.

Like standard k-means, it aims to minimize the within-cluster sum of squares objective function:

J = \sum_{j=1}^{k} \sum_{x_i \in S_j} ||x_i - \mu_j||^2

where k is the number of clusters, S_j is the j-th cluster, and \mu_j is the centroid of S_j.

This wrapper keeps label assignment enabled so the fitted result always includes labels, a compact label count table, cluster centers, and inertia for Excel-friendly inspection.

Excel Usage

=MINIBATCH_KMEANS(data, n_clusters, batch_size, max_iter, max_no_improvement, random_state)
  • data (list[list], required): 2D array of input data with rows as samples and columns as features.
  • n_clusters (int, optional, default: 8): Number of clusters to fit.
  • batch_size (int, optional, default: 1024): Number of samples processed in each mini-batch.
  • max_iter (int, optional, default: 100): Maximum number of full passes over the data.
  • max_no_improvement (int, optional, default: 10): Consecutive mini-batches without improvement before early stopping.
  • random_state (int, optional, default: null): Integer seed for reproducible centroid initialization. Leave blank for non-deterministic runs.

Returns (dict): Excel data type containing inertia, labels, label counts, cluster centers, and iteration metadata.

Example 1: Fit two well-separated clusters with one batch pass

Inputs:

data n_clusters batch_size max_iter max_no_improvement random_state
0 0 2 6 100 10 0
0 1
1 0
5 5
5 6
6 5

Excel formula:

=MINIBATCH_KMEANS({0,0;0,1;1,0;5,5;5,6;6,5}, 2, 6, 100, 10, 0)

Expected output:

{"type":"Double","basicValue":2.73085,"properties":{"inertia":{"type":"Double","basicValue":2.73085},"cluster_count":{"type":"Double","basicValue":2},"labels":{"type":"Array","elements":[[{"type":"Double","basicValue":1}],[{"type":"Double","basicValue":1}],[{"type":"Double","basicValue":1}],[{"type":"Double","basicValue":0}],[{"type":"Double","basicValue":0}],[{"type":"Double","basicValue":0}]]},"label_counts":{"type":"Array","elements":[[{"type":"String","basicValue":"label"},{"type":"String","basicValue":"count"}],[{"type":"Double","basicValue":0},{"type":"Double","basicValue":3}],[{"type":"Double","basicValue":1},{"type":"Double","basicValue":3}]]},"cluster_centers":{"type":"Array","elements":[[{"type":"Double","basicValue":5.27273},{"type":"Double","basicValue":5.36364}],[{"type":"Double","basicValue":0.217391},{"type":"Double","basicValue":0.391304}]]},"n_iter":{"type":"Double","basicValue":15},"n_steps":{"type":"Double","basicValue":15}}}

Example 2: Fit three compact groups with smaller mini-batches

Inputs:

data n_clusters batch_size max_iter max_no_improvement random_state
1 1 3 4 120 8 5
1.2 0.8
0.8 1.1
8 8
8.2 7.9
7.8 8.1
15 1
15.2 0.8
14.8 1.1

Excel formula:

=MINIBATCH_KMEANS({1,1;1.2,0.8;0.8,1.1;8,8;8.2,7.9;7.8,8.1;15,1;15.2,0.8;14.8,1.1}, 3, 4, 120, 8, 5)

Expected output:

{"type":"Double","basicValue":0.364608,"properties":{"inertia":{"type":"Double","basicValue":0.364608},"cluster_count":{"type":"Double","basicValue":3},"labels":{"type":"Array","elements":[[{"type":"Double","basicValue":0}],[{"type":"Double","basicValue":0}],[{"type":"Double","basicValue":0}],[{"type":"Double","basicValue":1}],[{"type":"Double","basicValue":1}],[{"type":"Double","basicValue":1}],[{"type":"Double","basicValue":2}],[{"type":"Double","basicValue":2}],[{"type":"Double","basicValue":2}]]},"label_counts":{"type":"Array","elements":[[{"type":"String","basicValue":"label"},{"type":"String","basicValue":"count"}],[{"type":"Double","basicValue":0},{"type":"Double","basicValue":3}],[{"type":"Double","basicValue":1},{"type":"Double","basicValue":3}],[{"type":"Double","basicValue":2},{"type":"Double","basicValue":3}]]},"cluster_centers":{"type":"Array","elements":[[{"type":"Double","basicValue":0.978947},{"type":"Double","basicValue":0.989474}],[{"type":"Double","basicValue":8},{"type":"Double","basicValue":8}],[{"type":"Double","basicValue":15.0429},{"type":"Double","basicValue":0.935714}]]},"n_iter":{"type":"Double","basicValue":6},"n_steps":{"type":"Double","basicValue":12}}}

Example 3: Cluster a single Excel cell with one centroid

Inputs:

data n_clusters batch_size max_iter max_no_improvement random_state
5 1 1 20 5 0

Excel formula:

=MINIBATCH_KMEANS(5, 1, 1, 20, 5, 0)

Expected output:

{"type":"Double","basicValue":0,"properties":{"inertia":{"type":"Double","basicValue":0},"cluster_count":{"type":"Double","basicValue":1},"labels":{"type":"Array","elements":[[{"type":"Double","basicValue":0}]]},"label_counts":{"type":"Array","elements":[[{"type":"String","basicValue":"label"},{"type":"String","basicValue":"count"}],[{"type":"Double","basicValue":0},{"type":"Double","basicValue":1}]]},"cluster_centers":{"type":"Array","elements":[[{"type":"Double","basicValue":5}]]},"n_iter":{"type":"Double","basicValue":7},"n_steps":{"type":"Double","basicValue":7}}}

Example 4: Cluster one-dimensional samples into two minibatch groups

Inputs:

data n_clusters batch_size max_iter max_no_improvement random_state
0 2 3 80 6 3
0.2
0.4
4.8
5
5.2

Excel formula:

=MINIBATCH_KMEANS({0;0.2;0.4;4.8;5;5.2}, 2, 3, 80, 6, 3)

Expected output:

{"type":"Double","basicValue":0.542818,"properties":{"inertia":{"type":"Double","basicValue":0.542818},"cluster_count":{"type":"Double","basicValue":2},"labels":{"type":"Array","elements":[[{"type":"Double","basicValue":0}],[{"type":"Double","basicValue":0}],[{"type":"Double","basicValue":0}],[{"type":"Double","basicValue":1}],[{"type":"Double","basicValue":1}],[{"type":"Double","basicValue":1}]]},"label_counts":{"type":"Array","elements":[[{"type":"String","basicValue":"label"},{"type":"String","basicValue":"count"}],[{"type":"Double","basicValue":0},{"type":"Double","basicValue":3}],[{"type":"Double","basicValue":1},{"type":"Double","basicValue":3}]]},"cluster_centers":{"type":"Array","elements":[[{"type":"Double","basicValue":0.181818}],[{"type":"Double","basicValue":4.64324}]]},"n_iter":{"type":"Double","basicValue":13},"n_steps":{"type":"Double","basicValue":26}}}

Python Code

import numpy as np
from sklearn.cluster import MiniBatchKMeans as SklearnMiniBatchKMeans

def minibatch_kmeans(data, n_clusters=8, batch_size=1024, max_iter=100, max_no_improvement=10, random_state=None):
    """
    Cluster samples with mini-batch k-means for tabular data.

    See: https://scikit-learn.org/stable/modules/generated/sklearn.cluster.MiniBatchKMeans.html

    This example function is provided as-is without any representation of accuracy.

    Args:
        data (list[list]): 2D array of input data with rows as samples and columns as features.
        n_clusters (int, optional): Number of clusters to fit. Default is 8.
        batch_size (int, optional): Number of samples processed in each mini-batch. Default is 1024.
        max_iter (int, optional): Maximum number of full passes over the data. Default is 100.
        max_no_improvement (int, optional): Consecutive mini-batches without improvement before early stopping. Default is 10.
        random_state (int, optional): Integer seed for reproducible centroid initialization. Leave blank for non-deterministic runs. Default is None.

    Returns:
        dict: Excel data type containing inertia, labels, label counts, cluster centers, and iteration metadata.
    """
    def to2d(value):
        return [[value]] if not isinstance(value, list) else value

    def parse_matrix(value):
        value = to2d(value)
        if not isinstance(value, list) or not value or not all(isinstance(row, list) and row for row in value):
            return None, "Error: data must be a non-empty 2D list"
        if len({len(row) for row in value}) != 1:
            return None, "Error: data must be a rectangular 2D list"
        matrix = np.array(value, dtype=float)
        if matrix.ndim != 2 or matrix.size == 0:
            return None, "Error: data must be a non-empty 2D list"
        if not np.isfinite(matrix).all():
            return None, "Error: data must contain only finite numeric values"
        return matrix, None

    def as_column(values):
        return [[{"type": "Double", "basicValue": float(item)}] for item in values]

    def as_matrix(values):
        return [[{"type": "Double", "basicValue": float(item)} for item in row] for row in values]

    def label_count_table(labels):
        unique_labels, counts = np.unique(labels, return_counts=True)
        rows = [[{"type": "String", "basicValue": "label"}, {"type": "String", "basicValue": "count"}]]
        rows.extend(
            [[{"type": "Double", "basicValue": float(label)}, {"type": "Double", "basicValue": float(count)}]
             for label, count in zip(unique_labels.tolist(), counts.tolist())]
        )
        return rows

    try:
        data_np, error = parse_matrix(data)
        if error:
            return error

        cluster_total = int(n_clusters)
        if cluster_total < 1:
            return "Error: n_clusters must be at least 1"
        if cluster_total > data_np.shape[0]:
            return "Error: n_clusters cannot exceed the number of samples"
        if int(batch_size) < 1:
            return "Error: batch_size must be at least 1"
        if int(max_iter) < 1:
            return "Error: max_iter must be at least 1"
        if int(max_no_improvement) < 0:
            return "Error: max_no_improvement must be non-negative"

        seed = None if random_state in (None, "") else int(random_state)
        fitted = SklearnMiniBatchKMeans(
            n_clusters=cluster_total,
            batch_size=int(batch_size),
            max_iter=int(max_iter),
            max_no_improvement=int(max_no_improvement),
            random_state=seed,
            compute_labels=True,
            n_init="auto"
        ).fit(data_np)

        labels = fitted.labels_
        cluster_count = int(np.unique(labels).size)

        return {
            "type": "Double",
            "basicValue": float(fitted.inertia_),
            "properties": {
                "inertia": {"type": "Double", "basicValue": float(fitted.inertia_)},
                "cluster_count": {"type": "Double", "basicValue": float(cluster_count)},
                "labels": {"type": "Array", "elements": as_column(labels.tolist())},
                "label_counts": {"type": "Array", "elements": label_count_table(labels)},
                "cluster_centers": {"type": "Array", "elements": as_matrix(fitted.cluster_centers_.tolist())},
                "n_iter": {"type": "Double", "basicValue": float(fitted.n_iter_)},
                "n_steps": {"type": "Double", "basicValue": float(fitted.n_steps_)}
            }
        }
    except Exception as e:
        return f"Error: {str(e)}"

Online Calculator

2D array of input data with rows as samples and columns as features.
Number of clusters to fit.
Number of samples processed in each mini-batch.
Maximum number of full passes over the data.
Consecutive mini-batches without improvement before early stopping.
Integer seed for reproducible centroid initialization. Leave blank for non-deterministic runs.