MINIBATCH_KMEANS
Mini-batch k-means is a faster approximation of standard k-means that updates cluster centers using small random batches instead of the full dataset on every iteration.
Like standard k-means, it aims to minimize the within-cluster sum of squares objective function:
J = \sum_{j=1}^{k} \sum_{x_i \in S_j} ||x_i - \mu_j||^2
where k is the number of clusters, S_j is the j-th cluster, and \mu_j is the centroid of S_j.
This wrapper keeps label assignment enabled so the fitted result always includes labels, a compact label count table, cluster centers, and inertia for Excel-friendly inspection.
Excel Usage
=MINIBATCH_KMEANS(data, n_clusters, batch_size, max_iter, max_no_improvement, random_state)
data(list[list], required): 2D array of input data with rows as samples and columns as features.n_clusters(int, optional, default: 8): Number of clusters to fit.batch_size(int, optional, default: 1024): Number of samples processed in each mini-batch.max_iter(int, optional, default: 100): Maximum number of full passes over the data.max_no_improvement(int, optional, default: 10): Consecutive mini-batches without improvement before early stopping.random_state(int, optional, default: null): Integer seed for reproducible centroid initialization. Leave blank for non-deterministic runs.
Returns (dict): Excel data type containing inertia, labels, label counts, cluster centers, and iteration metadata.
Example 1: Fit two well-separated clusters with one batch pass
Inputs:
| data | n_clusters | batch_size | max_iter | max_no_improvement | random_state | |
|---|---|---|---|---|---|---|
| 0 | 0 | 2 | 6 | 100 | 10 | 0 |
| 0 | 1 | |||||
| 1 | 0 | |||||
| 5 | 5 | |||||
| 5 | 6 | |||||
| 6 | 5 |
Excel formula:
=MINIBATCH_KMEANS({0,0;0,1;1,0;5,5;5,6;6,5}, 2, 6, 100, 10, 0)
Expected output:
{"type":"Double","basicValue":2.73085,"properties":{"inertia":{"type":"Double","basicValue":2.73085},"cluster_count":{"type":"Double","basicValue":2},"labels":{"type":"Array","elements":[[{"type":"Double","basicValue":1}],[{"type":"Double","basicValue":1}],[{"type":"Double","basicValue":1}],[{"type":"Double","basicValue":0}],[{"type":"Double","basicValue":0}],[{"type":"Double","basicValue":0}]]},"label_counts":{"type":"Array","elements":[[{"type":"String","basicValue":"label"},{"type":"String","basicValue":"count"}],[{"type":"Double","basicValue":0},{"type":"Double","basicValue":3}],[{"type":"Double","basicValue":1},{"type":"Double","basicValue":3}]]},"cluster_centers":{"type":"Array","elements":[[{"type":"Double","basicValue":5.27273},{"type":"Double","basicValue":5.36364}],[{"type":"Double","basicValue":0.217391},{"type":"Double","basicValue":0.391304}]]},"n_iter":{"type":"Double","basicValue":15},"n_steps":{"type":"Double","basicValue":15}}}
Example 2: Fit three compact groups with smaller mini-batches
Inputs:
| data | n_clusters | batch_size | max_iter | max_no_improvement | random_state | |
|---|---|---|---|---|---|---|
| 1 | 1 | 3 | 4 | 120 | 8 | 5 |
| 1.2 | 0.8 | |||||
| 0.8 | 1.1 | |||||
| 8 | 8 | |||||
| 8.2 | 7.9 | |||||
| 7.8 | 8.1 | |||||
| 15 | 1 | |||||
| 15.2 | 0.8 | |||||
| 14.8 | 1.1 |
Excel formula:
=MINIBATCH_KMEANS({1,1;1.2,0.8;0.8,1.1;8,8;8.2,7.9;7.8,8.1;15,1;15.2,0.8;14.8,1.1}, 3, 4, 120, 8, 5)
Expected output:
{"type":"Double","basicValue":0.364608,"properties":{"inertia":{"type":"Double","basicValue":0.364608},"cluster_count":{"type":"Double","basicValue":3},"labels":{"type":"Array","elements":[[{"type":"Double","basicValue":0}],[{"type":"Double","basicValue":0}],[{"type":"Double","basicValue":0}],[{"type":"Double","basicValue":1}],[{"type":"Double","basicValue":1}],[{"type":"Double","basicValue":1}],[{"type":"Double","basicValue":2}],[{"type":"Double","basicValue":2}],[{"type":"Double","basicValue":2}]]},"label_counts":{"type":"Array","elements":[[{"type":"String","basicValue":"label"},{"type":"String","basicValue":"count"}],[{"type":"Double","basicValue":0},{"type":"Double","basicValue":3}],[{"type":"Double","basicValue":1},{"type":"Double","basicValue":3}],[{"type":"Double","basicValue":2},{"type":"Double","basicValue":3}]]},"cluster_centers":{"type":"Array","elements":[[{"type":"Double","basicValue":0.978947},{"type":"Double","basicValue":0.989474}],[{"type":"Double","basicValue":8},{"type":"Double","basicValue":8}],[{"type":"Double","basicValue":15.0429},{"type":"Double","basicValue":0.935714}]]},"n_iter":{"type":"Double","basicValue":6},"n_steps":{"type":"Double","basicValue":12}}}
Example 3: Cluster a single Excel cell with one centroid
Inputs:
| data | n_clusters | batch_size | max_iter | max_no_improvement | random_state |
|---|---|---|---|---|---|
| 5 | 1 | 1 | 20 | 5 | 0 |
Excel formula:
=MINIBATCH_KMEANS(5, 1, 1, 20, 5, 0)
Expected output:
{"type":"Double","basicValue":0,"properties":{"inertia":{"type":"Double","basicValue":0},"cluster_count":{"type":"Double","basicValue":1},"labels":{"type":"Array","elements":[[{"type":"Double","basicValue":0}]]},"label_counts":{"type":"Array","elements":[[{"type":"String","basicValue":"label"},{"type":"String","basicValue":"count"}],[{"type":"Double","basicValue":0},{"type":"Double","basicValue":1}]]},"cluster_centers":{"type":"Array","elements":[[{"type":"Double","basicValue":5}]]},"n_iter":{"type":"Double","basicValue":7},"n_steps":{"type":"Double","basicValue":7}}}
Example 4: Cluster one-dimensional samples into two minibatch groups
Inputs:
| data | n_clusters | batch_size | max_iter | max_no_improvement | random_state |
|---|---|---|---|---|---|
| 0 | 2 | 3 | 80 | 6 | 3 |
| 0.2 | |||||
| 0.4 | |||||
| 4.8 | |||||
| 5 | |||||
| 5.2 |
Excel formula:
=MINIBATCH_KMEANS({0;0.2;0.4;4.8;5;5.2}, 2, 3, 80, 6, 3)
Expected output:
{"type":"Double","basicValue":0.542818,"properties":{"inertia":{"type":"Double","basicValue":0.542818},"cluster_count":{"type":"Double","basicValue":2},"labels":{"type":"Array","elements":[[{"type":"Double","basicValue":0}],[{"type":"Double","basicValue":0}],[{"type":"Double","basicValue":0}],[{"type":"Double","basicValue":1}],[{"type":"Double","basicValue":1}],[{"type":"Double","basicValue":1}]]},"label_counts":{"type":"Array","elements":[[{"type":"String","basicValue":"label"},{"type":"String","basicValue":"count"}],[{"type":"Double","basicValue":0},{"type":"Double","basicValue":3}],[{"type":"Double","basicValue":1},{"type":"Double","basicValue":3}]]},"cluster_centers":{"type":"Array","elements":[[{"type":"Double","basicValue":0.181818}],[{"type":"Double","basicValue":4.64324}]]},"n_iter":{"type":"Double","basicValue":13},"n_steps":{"type":"Double","basicValue":26}}}
Python Code
import numpy as np
from sklearn.cluster import MiniBatchKMeans as SklearnMiniBatchKMeans
def minibatch_kmeans(data, n_clusters=8, batch_size=1024, max_iter=100, max_no_improvement=10, random_state=None):
"""
Cluster samples with mini-batch k-means for tabular data.
See: https://scikit-learn.org/stable/modules/generated/sklearn.cluster.MiniBatchKMeans.html
This example function is provided as-is without any representation of accuracy.
Args:
data (list[list]): 2D array of input data with rows as samples and columns as features.
n_clusters (int, optional): Number of clusters to fit. Default is 8.
batch_size (int, optional): Number of samples processed in each mini-batch. Default is 1024.
max_iter (int, optional): Maximum number of full passes over the data. Default is 100.
max_no_improvement (int, optional): Consecutive mini-batches without improvement before early stopping. Default is 10.
random_state (int, optional): Integer seed for reproducible centroid initialization. Leave blank for non-deterministic runs. Default is None.
Returns:
dict: Excel data type containing inertia, labels, label counts, cluster centers, and iteration metadata.
"""
def to2d(value):
return [[value]] if not isinstance(value, list) else value
def parse_matrix(value):
value = to2d(value)
if not isinstance(value, list) or not value or not all(isinstance(row, list) and row for row in value):
return None, "Error: data must be a non-empty 2D list"
if len({len(row) for row in value}) != 1:
return None, "Error: data must be a rectangular 2D list"
matrix = np.array(value, dtype=float)
if matrix.ndim != 2 or matrix.size == 0:
return None, "Error: data must be a non-empty 2D list"
if not np.isfinite(matrix).all():
return None, "Error: data must contain only finite numeric values"
return matrix, None
def as_column(values):
return [[{"type": "Double", "basicValue": float(item)}] for item in values]
def as_matrix(values):
return [[{"type": "Double", "basicValue": float(item)} for item in row] for row in values]
def label_count_table(labels):
unique_labels, counts = np.unique(labels, return_counts=True)
rows = [[{"type": "String", "basicValue": "label"}, {"type": "String", "basicValue": "count"}]]
rows.extend(
[[{"type": "Double", "basicValue": float(label)}, {"type": "Double", "basicValue": float(count)}]
for label, count in zip(unique_labels.tolist(), counts.tolist())]
)
return rows
try:
data_np, error = parse_matrix(data)
if error:
return error
cluster_total = int(n_clusters)
if cluster_total < 1:
return "Error: n_clusters must be at least 1"
if cluster_total > data_np.shape[0]:
return "Error: n_clusters cannot exceed the number of samples"
if int(batch_size) < 1:
return "Error: batch_size must be at least 1"
if int(max_iter) < 1:
return "Error: max_iter must be at least 1"
if int(max_no_improvement) < 0:
return "Error: max_no_improvement must be non-negative"
seed = None if random_state in (None, "") else int(random_state)
fitted = SklearnMiniBatchKMeans(
n_clusters=cluster_total,
batch_size=int(batch_size),
max_iter=int(max_iter),
max_no_improvement=int(max_no_improvement),
random_state=seed,
compute_labels=True,
n_init="auto"
).fit(data_np)
labels = fitted.labels_
cluster_count = int(np.unique(labels).size)
return {
"type": "Double",
"basicValue": float(fitted.inertia_),
"properties": {
"inertia": {"type": "Double", "basicValue": float(fitted.inertia_)},
"cluster_count": {"type": "Double", "basicValue": float(cluster_count)},
"labels": {"type": "Array", "elements": as_column(labels.tolist())},
"label_counts": {"type": "Array", "elements": label_count_table(labels)},
"cluster_centers": {"type": "Array", "elements": as_matrix(fitted.cluster_centers_.tolist())},
"n_iter": {"type": "Double", "basicValue": float(fitted.n_iter_)},
"n_steps": {"type": "Double", "basicValue": float(fitted.n_steps_)}
}
}
except Exception as e:
return f"Error: {str(e)}"