DBSCAN
DBSCAN groups samples into dense regions without requiring a fixed cluster count in advance. Samples that do not belong to any dense region are labeled as noise.
The algorithm is based on the concept of an \epsilon-neighborhood for a given point x:
N_{\epsilon}(x) = \{ y \in D \mid \text{dist}(x, y) \le \epsilon \}
A point x is considered a core point if the number of points in its neighborhood is at least min_samples (i.e., |N_{\epsilon}(x)| \ge \text{min\_samples}).
This wrapper accepts data with rows as samples and columns as features. It returns fitted labels, a compact label count table, the number of discovered clusters excluding noise, and the number of noise points.
Excel Usage
=DBSCAN(data, eps, min_samples, dbscan_metric)
data(list[list], required): 2D array of input data with rows as samples and columns as features.eps(float, optional, default: 0.5): Maximum neighborhood radius for connecting nearby samples.min_samples(int, optional, default: 5): Minimum neighborhood size needed for a core sample.dbscan_metric(str, optional, default: “euclidean”): Distance metric used to measure neighborhood reachability.
Returns (dict): Excel data type containing cluster counts, labels, label counts, and core sample indices.
Example 1: Detect two dense groups and one noise point
Inputs:
| data | eps | min_samples | dbscan_metric | |
|---|---|---|---|---|
| 0 | 0 | 0.25 | 2 | euclidean |
| 0 | 0.1 | |||
| 5 | 5 | |||
| 5.1 | 5 | |||
| 10 | 10 |
Excel formula:
=DBSCAN({0,0;0,0.1;5,5;5.1,5;10,10}, 0.25, 2, "euclidean")
Expected output:
{"type":"Double","basicValue":2,"properties":{"cluster_count":{"type":"Double","basicValue":2},"noise_count":{"type":"Double","basicValue":1},"core_sample_count":{"type":"Double","basicValue":4},"labels":{"type":"Array","elements":[[{"type":"Double","basicValue":0}],[{"type":"Double","basicValue":0}],[{"type":"Double","basicValue":1}],[{"type":"Double","basicValue":1}],[{"type":"Double","basicValue":-1}]]},"label_counts":{"type":"Array","elements":[[{"type":"String","basicValue":"label"},{"type":"String","basicValue":"count"}],[{"type":"Double","basicValue":-1},{"type":"Double","basicValue":1}],[{"type":"Double","basicValue":0},{"type":"Double","basicValue":2}],[{"type":"Double","basicValue":1},{"type":"Double","basicValue":2}]]},"core_sample_indices":{"type":"Array","elements":[[{"type":"Double","basicValue":0}],[{"type":"Double","basicValue":1}],[{"type":"Double","basicValue":2}],[{"type":"Double","basicValue":3}]]}}}
Example 2: Treat a single Excel cell as one dense sample
Inputs:
| data | eps | min_samples | dbscan_metric |
|---|---|---|---|
| 5 | 0.5 | 1 | euclidean |
Excel formula:
=DBSCAN(5, 0.5, 1, "euclidean")
Expected output:
{"type":"Double","basicValue":1,"properties":{"cluster_count":{"type":"Double","basicValue":1},"noise_count":{"type":"Double","basicValue":0},"core_sample_count":{"type":"Double","basicValue":1},"labels":{"type":"Array","elements":[[{"type":"Double","basicValue":0}]]},"label_counts":{"type":"Array","elements":[[{"type":"String","basicValue":"label"},{"type":"String","basicValue":"count"}],[{"type":"Double","basicValue":0},{"type":"Double","basicValue":1}]]},"core_sample_indices":{"type":"Array","elements":[[{"type":"Double","basicValue":0}]]}}}
Example 3: Use Manhattan distance for two rectangular groups
Inputs:
| data | eps | min_samples | dbscan_metric | |
|---|---|---|---|---|
| 0 | 0 | 1.5 | 2 | manhattan |
| 0 | 1 | |||
| 0 | 2 | |||
| 6 | 6 | |||
| 6 | 7 | |||
| 12 | 12 |
Excel formula:
=DBSCAN({0,0;0,1;0,2;6,6;6,7;12,12}, 1.5, 2, "manhattan")
Expected output:
{"type":"Double","basicValue":2,"properties":{"cluster_count":{"type":"Double","basicValue":2},"noise_count":{"type":"Double","basicValue":1},"core_sample_count":{"type":"Double","basicValue":5},"labels":{"type":"Array","elements":[[{"type":"Double","basicValue":0}],[{"type":"Double","basicValue":0}],[{"type":"Double","basicValue":0}],[{"type":"Double","basicValue":1}],[{"type":"Double","basicValue":1}],[{"type":"Double","basicValue":-1}]]},"label_counts":{"type":"Array","elements":[[{"type":"String","basicValue":"label"},{"type":"String","basicValue":"count"}],[{"type":"Double","basicValue":-1},{"type":"Double","basicValue":1}],[{"type":"Double","basicValue":0},{"type":"Double","basicValue":3}],[{"type":"Double","basicValue":1},{"type":"Double","basicValue":2}]]},"core_sample_indices":{"type":"Array","elements":[[{"type":"Double","basicValue":0}],[{"type":"Double","basicValue":1}],[{"type":"Double","basicValue":2}],[{"type":"Double","basicValue":3}],[{"type":"Double","basicValue":4}]]}}}
Example 4: Cluster one-dimensional samples with the L1 metric
Inputs:
| data | eps | min_samples | dbscan_metric |
|---|---|---|---|
| 0 | 0.25 | 2 | l1 |
| 0.1 | |||
| 0.2 | |||
| 4.8 | |||
| 5 | |||
| 8 |
Excel formula:
=DBSCAN({0;0.1;0.2;4.8;5;8}, 0.25, 2, "l1")
Expected output:
{"type":"Double","basicValue":2,"properties":{"cluster_count":{"type":"Double","basicValue":2},"noise_count":{"type":"Double","basicValue":1},"core_sample_count":{"type":"Double","basicValue":5},"labels":{"type":"Array","elements":[[{"type":"Double","basicValue":0}],[{"type":"Double","basicValue":0}],[{"type":"Double","basicValue":0}],[{"type":"Double","basicValue":1}],[{"type":"Double","basicValue":1}],[{"type":"Double","basicValue":-1}]]},"label_counts":{"type":"Array","elements":[[{"type":"String","basicValue":"label"},{"type":"String","basicValue":"count"}],[{"type":"Double","basicValue":-1},{"type":"Double","basicValue":1}],[{"type":"Double","basicValue":0},{"type":"Double","basicValue":3}],[{"type":"Double","basicValue":1},{"type":"Double","basicValue":2}]]},"core_sample_indices":{"type":"Array","elements":[[{"type":"Double","basicValue":0}],[{"type":"Double","basicValue":1}],[{"type":"Double","basicValue":2}],[{"type":"Double","basicValue":3}],[{"type":"Double","basicValue":4}]]}}}
Python Code
import numpy as np
from sklearn.cluster import DBSCAN as SklearnDBSCAN
def dbscan(data, eps=0.5, min_samples=5, dbscan_metric='euclidean'):
"""
Cluster samples by density and mark outliers as noise.
See: https://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html
This example function is provided as-is without any representation of accuracy.
Args:
data (list[list]): 2D array of input data with rows as samples and columns as features.
eps (float, optional): Maximum neighborhood radius for connecting nearby samples. Default is 0.5.
min_samples (int, optional): Minimum neighborhood size needed for a core sample. Default is 5.
dbscan_metric (str, optional): Distance metric used to measure neighborhood reachability. Valid options: Euclidean, Manhattan, L1, L2, Cosine. Default is 'euclidean'.
Returns:
dict: Excel data type containing cluster counts, labels, label counts, and core sample indices.
"""
def to2d(value):
return [[value]] if not isinstance(value, list) else value
def parse_matrix(value):
value = to2d(value)
if not isinstance(value, list) or not value or not all(isinstance(row, list) and row for row in value):
return None, "Error: data must be a non-empty 2D list"
if len({len(row) for row in value}) != 1:
return None, "Error: data must be a rectangular 2D list"
matrix = np.array(value, dtype=float)
if matrix.ndim != 2 or matrix.size == 0:
return None, "Error: data must be a non-empty 2D list"
if not np.isfinite(matrix).all():
return None, "Error: data must contain only finite numeric values"
return matrix, None
def as_column(values):
return [[{"type": "Double", "basicValue": float(item)}] for item in values]
def label_count_table(labels):
unique_labels, counts = np.unique(labels, return_counts=True)
rows = [[{"type": "String", "basicValue": "label"}, {"type": "String", "basicValue": "count"}]]
rows.extend(
[[{"type": "Double", "basicValue": float(label)}, {"type": "Double", "basicValue": float(count)}]
for label, count in zip(unique_labels.tolist(), counts.tolist())]
)
return rows
try:
data_np, error = parse_matrix(data)
if error:
return error
if float(eps) <= 0:
return "Error: eps must be greater than 0"
if int(min_samples) < 1:
return "Error: min_samples must be at least 1"
metric_value = str(dbscan_metric).strip()
if metric_value not in {"euclidean", "manhattan", "l1", "l2", "cosine"}:
return "Error: metric must be one of 'euclidean', 'manhattan', 'l1', 'l2', or 'cosine'"
fitted = SklearnDBSCAN(
eps=float(eps),
min_samples=int(min_samples),
metric=metric_value
).fit(data_np)
labels = fitted.labels_
noise_count = int(np.sum(labels == -1))
cluster_count = int(len([label for label in np.unique(labels).tolist() if label != -1]))
properties = {
"cluster_count": {"type": "Double", "basicValue": float(cluster_count)},
"noise_count": {"type": "Double", "basicValue": float(noise_count)},
"core_sample_count": {"type": "Double", "basicValue": float(len(fitted.core_sample_indices_))},
"labels": {"type": "Array", "elements": as_column(labels.tolist())},
"label_counts": {"type": "Array", "elements": label_count_table(labels)}
}
if len(fitted.core_sample_indices_) > 0:
properties["core_sample_indices"] = {"type": "Array", "elements": as_column(fitted.core_sample_indices_.tolist())}
return {
"type": "Double",
"basicValue": float(cluster_count),
"properties": properties
}
except Exception as e:
return f"Error: {str(e)}"