DBSCAN

DBSCAN groups samples into dense regions without requiring a fixed cluster count in advance. Samples that do not belong to any dense region are labeled as noise.

The algorithm is based on the concept of an \epsilon-neighborhood for a given point x:

N_{\epsilon}(x) = \{ y \in D \mid \text{dist}(x, y) \le \epsilon \}

A point x is considered a core point if the number of points in its neighborhood is at least min_samples (i.e., |N_{\epsilon}(x)| \ge \text{min\_samples}).

This wrapper accepts data with rows as samples and columns as features. It returns fitted labels, a compact label count table, the number of discovered clusters excluding noise, and the number of noise points.

Excel Usage

=DBSCAN(data, eps, min_samples, dbscan_metric)
  • data (list[list], required): 2D array of input data with rows as samples and columns as features.
  • eps (float, optional, default: 0.5): Maximum neighborhood radius for connecting nearby samples.
  • min_samples (int, optional, default: 5): Minimum neighborhood size needed for a core sample.
  • dbscan_metric (str, optional, default: “euclidean”): Distance metric used to measure neighborhood reachability.

Returns (dict): Excel data type containing cluster counts, labels, label counts, and core sample indices.

Example 1: Detect two dense groups and one noise point

Inputs:

data eps min_samples dbscan_metric
0 0 0.25 2 euclidean
0 0.1
5 5
5.1 5
10 10

Excel formula:

=DBSCAN({0,0;0,0.1;5,5;5.1,5;10,10}, 0.25, 2, "euclidean")

Expected output:

{"type":"Double","basicValue":2,"properties":{"cluster_count":{"type":"Double","basicValue":2},"noise_count":{"type":"Double","basicValue":1},"core_sample_count":{"type":"Double","basicValue":4},"labels":{"type":"Array","elements":[[{"type":"Double","basicValue":0}],[{"type":"Double","basicValue":0}],[{"type":"Double","basicValue":1}],[{"type":"Double","basicValue":1}],[{"type":"Double","basicValue":-1}]]},"label_counts":{"type":"Array","elements":[[{"type":"String","basicValue":"label"},{"type":"String","basicValue":"count"}],[{"type":"Double","basicValue":-1},{"type":"Double","basicValue":1}],[{"type":"Double","basicValue":0},{"type":"Double","basicValue":2}],[{"type":"Double","basicValue":1},{"type":"Double","basicValue":2}]]},"core_sample_indices":{"type":"Array","elements":[[{"type":"Double","basicValue":0}],[{"type":"Double","basicValue":1}],[{"type":"Double","basicValue":2}],[{"type":"Double","basicValue":3}]]}}}

Example 2: Treat a single Excel cell as one dense sample

Inputs:

data eps min_samples dbscan_metric
5 0.5 1 euclidean

Excel formula:

=DBSCAN(5, 0.5, 1, "euclidean")

Expected output:

{"type":"Double","basicValue":1,"properties":{"cluster_count":{"type":"Double","basicValue":1},"noise_count":{"type":"Double","basicValue":0},"core_sample_count":{"type":"Double","basicValue":1},"labels":{"type":"Array","elements":[[{"type":"Double","basicValue":0}]]},"label_counts":{"type":"Array","elements":[[{"type":"String","basicValue":"label"},{"type":"String","basicValue":"count"}],[{"type":"Double","basicValue":0},{"type":"Double","basicValue":1}]]},"core_sample_indices":{"type":"Array","elements":[[{"type":"Double","basicValue":0}]]}}}

Example 3: Use Manhattan distance for two rectangular groups

Inputs:

data eps min_samples dbscan_metric
0 0 1.5 2 manhattan
0 1
0 2
6 6
6 7
12 12

Excel formula:

=DBSCAN({0,0;0,1;0,2;6,6;6,7;12,12}, 1.5, 2, "manhattan")

Expected output:

{"type":"Double","basicValue":2,"properties":{"cluster_count":{"type":"Double","basicValue":2},"noise_count":{"type":"Double","basicValue":1},"core_sample_count":{"type":"Double","basicValue":5},"labels":{"type":"Array","elements":[[{"type":"Double","basicValue":0}],[{"type":"Double","basicValue":0}],[{"type":"Double","basicValue":0}],[{"type":"Double","basicValue":1}],[{"type":"Double","basicValue":1}],[{"type":"Double","basicValue":-1}]]},"label_counts":{"type":"Array","elements":[[{"type":"String","basicValue":"label"},{"type":"String","basicValue":"count"}],[{"type":"Double","basicValue":-1},{"type":"Double","basicValue":1}],[{"type":"Double","basicValue":0},{"type":"Double","basicValue":3}],[{"type":"Double","basicValue":1},{"type":"Double","basicValue":2}]]},"core_sample_indices":{"type":"Array","elements":[[{"type":"Double","basicValue":0}],[{"type":"Double","basicValue":1}],[{"type":"Double","basicValue":2}],[{"type":"Double","basicValue":3}],[{"type":"Double","basicValue":4}]]}}}

Example 4: Cluster one-dimensional samples with the L1 metric

Inputs:

data eps min_samples dbscan_metric
0 0.25 2 l1
0.1
0.2
4.8
5
8

Excel formula:

=DBSCAN({0;0.1;0.2;4.8;5;8}, 0.25, 2, "l1")

Expected output:

{"type":"Double","basicValue":2,"properties":{"cluster_count":{"type":"Double","basicValue":2},"noise_count":{"type":"Double","basicValue":1},"core_sample_count":{"type":"Double","basicValue":5},"labels":{"type":"Array","elements":[[{"type":"Double","basicValue":0}],[{"type":"Double","basicValue":0}],[{"type":"Double","basicValue":0}],[{"type":"Double","basicValue":1}],[{"type":"Double","basicValue":1}],[{"type":"Double","basicValue":-1}]]},"label_counts":{"type":"Array","elements":[[{"type":"String","basicValue":"label"},{"type":"String","basicValue":"count"}],[{"type":"Double","basicValue":-1},{"type":"Double","basicValue":1}],[{"type":"Double","basicValue":0},{"type":"Double","basicValue":3}],[{"type":"Double","basicValue":1},{"type":"Double","basicValue":2}]]},"core_sample_indices":{"type":"Array","elements":[[{"type":"Double","basicValue":0}],[{"type":"Double","basicValue":1}],[{"type":"Double","basicValue":2}],[{"type":"Double","basicValue":3}],[{"type":"Double","basicValue":4}]]}}}

Python Code

import numpy as np
from sklearn.cluster import DBSCAN as SklearnDBSCAN

def dbscan(data, eps=0.5, min_samples=5, dbscan_metric='euclidean'):
    """
    Cluster samples by density and mark outliers as noise.

    See: https://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html

    This example function is provided as-is without any representation of accuracy.

    Args:
        data (list[list]): 2D array of input data with rows as samples and columns as features.
        eps (float, optional): Maximum neighborhood radius for connecting nearby samples. Default is 0.5.
        min_samples (int, optional): Minimum neighborhood size needed for a core sample. Default is 5.
        dbscan_metric (str, optional): Distance metric used to measure neighborhood reachability. Valid options: Euclidean, Manhattan, L1, L2, Cosine. Default is 'euclidean'.

    Returns:
        dict: Excel data type containing cluster counts, labels, label counts, and core sample indices.
    """
    def to2d(value):
        return [[value]] if not isinstance(value, list) else value

    def parse_matrix(value):
        value = to2d(value)
        if not isinstance(value, list) or not value or not all(isinstance(row, list) and row for row in value):
            return None, "Error: data must be a non-empty 2D list"
        if len({len(row) for row in value}) != 1:
            return None, "Error: data must be a rectangular 2D list"
        matrix = np.array(value, dtype=float)
        if matrix.ndim != 2 or matrix.size == 0:
            return None, "Error: data must be a non-empty 2D list"
        if not np.isfinite(matrix).all():
            return None, "Error: data must contain only finite numeric values"
        return matrix, None

    def as_column(values):
        return [[{"type": "Double", "basicValue": float(item)}] for item in values]

    def label_count_table(labels):
        unique_labels, counts = np.unique(labels, return_counts=True)
        rows = [[{"type": "String", "basicValue": "label"}, {"type": "String", "basicValue": "count"}]]
        rows.extend(
            [[{"type": "Double", "basicValue": float(label)}, {"type": "Double", "basicValue": float(count)}]
             for label, count in zip(unique_labels.tolist(), counts.tolist())]
        )
        return rows

    try:
        data_np, error = parse_matrix(data)
        if error:
            return error

        if float(eps) <= 0:
            return "Error: eps must be greater than 0"
        if int(min_samples) < 1:
            return "Error: min_samples must be at least 1"

        metric_value = str(dbscan_metric).strip()
        if metric_value not in {"euclidean", "manhattan", "l1", "l2", "cosine"}:
            return "Error: metric must be one of 'euclidean', 'manhattan', 'l1', 'l2', or 'cosine'"

        fitted = SklearnDBSCAN(
            eps=float(eps),
            min_samples=int(min_samples),
            metric=metric_value
        ).fit(data_np)

        labels = fitted.labels_
        noise_count = int(np.sum(labels == -1))
        cluster_count = int(len([label for label in np.unique(labels).tolist() if label != -1]))
        properties = {
            "cluster_count": {"type": "Double", "basicValue": float(cluster_count)},
            "noise_count": {"type": "Double", "basicValue": float(noise_count)},
            "core_sample_count": {"type": "Double", "basicValue": float(len(fitted.core_sample_indices_))},
            "labels": {"type": "Array", "elements": as_column(labels.tolist())},
            "label_counts": {"type": "Array", "elements": label_count_table(labels)}
        }

        if len(fitted.core_sample_indices_) > 0:
            properties["core_sample_indices"] = {"type": "Array", "elements": as_column(fitted.core_sample_indices_.tolist())}

        return {
            "type": "Double",
            "basicValue": float(cluster_count),
            "properties": properties
        }
    except Exception as e:
        return f"Error: {str(e)}"

Online Calculator

2D array of input data with rows as samples and columns as features.
Maximum neighborhood radius for connecting nearby samples.
Minimum neighborhood size needed for a core sample.
Distance metric used to measure neighborhood reachability.