VARIANCE_SEL

Variance threshold feature selection removes columns whose training-set variance falls below a chosen cutoff. It is a simple unsupervised filter for dropping constant or nearly constant features before modeling.

The sample variance for a feature X with n samples is computed as:

\text{Var}(X) = \frac{1}{n} \sum_{i=1}^n (x_i - \mu)^2

where \mu is the mean of the feature.

This wrapper accepts rows as samples and columns as features. It returns the filtered matrix together with per-feature variances, a boolean support mask, and the 1-based indices of the selected columns.

Excel Usage

=VARIANCE_SEL(data, threshold)
  • data (list[list], required): 2D array of numeric input data with rows as samples and columns as features.
  • threshold (float, optional, default: 0): Minimum variance required for a feature to be kept.

Returns (dict): Excel data type containing the selected feature matrix, support mask, and per-feature variances.

Example 1: Remove constant columns with the default threshold

Inputs:

data threshold
1 0 5 10 0
1 1 5 11
1 0 5 12
1 1 5 13

Excel formula:

=VARIANCE_SEL({1,0,5,10;1,1,5,11;1,0,5,12;1,1,5,13}, 0)

Expected output:

{"type":"Double","basicValue":2,"properties":{"selected_feature_count":{"type":"Double","basicValue":2},"sample_count":{"type":"Double","basicValue":4},"feature_count":{"type":"Double","basicValue":4},"selected_indices":{"type":"Array","elements":[[{"type":"Double","basicValue":2}],[{"type":"Double","basicValue":4}]]},"support_mask":{"type":"Array","elements":[[{"type":"Boolean","basicValue":false}],[{"type":"Boolean","basicValue":true}],[{"type":"Boolean","basicValue":false}],[{"type":"Boolean","basicValue":true}]]},"variances":{"type":"Array","elements":[[{"type":"Double","basicValue":0}],[{"type":"Double","basicValue":0.25}],[{"type":"Double","basicValue":0}],[{"type":"Double","basicValue":1.25}]]},"selected_data":{"type":"Array","elements":[[{"type":"Double","basicValue":0},{"type":"Double","basicValue":10}],[{"type":"Double","basicValue":1},{"type":"Double","basicValue":11}],[{"type":"Double","basicValue":0},{"type":"Double","basicValue":12}],[{"type":"Double","basicValue":1},{"type":"Double","basicValue":13}]]}}}

Example 2: Drop weakly varying columns with a positive threshold

Inputs:

data threshold
0 1 10 0.01
0.1 1.1 11
0 0.9 12
0.1 1 13
0 1.1 14

Excel formula:

=VARIANCE_SEL({0,1,10;0.1,1.1,11;0,0.9,12;0.1,1,13;0,1.1,14}, 0.01)

Expected output:

{"type":"Double","basicValue":1,"properties":{"selected_feature_count":{"type":"Double","basicValue":1},"sample_count":{"type":"Double","basicValue":5},"feature_count":{"type":"Double","basicValue":3},"selected_indices":{"type":"Array","elements":[[{"type":"Double","basicValue":3}]]},"support_mask":{"type":"Array","elements":[[{"type":"Boolean","basicValue":false}],[{"type":"Boolean","basicValue":false}],[{"type":"Boolean","basicValue":true}]]},"variances":{"type":"Array","elements":[[{"type":"Double","basicValue":0.0024}],[{"type":"Double","basicValue":0.0056}],[{"type":"Double","basicValue":2}]]},"selected_data":{"type":"Array","elements":[[{"type":"Double","basicValue":10}],[{"type":"Double","basicValue":11}],[{"type":"Double","basicValue":12}],[{"type":"Double","basicValue":13}],[{"type":"Double","basicValue":14}]]}}}

Example 3: Keep a single varying feature column

Inputs:

data threshold
2 0
3
5
8
13

Excel formula:

=VARIANCE_SEL({2;3;5;8;13}, 0)

Expected output:

{"type":"Double","basicValue":1,"properties":{"selected_feature_count":{"type":"Double","basicValue":1},"sample_count":{"type":"Double","basicValue":5},"feature_count":{"type":"Double","basicValue":1},"selected_indices":{"type":"Array","elements":[[{"type":"Double","basicValue":1}]]},"support_mask":{"type":"Array","elements":[[{"type":"Boolean","basicValue":true}]]},"variances":{"type":"Array","elements":[[{"type":"Double","basicValue":11}]]},"selected_data":{"type":"Array","elements":[[{"type":"Double","basicValue":2}],[{"type":"Double","basicValue":3}],[{"type":"Double","basicValue":5}],[{"type":"Double","basicValue":8}],[{"type":"Double","basicValue":13}]]}}}

Example 4: Keep only the highest-variance columns with a larger threshold

Inputs:

data threshold
10 0 100 30
12 0 110
14 0 130
16 0 160
18 0 200

Excel formula:

=VARIANCE_SEL({10,0,100;12,0,110;14,0,130;16,0,160;18,0,200}, 30)

Expected output:

{"type":"Double","basicValue":1,"properties":{"selected_feature_count":{"type":"Double","basicValue":1},"sample_count":{"type":"Double","basicValue":5},"feature_count":{"type":"Double","basicValue":3},"selected_indices":{"type":"Array","elements":[[{"type":"Double","basicValue":3}]]},"support_mask":{"type":"Array","elements":[[{"type":"Boolean","basicValue":false}],[{"type":"Boolean","basicValue":false}],[{"type":"Boolean","basicValue":true}]]},"variances":{"type":"Array","elements":[[{"type":"Double","basicValue":8}],[{"type":"Double","basicValue":0}],[{"type":"Double","basicValue":1320}]]},"selected_data":{"type":"Array","elements":[[{"type":"Double","basicValue":100}],[{"type":"Double","basicValue":110}],[{"type":"Double","basicValue":130}],[{"type":"Double","basicValue":160}],[{"type":"Double","basicValue":200}]]}}}

Python Code

import numpy as np
from sklearn.feature_selection import VarianceThreshold as SklearnVarianceThreshold

def variance_sel(data, threshold=0):
    """
    Remove low-variance features and return the selected columns with variance summaries.

    See: https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.VarianceThreshold.html

    This example function is provided as-is without any representation of accuracy.

    Args:
        data (list[list]): 2D array of numeric input data with rows as samples and columns as features.
        threshold (float, optional): Minimum variance required for a feature to be kept. Default is 0.

    Returns:
        dict: Excel data type containing the selected feature matrix, support mask, and per-feature variances.
    """
    def py(value):
        return value.item() if isinstance(value, np.generic) else value

    def cell(value):
        value = py(value)
        if isinstance(value, bool):
            return {"type": "Boolean", "basicValue": bool(value)}
        if isinstance(value, (int, float)) and not isinstance(value, bool):
            return {"type": "Double", "basicValue": float(value)}
        return {"type": "String", "basicValue": str(value)}

    def col(values):
        return [[cell(value)] for value in values]

    def mat(values):
        return [[cell(value) for value in row] for row in values]

    def parse_data(value):
        value = [[value]] if not isinstance(value, list) else value
        if not isinstance(value, list) or not value or not all(isinstance(row, list) and row for row in value):
            return None, "Error: data must be a non-empty 2D list"
        if len({len(row) for row in value}) != 1:
            return None, "Error: data must be a rectangular 2D list"
        data_np = np.array(value, dtype=float)
        if data_np.ndim != 2 or data_np.size == 0:
            return None, "Error: data must be a non-empty 2D list"
        if not np.isfinite(data_np).all():
            return None, "Error: data must contain only finite numeric values"
        return data_np, None

    try:
        data_np, error = parse_data(data)
        if error:
            return error

        threshold_value = float(threshold)
        if threshold_value < 0:
            return "Error: threshold must be non-negative"

        fitted = SklearnVarianceThreshold(threshold=threshold_value)
        transformed_np = np.asarray(fitted.fit_transform(data_np), dtype=float)
        variances = np.atleast_1d(np.asarray(fitted.variances_, dtype=float))
        support_mask = fitted.get_support().tolist()
        selected_indices = [index + 1 for index, keep in enumerate(support_mask) if keep]

        return {
            "type": "Double",
            "basicValue": float(len(selected_indices)),
            "properties": {
                "selected_feature_count": {"type": "Double", "basicValue": float(len(selected_indices))},
                "sample_count": {"type": "Double", "basicValue": float(data_np.shape[0])},
                "feature_count": {"type": "Double", "basicValue": float(data_np.shape[1])},
                "selected_indices": {"type": "Array", "elements": col(selected_indices)},
                "support_mask": {"type": "Array", "elements": col(support_mask)},
                "variances": {"type": "Array", "elements": col(variances.tolist())},
                "selected_data": {"type": "Array", "elements": mat(transformed_np.tolist())}
            }
        }
    except Exception as e:
        return f"Error: {str(e)}"

Online Calculator

2D array of numeric input data with rows as samples and columns as features.
Minimum variance required for a feature to be kept.