pyspark.sql.plot.core.PySparkPlotAccessor.kde

PySparkPlotAccessor.kde(bw_method, column=None, ind=None, **kwargs)

Generate a Kernel Density Estimate plot using Gaussian kernels.

In statistics, kernel density estimation (KDE) is a non-parametric way to estimate the probability density function (PDF) of a random variable. This function uses Gaussian kernels, with the bandwidth given by bw_method.
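To make the idea of a Gaussian KDE concrete, here is a minimal pure-Python sketch. It is not PySpark's implementation: the function name `gaussian_kde` is hypothetical, and the sketch assumes the bandwidth is used directly as the standard deviation of each Gaussian kernel.

```python
import math


def gaussian_kde(samples, bandwidth, points):
    """Evaluate a fixed-bandwidth Gaussian KDE at each value in `points`.

    Hypothetical helper for illustration only. Assumes `bandwidth` is the
    standard deviation of each Gaussian kernel.
    """
    n = len(samples)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))
    densities = []
    for x in points:
        # Sum one Gaussian bump per sample, centred on that sample.
        total = sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples)
        densities.append(norm * total)
    return densities


# The "length" values from the example data on this page.
lengths = [5.1, 4.9, 7.0, 6.4, 5.9]
step = 0.05
grid = [3.0 + step * i for i in range(121)]  # 3.0 through 9.0

density = gaussian_kde(lengths, bandwidth=0.3, points=grid)

# A Riemann sum over the grid approximates the integral of the density.
area = sum(density) * step
```

Because the estimate is a probability density, `area` should come out close to 1 as long as the grid extends well beyond the data on both sides.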

Parameters

bw_method: int or float: The method used to calculate the estimator bandwidth. See KernelDensity in PySpark for more information.
column: str or list of str, optional: Column name or list of names to be used for creating the kde plot. If None (default), all numeric columns will be used. If no numeric columns exist, behavior may depend on the plot backend.
ind: list of float, NumPy array or integer, optional: Evaluation points for the estimated PDF. If None (default), 1000 equally spaced points are used. If ind is a list or NumPy array, the KDE is evaluated at the points passed. If ind is an integer, that many equally spaced points are used.
**kwargs: optional: Additional keyword arguments.
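The three forms of ind described above can be sketched as follows. The helper `resolve_ind` is hypothetical and is not part of PySpark. The choice to pad the grid by half the data range on each side is an assumption borrowed from how pandas builds its default grid, and PySpark's actual range may differ.

```python
import numbers


def resolve_ind(ind, data_min, data_max, default_points=1000):
    """Turn `ind` into an explicit list of evaluation points.

    Hypothetical helper for illustration only.
    """
    if ind is not None and not isinstance(ind, numbers.Integral):
        # A list or array: evaluate exactly at the points passed.
        return [float(x) for x in ind]

    # None falls back to the documented default of 1000 points;
    # an integer is used as the number of points.
    count = default_points if ind is None else int(ind)

    # Assumption: extend the grid by half the data range on each side.
    pad = 0.5 * (data_max - data_min)
    start, stop = data_min - pad, data_max + pad
    if count == 1:
        return [start]
    spacing = (stop - start) / (count - 1)
    return [start + i * spacing for i in range(count)]


default_grid = resolve_ind(None, 4.9, 7.0)         # 1000 equally spaced points
coarse_grid = resolve_ind(100, 4.9, 7.0)           # 100 equally spaced points
explicit = resolve_ind([5.0, 6.0, 7.0], 4.9, 7.0)  # the points as given
```

A smaller integer for ind trades curve smoothness for less computation, while an explicit list pins the evaluation to the values of interest.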

Returns

plotly.graph_objs.Figure

Examples

>>> from pyspark.sql import SparkSession
>>> spark = SparkSession.builder.getOrCreate()
>>> data = [(5.1, 3.5, 0), (4.9, 3.0, 0), (7.0, 3.2, 1), (6.4, 3.2, 1), (5.9, 3.0, 2)]
>>> columns = ["length", "width", "species"]
>>> df = spark.createDataFrame(data, columns)
>>> df.plot.kde(bw_method=0.3, ind=100)