pyspark.sql.plot.core.PySparkPlotAccessor.kde
- PySparkPlotAccessor.kde(bw_method, column=None, ind=None, **kwargs)
Generate Kernel Density Estimate plot using Gaussian kernels.
In statistics, kernel density estimation (KDE) is a non-parametric way to estimate the probability density function (PDF) of a random variable. This function uses Gaussian kernels and includes automatic bandwidth determination.
- Parameters
- bw_method: int or float
The method used to calculate the estimator bandwidth. See KernelDensity in PySpark for more information.
- column: str or list of str, optional
Column name or list of names to be used for creating the kde plot. If None (default), all numeric columns will be used. If no numeric columns exist, behavior may depend on the plot backend.
- ind: list of float, NumPy array or integer, optional
Evaluation points for the estimated PDF. If None (default), 1000 equally spaced points are used. If ind is a list of floats or a NumPy array, the KDE is evaluated at the points passed. If ind is an integer, that number of equally spaced points is used.
- **kwargs: optional
Additional keyword arguments.
- Returns
plotly.graph_objs.Figure
Examples
>>> from pyspark.sql import SparkSession
>>> spark = SparkSession.builder.getOrCreate()
>>> data = [(5.1, 3.5, 0), (4.9, 3.0, 0), (7.0, 3.2, 1), (6.4, 3.2, 1), (5.9, 3.0, 2)]
>>> columns = ["length", "width", "species"]
>>> df = spark.createDataFrame(data, columns)
>>> df.plot.kde(bw_method=0.3, ind=100)
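To make the roles of bw_method and ind concrete, the following is a minimal pure-Python sketch of a Gaussian KDE evaluated at a set of points. It is not PySpark's implementation. The helper names evaluation_points and gaussian_kde are hypothetical, and two choices are assumptions made for illustration: the bandwidth is used directly as the standard deviation of each kernel, and the equally spaced points cover the data range padded by half its width on each side.

```python
import math
from typing import List, Optional, Sequence, Union


def evaluation_points(
    values: Sequence[float],
    ind: Optional[Union[int, Sequence[float]]] = None,
) -> List[float]:
    """Resolve `ind` into a concrete list of evaluation points (hypothetical helper)."""
    if ind is not None and not isinstance(ind, int):
        # Explicit points were passed: evaluate the KDE exactly there.
        return [float(x) for x in ind]
    count = 1000 if ind is None else ind  # default of 1000 points, as documented
    lo, hi = min(values), max(values)
    pad = 0.5 * (hi - lo)  # assumption: pad the range by half its width per side
    start, stop = lo - pad, hi + pad
    if count == 1:
        return [start]
    step = (stop - start) / (count - 1)
    return [start + i * step for i in range(count)]


def gaussian_kde(
    values: Sequence[float], bandwidth: float, points: Sequence[float]
) -> List[float]:
    """Average of Gaussian kernels centred on each value (hypothetical helper).

    Assumption: `bandwidth` is the standard deviation of every kernel.
    """
    norm = 1.0 / (bandwidth * math.sqrt(2.0 * math.pi))
    n = len(values)
    return [
        sum(norm * math.exp(-0.5 * ((x - v) / bandwidth) ** 2) for v in values) / n
        for x in points
    ]


# The "length" column from the example above, with the same bandwidth and ind.
lengths = [5.1, 4.9, 7.0, 6.4, 5.9]
points = evaluation_points(lengths, ind=100)
density = gaussian_kde(lengths, bandwidth=0.3, points=points)
```

Under these assumptions, a smaller bandwidth produces a spikier curve around each observation, while a larger one smooths the observations into a single broad bump. Passing an integer for ind only changes how finely the curve is sampled, not its shape.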