Example usage

[1]:

import skdim
import numpy as np
import pandas as pd
import plotly.graph_objects as go
from plotly.subplots import make_subplots

Generating data

Generated datasets are NumPy arrays of shape (n_points, n_dim). Here we generate two datasets that contain clusters with different intrinsic dimensions (ID).

[2]:

#generate data
data1, clusters = skdim.datasets.lineDiskBall(n = 2000, random_state = 0)
data2 = skdim.datasets.swissRoll3Sph(n_swiss=4000,n_sphere=2000, h=2, random_state = 0)

#plot
fig = make_subplots(rows=1, cols=2,specs=[[{'type': 'Scatter3d'}]*2])

trace1=go.Scatter3d(dict(zip(['x','y','z'],data1.T[:3])),
        mode='markers',marker=dict(size=1.5,colorbar=dict()))
trace2=go.Scatter3d(dict(zip(['x','y','z'],data2.T[:3])),
        mode='markers',marker=dict(size=1.5,colorbar=dict()))

fig.add_traces([trace1,trace2],rows=1,cols=[1,2])
fig.layout.update(height=450, width=800)
fig.show(renderer="notebook")
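To make the (n_points, n_dim) layout concrete, here is a minimal NumPy-only sketch that stacks a line, a disk and a ball into one array together with the ID of each part. The helper name `toy_line_disk_ball` and its parameters are hypothetical and are not part of skdim; the sketch only mimics the idea of a dataset whose regions have different IDs.

```python
import numpy as np

def toy_line_disk_ball(n_per_part=500, seed=0):
    """Hypothetical helper: stack a 1D line, a 2D disk and a 3D ball in R^3."""
    rng = np.random.default_rng(seed)

    # 1D segment along the z axis
    line = np.zeros((n_per_part, 3))
    line[:, 2] = rng.uniform(-1.0, 1.0, n_per_part)

    # 2D unit disk in the plane z = 2 (sqrt keeps the density uniform in area)
    r = np.sqrt(rng.uniform(0.0, 1.0, n_per_part))
    theta = rng.uniform(0.0, 2.0 * np.pi, n_per_part)
    disk = np.column_stack(
        [r * np.cos(theta), r * np.sin(theta), np.full(n_per_part, 2.0)]
    )

    # 3D unit ball centred at z = 4: random direction times radius u^(1/3)
    direction = rng.normal(size=(n_per_part, 3))
    direction /= np.linalg.norm(direction, axis=1, keepdims=True)
    radius = rng.uniform(0.0, 1.0, (n_per_part, 1)) ** (1.0 / 3.0)
    ball = direction * radius + np.array([0.0, 0.0, 4.0])

    data = np.vstack([line, disk, ball])        # shape (n_points, n_dim)
    true_id = np.repeat([1, 2, 3], n_per_part)  # ID of the part each row came from
    return data, true_id

data, true_id = toy_line_disk_ball()
print(data.shape, np.unique(true_id))
```

Each row is one point and each column one ambient coordinate, which is the same layout the estimators below expect.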

Estimating global ID

Estimators can compute global ID estimates (for the entire dataset).

This can be done with any of these calls:

- .fit(data).dimension_
- .fit(data).transform()
- .fit_transform(data)

[3]:

pca=skdim.id.lPCA()

#global ID
gid1=pca.fit(data1).dimension_
gid2=pca.fit(data2).dimension_
print(gid1,gid2)

2 3
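For intuition about what a PCA-based global estimate measures, the following NumPy-only sketch counts the covariance eigenvalues that exceed a fraction `alpha` of the largest one. The function name `pca_global_id` and the threshold value 0.05 are assumptions made for illustration; this is a simplified rule and is not claimed to reproduce what skdim.id.lPCA does internally.

```python
import numpy as np

def pca_global_id(X, alpha=0.05):
    """Count covariance eigenvalues larger than alpha times the largest one."""
    eigvals = np.linalg.eigvalsh(np.cov(X, rowvar=False))  # ascending order
    return int(np.sum(eigvals > alpha * eigvals[-1]))

rng = np.random.default_rng(0)

# 2D Gaussian cloud embedded in 5D through a map with orthonormal columns
q, _ = np.linalg.qr(rng.normal(size=(5, 2)))
plane_in_5d = rng.normal(size=(1000, 2)) @ q.T

estimated_id = pca_global_id(plane_in_5d)
print(estimated_id)
```

Only two directions carry variance here, so the count recovers the dimension of the embedded plane rather than the ambient dimension.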

Estimating local ID

Estimators can also compute local (pointwise) ID estimates from the k nearest neighbors of each point, which helps reveal whether a dataset has regions with different IDs.

To compute local ID, call either:

- .fit_pw(data).dimension_pw_
- .fit_pw(data).transform_pw()
- .fit_transform_pw(data)

Several estimators also naturally compute the distribution of local ID to derive the global ID estimates.

These estimators do not have a .fit_pw() method, since .fit() already computes local estimates. You can simply call:

- .fit(data).dimension_pw_
- .fit(data).transform_pw()
- .fit_transform_pw(data)

[4]:

#local ID (pointwise estimates)
lid1=pca.fit_pw(data1,n_neighbors=25).dimension_pw_
lid2=pca.fit_pw(data2,n_neighbors=25).dimension_pw_

[5]:

fig.update_traces({'text':lid1,'marker.color':lid1,'marker.colorbar':dict(thickness=5,x=.42)},col=1)
fig.update_traces({'text':lid2,'marker.color':lid2,'marker.colorbar':dict(thickness=5,x=.98)},col=2)
fig.show(renderer="notebook")

Benchmarking estimators

Estimators are commonly benchmarked using various synthetic manifolds with known ID.

Using datasets.BenchmarkManifolds, one can generate a set of standard manifolds commonly used in the research literature.

[6]:

benchmark = skdim.datasets.BenchmarkManifolds(random_state=0)
#dictionary with all datasets
dict_data = benchmark.generate()
#ground truth dataframe
truth = benchmark.truth
#generate a dataset with custom parameters
M1_sphere_custom = benchmark.generate(name="M1_Sphere",n=2500,dim=10,d=5)

M1_sphere = dict_data['M1_Sphere'] #np.array (n x dim)
truth

[6]:

	Intrinsic Dimension	Number of variables	Description
M1_Sphere	10	11	10D sphere linearly embedded
M2_Affine_3to5	3	5	Affine space
M3_Nonlinear_4to6	4	6	Concentrated figure, mistakable with a 3D one
M4_Nonlinear	4	8	Nonlinear manifold
M5a_Helix1d	1	3	1D helix
M5b_Helix2d	2	3	2D helix
M6_Nonlinear	6	36	Nonlinear manifold
M7_Roll	2	3	Swiss Roll
M8_Nonlinear	12	72	Nonlinear (highly curved) manifold
M9_Affine	20	20	Affine space
M10a_Cubic	10	11	10D hypercube
M10b_Cubic	17	18	17D hypercube
M10c_Cubic	24	25	24D hypercube
M10d_Cubic	70	71	70D hypercube
M11_Moebius	2	3	Möebius band 10-times twisted
M12_Norm	20	20	Isotropic multivariate Gaussian
M13a_Scurve	2	3	2D S-curve
M13b_Spiral	1	13	1D helix curve
Mbeta	10	40	Manifold generated with a smooth nonuniform pd...
Mn1_Nonlinear	18	72	Nonlinearly embedded manifold of high ID (see ...
Mn2_Nonlinear	24	96	Nonlinearly embedded manifold of high ID (see ...
Mp1_Paraboloid	3	12	3D paraboloid, nonlinearly embedded in (3(3+1)...
Mp2_Paraboloid	6	21	6D paraboloid, nonlinearly embedded in (3*(6+1...
Mp3_Paraboloid	9	30	9D paraboloid, nonlinearly embedded in (3*(9+1...
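A benchmark run then amounts to comparing each estimate with the known ID. The sketch below does this on two small manifolds built with NumPy alone; the dataset names, the `pca_id` helper and its threshold are hypothetical and stand in for the skdim datasets and estimators above. It also illustrates a known limitation: a linear, PCA-style estimate tends to overestimate the ID of a curved manifold such as a helix.

```python
import numpy as np

def pca_id(X, alpha=0.05):
    """Count covariance eigenvalues larger than alpha times the largest one."""
    eigvals = np.linalg.eigvalsh(np.cov(X, rowvar=False))
    return int(np.sum(eigvals > alpha * eigvals[-1]))

rng = np.random.default_rng(0)

# 3D Gaussian cloud embedded linearly in 5D (true ID 3)
q, _ = np.linalg.qr(rng.normal(size=(5, 3)))
affine = rng.normal(size=(2000, 3)) @ q.T

# 1D helix in 3D (true ID 1)
t = np.linspace(0.0, 4.0 * np.pi, 2000)
helix = np.column_stack([np.cos(t), np.sin(t), 0.1 * t])

# hypothetical mini-benchmark: name -> (data, true ID)
manifolds = {"affine_3to5": (affine, 3), "helix_1d": (helix, 1)}

results = {}
for name, (X, true_id) in manifolds.items():
    est = pca_id(X)
    rel_err = abs(est - true_id) / true_id
    results[name] = (est, rel_err)
    print(f"{name}: estimate={est}, truth={true_id}, relative error={rel_err:.2f}")
```

With the real benchmark, the same loop could iterate over dict_data and look up the ground truth in the truth dataframe.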