Example usage

[1]:
import skdim
import numpy as np
import pandas as pd
import plotly.graph_objects as go
from plotly.subplots import make_subplots

Generating data

Generated datasets are NumPy arrays of shape (n_points, n_dim). Here we generate two datasets that contain clusters with different intrinsic dimensions (ID).

[2]:
#generate data
data1, clusters = skdim.datasets.lineDiskBall(n = 2000, random_state = 0)
data2 = skdim.datasets.swissRoll3Sph(n_swiss=4000,n_sphere=2000, h=2, random_state = 0)

#plot
fig = make_subplots(rows=1, cols=2,specs=[[{'type': 'Scatter3d'}]*2])

trace1=go.Scatter3d(dict(zip(['x','y','z'],data1.T[:3])),
        mode='markers',marker=dict(size=1.5,colorbar=dict()))
trace2=go.Scatter3d(dict(zip(['x','y','z'],data2.T[:3])),
        mode='markers',marker=dict(size=1.5,colorbar=dict()))

fig.add_traces([trace1,trace2],rows=1,cols=[1,2])
fig.layout.update(height=450, width=800)
fig.show(renderer="notebook")

Estimating global ID

Estimators can compute global ID estimates (for the entire dataset).

This can be done with any of these calls:
- .fit(data).dimension_
- .fit(data).transform()
- .fit_transform(data)

[3]:
pca=skdim.id.lPCA()

#global ID
gid1=pca.fit(data1).dimension_
gid2=pca.fit(data2).dimension_
print(gid1,gid2)
2 3
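To give some intuition for what a PCA-based global estimate measures, here is a minimal NumPy sketch. It is not skdim's implementation: the function name `pca_global_id` and the threshold `alpha` are assumptions made for illustration. It counts the principal components whose variance exceeds a fraction of the largest one.

```python
import numpy as np

def pca_global_id(X, alpha=0.05):
    """Count principal components whose variance exceeds alpha * the largest variance.

    Hypothetical helper for illustration only; alpha is an assumed threshold.
    """
    Xc = X - X.mean(axis=0)
    # Squared singular values are proportional to the variance along each principal axis
    var = np.linalg.svd(Xc, compute_uv=False) ** 2
    return int(np.sum(var > alpha * var[0]))

rng = np.random.default_rng(0)

# 2D Gaussian cloud embedded in 5D through an orthonormal basis (1000 points x 5 coordinates)
basis, _ = np.linalg.qr(rng.normal(size=(5, 2)))
plane = rng.normal(size=(1000, 2)) @ basis.T

# Full-rank 3D Gaussian cloud
ball = rng.normal(size=(1000, 3))

print(pca_global_id(plane), pca_global_id(ball))
```

A single number per dataset is returned, which mirrors the role of the global estimate above: it summarizes the whole point cloud and cannot tell apart regions of different dimension.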

Estimating local ID

Estimators can also compute local ID estimates from the k nearest neighbors (k-NN) of each point, which helps reveal whether a dataset has regions with different IDs.

To compute local ID, call either:
- .fit_pw(data).dimension_pw_
- .fit_pw(data).transform_pw()
- .fit_transform_pw(data)

Several estimators also naturally compute the distribution of local ID to derive the global ID estimates.

These estimators do not have a .fit_pw() method, since .fit() already computes local estimates, and you can simply call:
- .fit(data).dimension_pw_
- .fit(data).transform_pw()
- .fit_transform_pw(data)

[4]:
#local ID (pointwise estimates)
lid1=pca.fit_pw(data1,n_neighbors=25).dimension_pw_
lid2=pca.fit_pw(data2,n_neighbors=25).dimension_pw_

[5]:
fig.update_traces({'text':lid1,'marker.color':lid1,'marker.colorbar':dict(thickness=5,x=.42)},col=1)
fig.update_traces({'text':lid2,'marker.color':lid2,'marker.colorbar':dict(thickness=5,x=.98)},col=2)
fig.show(renderer="notebook")
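The idea of pointwise estimation from k-NN neighborhoods can be sketched with NumPy alone. This is an illustrative sketch under stated assumptions, not skdim's code: `pca_id`, `pointwise_id`, the threshold `alpha`, and the brute-force neighbor search are all hypothetical choices. Each point gets the PCA-based estimate of its own neighborhood.

```python
import numpy as np

def pca_id(X, alpha=0.05):
    """Hypothetical PCA estimate: components with variance above alpha * the largest."""
    Xc = X - X.mean(axis=0)
    var = np.linalg.svd(Xc, compute_uv=False) ** 2
    return int(np.sum(var > alpha * var[0]))

def pointwise_id(X, n_neighbors=25):
    """Apply pca_id to the n_neighbors closest points of every point (itself included)."""
    # Brute-force pairwise squared distances; acceptable for small datasets only
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    idx = np.argsort(d2, axis=1)[:, :n_neighbors]
    return np.array([pca_id(X[nb]) for nb in idx])

rng = np.random.default_rng(0)

# A 1D segment along the x axis and a 2D square patch at z = 5, both living in 3D
line = np.zeros((300, 3))
line[:, 0] = rng.uniform(0, 1, 300)
square = np.zeros((300, 3))
square[:, :2] = rng.uniform(0, 1, (300, 2))
square[:, 2] = 5.0

X = np.vstack([line, square])
lid = pointwise_id(X)

# Fraction of points per cluster that receive the expected local ID
print((lid[:300] == 1).mean(), (lid[300:] == 2).mean())
```

Because the two clusters are far apart, each neighborhood stays inside one cluster, so the local estimates separate the 1D and 2D regions even though a single global estimate would not.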

Benchmarking estimators

Estimators are commonly benchmarked using various synthetic manifolds with known ID.

Using skdim.datasets.BenchmarkManifolds, one can generate a set of standard manifolds commonly used in the ID estimation literature.

[6]:
benchmark = skdim.datasets.BenchmarkManifolds(random_state=0)
#dictionary with all datasets
dict_data = benchmark.generate()
#ground truth dataframe
truth = benchmark.truth
#generate a dataset with custom parameters
M1_sphere_custom = benchmark.generate(name="M1_Sphere",n=2500,dim=10,d=5)

M1_sphere = dict_data['M1_Sphere'] #np.array (n x dim)
truth
[6]:
                   Intrinsic Dimension  Number of variables  Description
M1_Sphere          10                   11                   10D sphere linearly embedded
M2_Affine_3to5     3                    5                    Affine space
M3_Nonlinear_4to6  4                    6                    Concentrated figure, mistakable with a 3D one
M4_Nonlinear       4                    8                    Nonlinear manifold
M5a_Helix1d        1                    3                    1D helix
M5b_Helix2d        2                    3                    2D helix
M6_Nonlinear       6                    36                   Nonlinear manifold
M7_Roll            2                    3                    Swiss Roll
M8_Nonlinear       12                   72                   Nonlinear (highly curved) manifold
M9_Affine          20                   20                   Affine space
M10a_Cubic         10                   11                   10D hypercube
M10b_Cubic         17                   18                   17D hypercube
M10c_Cubic         24                   25                   24D hypercube
M10d_Cubic         70                   71                   70D hypercube
M11_Moebius        2                    3                    Möebius band 10-times twisted
M12_Norm           20                   20                   Isotropic multivariate Gaussian
M13a_Scurve        2                    3                    2D S-curve
M13b_Spiral        1                    13                   1D helix curve
Mbeta              10                   40                   Manifold generated with a smooth nonuniform pd...
Mn1_Nonlinear      18                   72                   Nonlinearly embedded manifold of high ID (see ...
Mn2_Nonlinear      24                   96                   Nonlinearly embedded manifold of high ID (see ...
Mp1_Paraboloid     3                    12                   3D paraboloid, nonlinearly embedded in (3(3+1)...
Mp2_Paraboloid     6                    21                   6D paraboloid, nonlinearly embedded in (3*(6+1...
Mp3_Paraboloid     9                    30                   9D paraboloid, nonlinearly embedded in (3*(9+1...
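Benchmarking against known ID amounts to running an estimator on each manifold and comparing the result with the ground truth. The sketch below shows that loop on a small hypothetical toy benchmark built with NumPy. The dataset names, the `affine` helper, and the `pca_global_id` estimator are assumptions for illustration, not part of skdim.

```python
import numpy as np

def pca_global_id(X, alpha=0.05):
    """Hypothetical PCA estimate: components with variance above alpha * the largest."""
    Xc = X - X.mean(axis=0)
    var = np.linalg.svd(Xc, compute_uv=False) ** 2
    return int(np.sum(var > alpha * var[0]))

def affine(n, d, dim, rng):
    """d-dimensional Gaussian cloud embedded in dim dimensions by an orthonormal map."""
    basis, _ = np.linalg.qr(rng.normal(size=(dim, d)))
    return rng.normal(size=(n, d)) @ basis.T

rng = np.random.default_rng(0)
t = np.linspace(0, 4 * np.pi, 1000)

# Toy benchmark (hypothetical names): name -> (data, true ID)
toy_benchmark = {
    "affine_3_in_5": (affine(1000, 3, 5, rng), 3),
    "affine_5_in_10": (affine(1000, 5, 10, rng), 5),
    "helix_1_in_3": (np.column_stack([np.cos(t), np.sin(t), 0.1 * t]), 1),
}

# Compare each estimate with the known ID
results = {}
for name, (X, true_id) in toy_benchmark.items():
    est = pca_global_id(X)
    results[name] = (true_id, est, abs(est - true_id))
    print(f"{name:15s} true={true_id} estimated={est} abs_error={abs(est - true_id)}")
```

On this toy set the linear estimator recovers the affine cases but overestimates the helix, because a curved 1D manifold spreads variance over all three axes. This is the kind of failure that benchmark manifolds with nonlinear embeddings are meant to expose.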