Graph classification
Some dataset comes with intrinsic values that can be used as filtrations, e.g., graphs, medical images, molecules. Multiparameter persistence is then very well suited to deal will all of this information at once. To bring this to light, we will consider the BZR graph dataset. It can be found here.
import multipers.data.graphs as mdg
import multipers.ml.signed_measures as mms
import networkx as nx
from random import choice
from os.path import expanduser
import numpy as np
np.random.seed(1)
dataset = "graphs/MUTAG"
path = mdg.DATASET_PATH+dataset
!ls $path ## We assume that the dataset is in this folder. You can modify the variable `mdg.DATASET_PATH` if necessary
graphs.pkl MUTAG.csv Mutag.graph_labels Mutag.node_labels
labels.pkl Mutag.edges MUTAG.hdf5 Mutag.readme
mat Mutag.graph_idx Mutag.link_labels
graphs, labels = mdg.get_graphs(dataset)
nx.draw(choice(graphs))
Graph dataset can be filtered by several filtration : node degrees, intrinsic values, ricci curvature, closeness centrality, heat kernel signature, etc.
# mdg.reset_graphs(dataset=dataset) # Enforce graph to be read from file
filtrations = ["hks_10","degree","hks_1", "cc"]
for f in filtrations:
mdg.compute_filtration(dataset, filtration=f)
graphs, labels = mdg.get_graphs(dataset) # Retrieves these filtrations
g = graphs[0] # First graph of the dataset
g.nodes[0] # First node of the dataset, which holds several filtrations
Computing hks_10: 100%|████████████████████████████████████████████████| 188/188 [00:00<00:00, 837.93it/s]
Computing degree: 100%|██████████████████████████████████████████████| 188/188 [00:00<00:00, 33321.89it/s]
Computing hks_1: 100%|█████████████████████████████████████████████████| 188/188 [00:00<00:00, 833.57it/s]
Computing cc: 100%|███████████████████████████████████████████████████| 188/188 [00:00<00:00, 5421.64it/s]
{'cc': 0.2553191489361702,
'degree': 0.5,
'ricciCurvature': -0.08333333333333326,
'fiedler': np.float64(0.002679210943613063),
'hks_10': np.float64(0.06963539025357962),
'geodesic': 0,
'hks_1': np.float64(0.4486299521593605)}
Similarly to the point clouds, we can create simplextrees, and turn them into signed measures
simplextrees = mdg.Graph2SimplexTrees(filtrations=filtrations).fit_transform(graphs)
signed_measures = mms.FilteredComplex2SignedMeasure(
degrees=[None], n_jobs=1, grid_strategy='exact', enforce_null_mass=True
).fit_transform(simplextrees) # None correspond to the euler characteristic, which is significantly faster to compute on graphs.
# One may want to rescale filtrations w.r.t. each other. This can be done using the SignedMeasureFormatter class
signed_measures = mms.SignedMeasureFormatter(normalize=True, axis=0).fit_transform(signed_measures)
And finally classify these graphs using either a sliced wasserstein kernel, or a convolution.
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from multipers.ml.kernels import DistanceMatrix2Kernel
## Split the data into train test
xtrain,xtest,ytrain,ytest = train_test_split(signed_measures, labels)
## Classification pipeline using the sliced wasserstein kernel
classifier = Pipeline([
("SWD",mms.SignedMeasure2SlicedWassersteinDistance(n_jobs=-1, num_directions=100)),
("KERNEL", DistanceMatrix2Kernel(sigma=.1)),
("SVM", SVC(kernel="precomputed")),
])
## Evaluates the classifier on this dataset.
# Note that there is no cross validation here, so results can be significantly improved
classifier.fit(xtrain,ytrain).score(xtest,ytest)
0.723404255319149