Graph classification

Some dataset comes with intrinsic values that can be used as filtrations, e.g., graphs, medical images, molecules. Multiparameter persistence is then very well suited to deal will all of this information at once. To bring this to light, we will consider the BZR graph dataset. It can be found here.

import multipers.data.graphs as mdg
import multipers.ml.signed_measures as mms
import networkx as nx
from random import choice
from os.path import expanduser
import numpy as np
np.random.seed(1)
dataset = "graphs/MUTAG"
path = mdg.DATASET_PATH+dataset
!ls $path ## We assume that the dataset is in this folder. You can modify the variable `mdg.DATASET_PATH` if necessary
graphs.pkl  MUTAG.csv	     Mutag.graph_labels  Mutag.node_labels
labels.pkl  Mutag.edges      MUTAG.hdf5		 Mutag.readme
mat	    Mutag.graph_idx  Mutag.link_labels
graphs, labels = mdg.get_graphs(dataset)
nx.draw(choice(graphs))
../_images/0bf9033cbb78a6675984b698ecdd64283c5d6c70e3a49190571d690939f4a919.png

Graph dataset can be filtered by several filtration : node degrees, intrinsic values, ricci curvature, closeness centrality, heat kernel signature, etc.

# mdg.reset_graphs(dataset=dataset) # Enforce graph to be read from file
filtrations = ["hks_10","degree","hks_1", "cc"]
for f in filtrations:
    mdg.compute_filtration(dataset, filtration=f)
graphs, labels = mdg.get_graphs(dataset) # Retrieves these filtrations
g = graphs[0] # First graph of the dataset
g.nodes[0] # First node of the dataset, which holds several filtrations
Computing hks_10: 100%|████████████████████████████████████████████████| 188/188 [00:00<00:00, 837.93it/s]
Computing degree: 100%|██████████████████████████████████████████████| 188/188 [00:00<00:00, 33321.89it/s]
Computing hks_1: 100%|█████████████████████████████████████████████████| 188/188 [00:00<00:00, 833.57it/s]
Computing cc: 100%|███████████████████████████████████████████████████| 188/188 [00:00<00:00, 5421.64it/s]
{'cc': 0.2553191489361702,
 'degree': 0.5,
 'ricciCurvature': -0.08333333333333326,
 'fiedler': np.float64(0.002679210943613063),
 'hks_10': np.float64(0.06963539025357962),
 'geodesic': 0,
 'hks_1': np.float64(0.4486299521593605)}

Similarly to the point clouds, we can create simplextrees, and turn them into signed measures

simplextrees = mdg.Graph2SimplexTrees(filtrations=filtrations).fit_transform(graphs)
signed_measures = mms.FilteredComplex2SignedMeasure(
	degrees=[None], n_jobs=1, grid_strategy='exact', enforce_null_mass=True
).fit_transform(simplextrees) # None correspond to the euler characteristic, which is significantly faster to compute on graphs. 
# One may want to rescale filtrations w.r.t. each other. This can be done using the SignedMeasureFormatter class
signed_measures = mms.SignedMeasureFormatter(normalize=True, axis=0).fit_transform(signed_measures)

And finally classify these graphs using either a sliced wasserstein kernel, or a convolution.

from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from multipers.ml.kernels import DistanceMatrix2Kernel
## Split the data into train test
xtrain,xtest,ytrain,ytest = train_test_split(signed_measures, labels)
## Classification pipeline using the sliced wasserstein kernel
classifier = Pipeline([
    ("SWD",mms.SignedMeasure2SlicedWassersteinDistance(n_jobs=-1, num_directions=100)),
    ("KERNEL", DistanceMatrix2Kernel(sigma=.1)),
    ("SVM", SVC(kernel="precomputed")),
])
## Evaluates the classifier on this dataset. 
# Note that there is no cross validation here, so results can be significantly improved
classifier.fit(xtrain,ytrain).score(xtest,ytest)
0.723404255319149