Tutorial ======== This tutorial provides a step-by-step guide to using NetworkSampler for space-filling sampling on networks and semi-supervised node classification via label propagation. Package information ------------------- Start by checking the installed version and dependencies: .. code-block:: python import networksampler networksampler.info() Generating a test network ------------------------- NetworkSampler includes a utility to generate random networks for testing. For a more realistic scenario, we use NetworkX to generate a network with two communities: .. code-block:: python import networkx as nx import numpy as np # Generate a network with 1000 nodes and two communities G = nx.gaussian_random_partition_graph(1000, 500, 100, 0.20, 0.1) A = nx.adjacency_matrix(G).todense() print(f"Nodes: {A.shape[0]}") print(f"Edges: {int(np.sum(A) / 2)}") Computing the shortest path matrix ----------------------------------- The space-filling algorithm operates on geodesic distances between nodes. You can precompute this matrix once and reuse it across multiple sampling runs: .. code-block:: python D = networksampler.compute_shortest_length_paths_matrix(A) print(f"Matrix shape: {D.shape}") print(f"Max distance: {D.max():.0f}") print(f"Mean distance: {D.mean():.2f}") Space-filling sampling ---------------------- The core function ``sa_sampling`` selects a set of seed nodes that are well spread across the network, minimizing the coverage criterion via simulated annealing: .. code-block:: python # Sample 30 nodes (3% of 1000) nodes, cost = networksampler.sa_sampling(A, 30, p=-4, q=4, r=0.1, D=D) print(f"Sampled nodes: {nodes}") print(f"Coverage cost: {cost:.4f}") The parameters ``p`` and ``q`` control the coverage criterion. The paper recommends ``p=-4`` and ``q=4`` as they generally achieve the best accuracy. The parameter ``r`` is the cooling rate for simulated annealing. If you have a precomputed shortest path matrix, pass it via the ``D`` parameter to avoid recomputing it. For reproducible results, use the ``random_seed`` parameter: .. code-block:: python nodes, cost = networksampler.sa_sampling( A, 30, p=-4, q=4, r=0.1, D=D, random_seed=42 ) Comparing with other sampling strategies ----------------------------------------- NetworkSampler also provides centrality-based random sampling through ``node_random_sample``. You can use this to compare the space-filling design against other strategies: .. code-block:: python # Random sample proportional to degree centrality nodes_degree = networksampler.node_random_sample(A, 30, measure="degree") # Random sample proportional to betweenness centrality nodes_betw = networksampler.node_random_sample(A, 30, measure="betweenness") # Random sample proportional to closeness centrality nodes_close = networksampler.node_random_sample(A, 30, measure="closeness") Available centrality measures are: ``degree``, ``closeness``, ``betweenness``, and ``eigenvector``. Two-phase sampling for large networks -------------------------------------- For networks larger than 5000 nodes, the full shortest path matrix becomes expensive to compute and store. The ``sa_sampling_twophases`` function addresses this with a two-phase procedure: 1. First, a subgraph is sampled using centrality-based node selection 2. Then, the space-filling design is applied on the subgraph .. code-block:: python nodes, cost = networksampler.sa_sampling_twophases( A, ns=10, p=-4, q=4, r=0.1, centrality_measure="betweenness", first_phase_sample_fraction=0.2 ) print(f"Sampled nodes: {nodes}") print(f"Coverage cost: {cost:.4f}") The ``first_phase_sample_fraction`` controls the fraction of nodes retained in the first phase (default: 0.1). The ``centrality_measure`` parameter determines how nodes are selected in the first phase. Computing node centrality measures ----------------------------------- The utility function ``nodes_centrality_measures`` computes multiple centrality metrics for all nodes in the network: .. code-block:: python from networksampler.utils import nodes_centrality_measures pagerank, degree, closeness, betweenness = nodes_centrality_measures(A) # Example: top 5 nodes by betweenness centrality top5 = sorted(betweenness, key=betweenness.get, reverse=True)[:5] print(f"Top 5 nodes by betweenness: {top5}") Label propagation example -------------------------- Here is a complete example combining NetworkSampler with label propagation: .. code-block:: python import networkx as nx import numpy as np import networksampler # 1. Generate a network with two communities G = nx.gaussian_random_partition_graph(500, 250, 50, 0.15, 0.05) A = nx.adjacency_matrix(G).todense() # 2. Assign ground truth labels based on community membership partition = G.graph['partition'] labels = np.zeros(500, dtype=int) for i, community in enumerate(partition): for node in community: labels[node] = i # 3. Select seeds using space-filling design seeds, cost = networksampler.sa_sampling(A, 15, p=-4, q=4, r=0.1) # 4. Label propagation using the graph Laplacian n = A.shape[0] W = np.array(A, dtype=float) D_diag = np.diag(W.sum(axis=1).A1 if hasattr(W.sum(axis=1), 'A1') else W.sum(axis=1)) L = D_diag - W # Partition into unlabeled (u) and seed (s) nodes u_mask = np.ones(n, dtype=bool) u_mask[seeds] = False u_idx = np.where(u_mask)[0] s_idx = seeds Luu = L[np.ix_(u_idx, u_idx)] Lus = L[np.ix_(u_idx, s_idx)] fs = labels[s_idx].astype(float) # Solve for unlabeled nodes fu = np.linalg.solve(Luu, -Lus @ fs) predicted = (fu >= 0.5).astype(int) # 5. Evaluate accuracy accuracy = np.mean(predicted == labels[u_idx]) print(f"Classification accuracy: {accuracy:.4f}")