Higher-order clustering in networks
Phys. Rev. E 97, 052306 – Published 18 May, 2018
DOI: https://doi.org/10.1103/PhysRevE.97.052306
Abstract
A fundamental property of complex networks is the tendency for edges to cluster. The extent of the clustering is typically quantified by the clustering coefficient, which is the probability that a length-2 path is closed, i.e., induces a triangle in the network. However, higher-order cliques beyond triangles are crucial to understanding complex networks, and the clustering behavior with respect to such higher-order network structures is not well understood. Here we introduce higher-order clustering coefficients that measure the closure probability of higher-order network cliques and provide a more comprehensive view of how the edges of complex networks cluster. Our higher-order clustering coefficients are a natural generalization of the traditional clustering coefficient. We derive several properties about higher-order clustering coefficients and analyze them under common random graph models. Finally, we use higher-order clustering coefficients to gain new insights into the structure of real-world networks from several domains.
Article Text
Networks are a fundamental tool for understanding and modeling complex physical, social, informational, and biological systems. Although such networks are typically sparse, a recurring trait of networks throughout all of these domains is the tendency of edges to appear in small clusters or cliques. In many cases, such clustering can be explained by local evolutionary processes. For example, in social networks, clusters appear due to the formation of triangles where two individuals who share a common friend are more likely to become friends themselves, a process known as triadic closure. Similar triadic closures occur in other networks: In citation networks, two references appearing in the same publication are more likely to be on the same topic and hence more likely to cite each other, and in coauthorship networks, scientists with a mutual collaborator are more likely to collaborate in the future. In other cases, local clustering arises from highly connected functional units operating within a larger system, e.g., metabolic networks are organized by densely connected modules.
The clustering coefficient quantifies the extent to which edges of a network cluster in terms of triangles. The clustering coefficient is defined as the fraction of length-2 paths, or wedges, that are closed with a triangle (Fig. , row
Figure: Overview of higher-order clustering coefficients as clique expansion probabilities.
The clustering coefficient is an important statistic for data modeling in network science, as well as a useful feature in machine-learning pipelines for, e.g., role discovery and anomaly detection. The statistic has also been identified as an important covariate in sociological studies.
However, the clustering coefficient is inherently restrictive as it measures the closure probability of just one simple structure: the triangle. Moreover, higher-order structures such as larger cliques are crucial to the structure and function of complex networks. For example, 4-cliques reveal community structure in word association and protein-protein interaction networks, and cliques of sizes 5–7 are more frequent than triangles in many real-world networks with respect to certain null models. However, the extent of clustering of such higher-order structures has neither been well understood nor quantified.
Here we provide a framework to quantify higher-order clustering in networks by measuring the normalized frequency at which higher-order cliques are closed, which we call higher-order clustering coefficients. We derive our higher-order clustering coefficients by extending a novel interpretation of the classical clustering coefficient as a form of clique expansion (Fig. ). We then derive several properties about higher-order clustering coefficients and analyze them under the
Using our theoretical analysis as a guide, we analyze the higher-order clustering behavior of real-world networks from a variety of domains. Conventional wisdom in network science posits that practically all real-world networks exhibit clustering; however, we find that the clustering property only holds up to a certain order. More specifically, once we control for the clustering as measured by the classical clustering coefficient, networks from some domains do not show significant higher-order clustering in terms of higher-order clique closure. Moreover, by examining how the clustering changes with the order of the clustering, we find that each domain of networks has its own higher-order clustering pattern. Since the traditional clustering coefficient only provides one measurement, it does not show such trends by itself. In addition to the theoretical properties and empirical findings exhibited in this paper, our related work also theoretically connects higher-order clustering and community detection.
In this section, we derive our higher-order clustering coefficients and some of their basic properties. We first present an alternative interpretation of the classical clustering coefficient and then show how this novel interpretation seamlessly generalizes to arrive at our definition of higher-order clustering coefficients. We then provide some probabilistic interpretations of higher-order clustering coefficients that will be useful for our subsequent analysis. Throughout this paper, we confine our discussion to homogeneous networks with only one type of node and leave the development of higher-order clustering on bipartite and multilayer networks for further work.
Here we give an alternative interpretation of the clustering coefficient that will later allow us to generalize it and quantify clustering of higher-order network structures (this interpretation is summarized in Fig. ). Our interpretation is based on a notion of clique expansion. First, we consider a 2-clique Formally, the classical global clustering coefficient is where We can also reinterpret the local clustering coefficient in this way. In this case, each wedge again consists of a 2-clique and adjacent edge (Fig. , row where where
Our alternative interpretation of the clustering coefficient, described above as a form of clique expansion, leads to a natural generalization to higher-order cliques. Instead of expanding 2-cliques to 3-cliques, we expand where We also define higher-order local clustering coefficients: where where
To understand how to compute higher-order clustering coefficients, we consider the following useful identity: where From Eq. , it is easy to see that we can compute all local For the global clustering coefficient, note that Thus, it suffices to enumerate
To facilitate understanding of higher-order clustering coefficients and to aid our analysis in Sec. , we present a few probabilistic interpretations of the quantities.
First, we can interpret The variant of this interpretation for the classical clustering case of For the next probabilistic interpretation, it is useful to analyze the structure of the 1-hop neighborhood graph Any where If we uniformly at random select an Moreover, if we condition on observing an In other words, the product of the higher-order local clustering coefficients of node
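As a rough illustration of the clique-expansion view in this section, the following Python sketch counts wedges by brute force. It assumes that a wedge of a given order is a pair consisting of a clique on that many nodes and an adjacent edge sharing exactly one node with it, and that the wedge is closed when the outside endpoint is adjacent to every node of the clique; under that reading, order 2 corresponds to the classical coefficient. The function names (`cliques_of_size`, `clustering`, `from_edges`) and the toy graph are hypothetical and are not the software released with the paper.

```python
from itertools import combinations


def from_edges(edges):
    """Build adjacency sets for an undirected simple graph."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def cliques_of_size(adj, size):
    """Enumerate all cliques on `size` nodes by brute force (small graphs only)."""
    return [
        nodes for nodes in combinations(sorted(adj), size)
        if all(v in adj[u] for u, v in combinations(nodes, 2))
    ]


def clustering(adj, order, center=None):
    """Fraction of wedges of the given order that are closed.

    A wedge is counted as a pair (clique, adjacent edge): a clique on
    `order` nodes plus an edge (u, w) with u inside the clique and w
    outside it. The wedge is closed when w is adjacent to every clique
    node. If `center` is given, only wedges whose shared node u equals
    `center` are counted (an assumed reading of the local coefficient).
    """
    wedges = closed = 0
    for clique in cliques_of_size(adj, order):
        for u in clique:
            if center is not None and u != center:
                continue
            for w in adj[u]:
                if w in clique:
                    continue
                wedges += 1
                closed += all(w in adj[x] for x in clique)
    return closed / wedges if wedges else 0.0


# Toy graph: triangle 0-1-2 with a pendant node 3 attached to node 0.
g = from_edges([(0, 1), (0, 2), (1, 2), (0, 3)])
print(clustering(g, 2))            # order 2, whole graph
print(clustering(g, 3))            # order 3: 3-clique expanded to 4-clique
print(clustering(g, 2, center=0))  # order 2, wedges centered at node 0
```

On this toy graph the order-2 value is 3/5 (one triangle among five length-2 paths), the order-3 value is 0 because the only triangle cannot expand into a 4-clique, and the value centered at node 0 is 1/3. Brute-force enumeration grows quickly with clique size, so the sketch is suitable only for very small graphs.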
We now provide some theoretical analysis of our higher-order clustering coefficients. We first give some extremal bounds on the values that higher-order clustering coefficients can take given the value of the traditional (second-order) clustering coefficient. After, we analyze the values of higher-order clustering coefficients in two common random graph models—the We first analyze the relationships between local higher-order clustering coefficients of different orders. Our technical result is Proposition , which provides essentially tight lower and upper bounds for higher-order local clustering coefficients in terms of the traditional local clustering coefficient. The main ideas of the proof are illustrated in Fig. . Example 1-hop neighborhoods of a node Clearly, To derive the upper bound, consider the 1-hop neighborhood denote the Combining this with Eq. gives where the last equality uses the fact that The upper bound becomes tight when and by Eq. , when By adjusting the ratio The second part of the result requires the neighborhoods to be sufficiently large in order to reach the upper bound. 
However, we will see later that in some real-world data, there are nodes Next, we analyze higher-order clustering coefficients in two common random graph models: the Erdős-Rényi model with edge probability Now we analyze higher-order clustering coefficients in classical Erdős-Rényi random graph model, where each edge exists independently with probability In the We prove the first part by conditioning on the set of As noted above, the second equality is well defined (with high probability) for small The proof of the second part is essentially the same, except we condition over the set of possible cases where Recall that The above results say that the global, local, and average Similarly to the proof of Proposition , we look at the conditional expectation over Now note that Now, for any small nonnegative integer (recall that Proposition says that even if the second-order local clustering coefficient is large, the We also study higher-order clustering in the small-world random graph model . The model begins with a ring network where each node connects to its With no rewiring ( Applying Eq. , it suffices to show that as which approaches Now we give a derivation of Eq. . We first label the If we ignore lower-order terms Proposition shows that, when
We now analyze the higher-order clustering of real-world networks. We first study how the higher-order global and average clustering coefficients vary as we increase the order
We compute and analyze the higher-order clustering for networks from a variety of domains (Table ). We briefly describe the collection of networks and their categorization below:
(1) Two synthetic networks: a random instance of an Erdős-Rényi graph with
(2) Four neural networks: the complete neural systems of the nematode worms Pristionchus pacificus and Caenorhabditis elegans as well as the neural connections of the Drosophila medulla and mouse retina;
(3) Four online social networks: two Facebook friendship networks between students at universities from 2005 (fb-Stanford, fb-Cornell) and two complete online friendship networks (Pokec and Orkut);
(4) Four collaboration networks: two coauthorship networks constructed from arXiv submission categories (arxiv-AstroPh and arxiv-HepPh), a coauthorship network constructed from DBLP, and the cocommittee membership network of United States congresspersons (congress-committees);
(5) Four human communication networks: two email networks (email-Enron-core, email-Eu-core), a Facebook-like messaging network from a college (CollegeMsg), and the edits of user talk pages by other users on Wikipedia (wiki-Talk); and
(6) Four technological systems networks: three autonomous systems (oregon2-010526, as-caida-20071105, as-skitter) and a peer-to-peer connection network (p2p-Gnutella31).
In all cases, we take the edges as undirected, even if the original network data are directed.
Table lists the Propositions and say that we should expect the higher-order global and average clustering coefficients to decrease as we increase the order The relationship between the higher-order global clustering coefficient Overall, the trends in the higher-order clustering coefficients can be different within one of our data-set categories but tend to be uniform within a particular domain: The change of
While the raw clustering values are informative, it is also useful to compare the clustering to what one expects from null models. We find in the next section that this reveals additional insights into our data. For one real-world network from each data-set category, we also measure the higher-order clustering coefficients with respect to two null models (Table ). First, we compare against the configuration model (CM) that samples uniformly from simple graphs with the same degree distribution. In real-world networks, Second, we use a null model that samples graphs preserving both degree distribution and Our finding about the lack of higher-order clustering in C. elegans agrees with previous results that 4-cliques are underexpressed, while open 3-wedges related to cooperative information propagation are overexpressed. This also lends credence to the “three-layer” model of C. elegans. The observed clustering in the friendship network is consistent with prior work showing the relative infrequency of open
We emphasize that simple clique counts are not sufficient to obtain these results. For example, the discrepancy in the third-order average clustering of C. elegans and the MRCN null model is not simply due to the presence of 4-cliques. The original neural network has nearly twice as many 4-cliques (2010) as the samples from the MRCN model (mean 1006.2, standard deviation 73.6), but the third-order clustering coefficient is larger in MRCN. The reason is that clustering coefficients normalize clique counts with respect to opportunities for closure.
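The point about normalization can be illustrated on a toy pair of graphs. The sketch below is not based on the C. elegans or MRCN data; it uses two hypothetical graphs and a global third-order coefficient (not the average one quoted above), assuming a 3-wedge is a triangle plus an adjacent edge and that it is closed when the four nodes form a 4-clique. Graph A is a single 4-clique, and graph B is two 4-cliques joined by one extra edge.

```python
from itertools import combinations


def build(edges):
    """Build adjacency sets for an undirected simple graph."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def cliques(adj, size):
    """All cliques on `size` nodes, by brute force (small graphs only)."""
    return [
        c for c in combinations(sorted(adj), size)
        if all(v in adj[u] for u, v in combinations(c, 2))
    ]


def third_order_clustering(adj):
    """Closed 3-wedges divided by all 3-wedges (triangle plus adjacent edge)."""
    wedges = closed = 0
    for tri in cliques(adj, 3):
        for u in tri:
            for w in adj[u] - set(tri):
                wedges += 1
                closed += all(w in adj[x] for x in tri)
    return closed / wedges if wedges else 0.0


k4 = list(combinations(range(4), 2))           # edges of a 4-clique on 0..3
graph_a = build(k4)                            # one 4-clique
graph_b = build(k4 + [(u + 4, v + 4) for u, v in k4] + [(0, 4)])  # two, bridged

for name, g in [("A", graph_a), ("B", graph_b)]:
    print(name, len(cliques(g, 4)), third_order_clustering(g))
```

Graph B contains two 4-cliques against one in graph A, yet its third-order coefficient is 24/30 = 0.8 rather than 1, because the bridging edge creates six triangle-plus-edge wedges that can never close. More cliques therefore do not imply more clustering once the count is divided by the opportunities for closure.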
Thus far, we have analyzed global and average higher-order clustering, which both summarize the clustering of the entire network. In the next section, we look at more localized properties, namely the distribution of higher-order local clustering coefficients and the higher-order average clustering coefficient as a function of node degree. We now examine more localized clustering properties of our networks. Figure (left column) plots the joint distribution of Left column: Joint distributions of ( Analogous plots of Fig. for an instance of (a) Erdős-Rényi and (b) small-world random graphs (see the caption in Fig. for a more complete explanation of the figure). Left column: Joint distributions of ( For many nodes in C. elegans, local clustering is nearly random [Fig. , left], i.e., resembles the Erdős-Rényi joint distribution [Fig. , left]. In other words, there are many nodes that lie on the lower trend line. The fitted linear model of Figures and (right columns) plot higher-order average clustering as a function of node degree in the real-world and synthetic networks. In the Erdős-Rényi, small-world, C. elegans, and Enron email networks, there is a distinct gap between the average higher-order clustering coefficients for nodes of all degrees. Thus, our previous finding that the average clustering coefficient
We have proposed higher-order clustering coefficients to study higher-order closure patterns in networks, which generalize the widely used clustering coefficient that measures triadic closure. Our work complements other recent developments on the importance of higher-order information in network navigation and on temporal community structure; in contrast, we examine higher-order clique closure and only implicitly consider time as a motivation for closure. Extending our ideas to more network models, such as bipartite and multilayer networks, provides an avenue for future research.
Prior efforts in generalizing clustering coefficients have focused on shortest paths, cycle formation, and triangle frequency in
Finally, we focused on higher-order clustering coefficients as a global network measurement and as a node-level measurement. In related work we also show that large higher-order clustering implies the existence of mesoscale clique-dense community structure.
The web site associated with this paper, which includes software for computing higher-order clustering coefficients, is http://snap.stanford.edu/hocc.
This research has been supported in part by NSF Grant No. IIS-1149837, ARO MURI, DARPA, ONR, Huawei, and the Stanford Data Science Initiative. A.R.B. was supported in part by a Simons Investigator Award and NSF TRIPODS Award No. 1740822. We thank Will Hamilton and Marinka Žitnik for insightful comments. We thank Mason Porter and Peter Mucha for providing the congress committee membership data.
References (62)
- M. E. J. Newman, SIAM Rev. 45, 167 (2003).
- A. Rapoport, Bull. Math. Biophys. 15, 523 (1953).
- D. J. Watts and S. H. Strogatz, Nature 393, 440 (1998).
- M. S. Granovetter, Am. J. Sociol. 78, 1360 (1973).
- Z.-X. Wu and P. Holme, Phys. Rev. E 80, 037101 (2009).
- E. M. Jin, M. Girvan, and M. E. J. Newman, Phys. Rev. E 64, 046132 (2001).
- E. Ravasz and A.-L. Barabási, Phys. Rev. E 67, 026112 (2003).
- A. Barrat and M. Weigt, Eur. Phys. J. B 13, 547 (2000).
- M. E. J. Newman, Phys. Rev. Lett. 103, 058701 (2009).
- C. Seshadhri, T. G. Kolda, and A. Pinar, Phys. Rev. E 85, 056109 (2012).
- P. Robles, S. Moreno, and J. Neville, in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (ACM, New York, 2016), pp. 1155–1164.
- K. Henderson, B. Gallagher, T. Eliassi-Rad, H. Tong, S. Basu, L. Akoglu, D. Koutra, C. Faloutsos, and L. Li, in Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (ACM, New York, 2012), pp. 1231–1239.
- T. La Fond, J. Neville, and B. Gallagher, in Outlier Detection and Description under Data Diversity at the International Conference on Knowledge Discovery and Data Mining (2014), http://outlier-analytics.org/odd14kdd/odd-2014-proceedings.pdf.
- P. S. Bearman and J. Moody, Am. J. Publ. Health 94, 89 (2004).
- A. R. Benson, D. F. Gleich, and J. Leskovec, Science 353, 163 (2016).
- Ö. N. Yaveroğlu, N. Malod-Dognin, D. Davis, Z. Levnajic, V. Janjic, R. Karapandza, A. Stojmirovic, and N. Pržulj, Sci. Rep. 4, 4547 (2014).
- M. Rosvall, A. V. Esquivel, A. Lancichinetti, J. D. West, and R. Lambiotte, Nat. Commun. 5 (2014).
- G. Palla, I. Derényi, I. Farkas, and T. Vicsek, Nature 435, 814 (2005).
- N. Slater, R. Itzchack, and Y. Louzoun, Network Science 2, 387 (2014).
- H. Yin, A. R. Benson, J. Leskovec, and D. F. Gleich, in Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Halifax, NS, Canada (ACM, New York, NY, 2017), pp. 555–564.
- S. Boccaletti, V. Latora, Y. Moreno, M. Chavez, and D.-U. Hwang, Phys. Rep. 424, 175 (2006).
- R. D. Luce and A. D. Perry, Psychometrika 14, 95 (1949).
- N. Chiba and T. Nishizeki, SIAM J. Comput. 14, 210 (1985).
- F. Harary, Graph Theory (Addison-Wesley, 1969).
- R. M. Karp, in Complexity of Computer Computations (Springer, Berlin, 1972), pp. 85–103.
- C. Seshadhri, A. Pinar, and T. G. Kolda, in Proceedings of the 2013 SIAM International Conference on Data Mining (SIAM, Philadelphia, PA, 2013), pp. 10–18.
- J. B. Kruskal, Math. Optimiz. Techn. 10, 251 (1963).
- G. Katona, in Theory of Graphs: Proceedings of the Colloquium (Akadémiai Kiadó, Budapest, 1966), pp. 187–207.
- P. Erdős and A. Rényi, Publ. Math. (Debrecen) 6, 290 (1959).
- B. Bollobás and P. Erdős, in Mathematical Proceedings of the Cambridge Philosophical Society, Vol. 80 (Cambridge University Press, Cambridge, 1976), pp. 419–427.
- D. J. Bumbarger, M. Riebesell, C. Rödelsperger, and R. J. Sommer, Cell 152, 109 (2013).
- S.-y. Takemura, A. Bharioke, Z. Lu, A. Nern, S. Vitaladevuni, P. K. Rivlin, W. T. Katz, D. J. Olbris, S. M. Plaza, P. Winston et al., Nature 500, 175 (2013).
- M. Helmstaedter, K. L. Briggman, S. C. Turaga, V. Jain, H. S. Seung, and W. Denk, Nature 500, 168 (2013).
- A. L. Traud, P. J. Mucha, and M. A. Porter, Physica A 391, 4165 (2012).
- L. Takac and M. Zabovsky, in International Scientific Conference and International Workshop Present Day Trends of Innovations, Vol. 1 (2012), https://snap.stanford.edu/data/soc-pokec.pdf.
- A. Mislove, M. Marcon, K. P. Gummadi, P. Druschel, and B. Bhattacharjee, in Proceedings of the Seventh ACM SIGCOMM Conference on Internet Measurement (IMC '07), San Diego, CA (ACM, New York, NY, 2007), pp. 29–42.
- J. Leskovec, J. Kleinberg, and C. Faloutsos, ACM Trans. Knowl. Discov. Data 1, 2 (2007).
- M. A. Porter, P. J. Mucha, M. E. J. Newman, and C. M. Warmbrand, Proc. Natl. Acad. Sci. U.S.A. 102, 7057 (2005).
- J. Yang and J. Leskovec, Knowl. Inf. Syst. 42, 181 (2015).
- B. Klimt and Y. Yang, European Conference on Machine Learning (Springer, Berlin, 2004), pp. 217–226.
- P. Panzarasa, T. Opsahl, and K. M. Carley, J. Assoc. Inf. Sci. Technol. 60, 911 (2009).
- J. Leskovec, D. P. Huttenlocher, and J. M. Kleinberg, in Proceedings of the Fourth International Conference on Web and Social Media (AAAI, Washington DC, 2010), pp. 98–105.
- J. Leskovec, J. Kleinberg, and C. Faloutsos, in Proceedings of the 11th ACM SIGKDD International Conference on Knowledge Discovery in Data Mining (ACM, New York, 2005), pp. 177–187.
- M. Ripeanu, A. Iamnitchi, and I. Foster, IEEE Internet Comput. 6, 50 (2002).
- M. Kaiser, New J. Phys. 10, 083042 (2008).
- B. Bollobás, Eur. J. Combin. 1, 311 (1980).
- R. Milo, N. Kashtan, S. Itzkovitz, M. E. J. Newman, and U. Alon, arXiv:cond-mat/0312028 (2003).
- J. Park and M. E. J. Newman, Phys. Rev. E 70, 066117 (2004).
- P. Colomer-de Simón, M. Á. Serrano, M. G. Beiró, J. I. Alvarez-Hamelin, and M. Boguñá, Sci. Rep. 3, 2517 (2013).
- S. P. Borgatti and M. G. Everett, Soc. Netw. 21, 375 (2000).
- P. Rombach, M. A. Porter, J. H. Fowler, and P. J. Mucha, SIAM Rev. 59, 619 (2017).
- U. Bhat, P. L. Krapivsky, R. Lambiotte, and S. Redner, Phys. Rev. E 94, 062302 (2016).
- R. Lambiotte, P. L. Krapivsky, U. Bhat, and S. Redner, Phys. Rev. Lett. 117, 218301 (2016).
- R. Milo, S. Shen-Orr, S. Itzkovitz, N. Kashtan, D. Chklovskii, and U. Alon, Science 298, 824 (2002).
- L. R. Varshney, B. L. Chen, E. Paniagua, D. H. Hall, and D. B. Chklovskii, PLOS Comput. Biol. 7, e1001066 (2011).
- J. Ugander, L. Backstrom, and J. Kleinberg, in Proceedings of the 22nd International Conference on World Wide Web (ACM, New York, 2013), pp. 1307–1318.
- I. Scholtes, arXiv:1702.05499 (2017).
- V. Sekara, A. Stopczynski, and S. Lehmann, Proc. Natl. Acad. Sci. U.S.A. 113, 9977 (2016).
- A. Fronczak, J. A. Hołyst, M. Jedynak, and J. Sienkiewicz, Physica A 316, 688 (2002).
- G. Caldarelli, R. Pastor-Satorras, and A. Vespignani, Eur. Phys. J. B 38, 183 (2004).
- R. F. S. Andrade, J. G. V. Miranda, and T. P. Lobão, Phys. Rev. E 73, 046101 (2006).
- B. Jiang and C. Claramunt, Environ. Plann. B 31, 151 (2004).