Given the prevalence of large networks in several areas, we have been working on various graph-mining problems. Our focus has been the design and application of techniques for dealing with the large scale. We have been analyzing graphs under various streaming models and we use sampling techniques to solve problems such as triangle or subgraph counting.
On the more application-oriented efforts, we have performed various analyses of the web graph, designed algorithms to detect web spam, and developed aggregation techniques for sensor networks.
P. Brach, M. Cygan, J. Lacki, and P. Sankowski
Algorithmic Complexity of Power Law Networks
Proc. 27th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA 2016), 2016
We define a deterministic condition for checking whether a graph has a power law degree distribution and experimentally validate it on real-world networks. This definition allows us to derive interesting properties of power law networks. We observe that for exponents of the degree distribution in the range [1, 2] such networks exhibit double power law phenomenon that was observed for several real-world networks. Moreover, we give a novel theoretical explanation why many algorithms run faster on real-world data than what is predicted by algorithmic worst-case analysis. We show how to exploit the power law degree distribution to design faster algorithms for a number of classic problems including transitive closure, maximum matching, determinant, PageRank, matrix inverse, counting triangles and finding maximum clique. In contrast to previously done average-case analyses, we believe that this is the first “waterproof’ argument that explains why many real-world networks are easier.
S. Lattanzi, S. Leonardi, V. Mirrokni, and I. Razenshteyn
Robust Hierarchical K-Center Clustering
Proc. 6th ACM International Conference on Innovations on Theoretical Computer Science (ITCS 2015), 2015
A. Epasto, J. Feldman, S. Lattanzi, S. Leonardi, and V. Mirrokni
Reduce and Aggregate: Similarity Ranking in Multi-Categorical Bipartite Graphs
Proceedings of the 23rd International Conference on World Wide Web , 2014
We study the problem of computing similarity rankings in large-scale multi-categorical bipartite graphs, where the two sides of the graph represent actors and items, and the items are partitioned into an arbitrary set of categories. The problem has several real-world applications, including identifying competing advertisers and suggesting related queries in an online advertising system or finding users with similar interests and suggesting content to them. In these settings, we are interested in computing on-the-fly rankings of similar actors, given an actor and an arbitrary subset of categories of interest. Two main challenges arise: First, the bipartite graphs are huge and often lopsided (e.g. the system might receive billions of queries while presenting only millions of advertisers). Second, the sheer number of possible combinations of categories prevents the pre-computation of the results for all of them. We present a novel algorithmic framework that addresses both issues for the computation of several graph-theoretical similarity measures, including # common neighbors, and Personalized PageRank. We show how to tackle the imbalance in the graphs to speed up the computation and provide efficient real-time algorithms for computing rankings for an arbitrary subset of categories. Finally, we show experimentally the accuracy of our approach with real-world data, using both public graphs and a very large dataset from Google AdWords.
S. Lattanzi, and S. Leonardi
Efficient Computation of the Weighted Clustering Coefficient
Proc. 11th Workshop on Algorithms and Models for the Web Graph (WAW 2014), 2014
The clustering coefficient of an unweighted network has been extensively used to quantify how tightly connected is the neighbor around a node and it has been widely adopted for assessing the quality of nodes in a social network. The computation of the clustering coefficient is challenging since it requires to count the number of triangles in the graph. Several recent works proposed efficient sampling, streaming and MapReduce algorithms that allow to overcome this computational bottleneck. As a matter of fact, the intensity of the interaction between nodes, that is usually represented with weights on the edges of the graph, is also an important measure of the statistical cohesiveness of a network. Recently various notions of weighted clustering coefficient have been proposed but all those techniques are hard to implement on large-scale graphs. In this work we show how standard sampling techniques can be used to obtain efficient estimators for the most commonly used measures of weighted clustering coefficient. Furthermore we also propose a novel graph-theoretic notion of clustering coefficient in weighted networks.
P. Rozenshtein, A. Anagnostopoulos, A. Gionis, and N. Tatti
Event Detection in Activity Networks [pdf]
Proc. 20th International Conference on Knowledge Discovery and Data Mining (KDD 2014), 2014
With the fast growth of smart devices and social networks, a lot of computing systems collect data that record different types of activities. An important computational challenge is to analyze these data, extract patterns, and understand activity trends. We consider the problem of mining activity networks to identify interesting events, such as a big concert or a demonstration in a city, or a trending keyword in a user community in a social network.
We define an event to be a subset of nodes in the network that are close to each other and have high activity levels. We formalize the problem of event detection using two graph-theoretic formulations. The first one captures the compactness of an event using the sum of distances among all pairs of the event nodes. We show that this formulation can be mapped to the MaxCut problem, and thus, it can be solved by applying standard semidefinite programming techniques. The second formulation captures compactness using a minimum-distance tree. This formulation leads to the prize-collecting Steiner-tree problem, which we solve by adapting existing approximation algorithms. For the two problems we introduce, we also propose efficient and effective greedy approaches and we prove performance guarantees for one of them. We experiment with the proposed algorithms on real datasets from a public bicycling system and a geolocation-enabled social network dataset collected from twitter. The results show that our methods are able to detect meaningful events.
A. Epasto, J. Feldman, S. Lattanzi, S. Leonardi, and V. Mirrokni
Similarity Ranking in Multi-Categorical Bipartite Graphs
Proc. 23rd International World Wide Web Conference (WWW 2014), 2014
We study the problem of computing similarity rankings in large-scale multi-categorical bipartite graphs, where the two sides of the graph represent actors and items, and the items are partitioned into an arbitrary set of categories. The problem has several real-world applications, including identifying competing advertisers and suggesting related queries in an online advertising system or finding users with similar interests and suggesting content to them. In these settings, we are interested in computing on-the-fly rankings of similar actors and items, given an actor and an arbitrary subset of categories of interest. Two main challenges arise: First, the bipartite graphs are huge and often lopsided (e.g. the system might receive billions of queries while presenting only millions of advertisers). Second, the sheer number of possible combinations of categories prevents the pre-computation of the results for all of them.
We present a novel algorithmic framework that addresses both issues for the computation of several graph-theoretical similarity measures, including # common neighbors, and Personalized PageRank. We show how to tackle the imbalance in the graphs to speed up the computation and provide efficient real-time algorithms for computing rankings for an arbitrary subset of categories.
Finally, we show experimentally the accuracy of our approach with real-world data, using both public graphs and a very large dataset from Google AdWords.
A. Anagnostopoulos, A. Dasgupta, and R. Kumar
A Constant-Factor Approximation Algorithm for Co-Clustering [pdf]
Theory of Computing, Volume 8, Number 26, 2012
Co-clustering is the simultaneous partitioning of the rows and columns of a matrix such that the blocks induced by the row/column partitions are good clusters. Motivated by several applications in text mining, market-basket analysis, and bioinformatics, this problem has attracted a lot of attention in the past few years. Unfortunately, to date, most of the algorithmic work on this problem has been heuristic in nature.
In this work we obtain the first approximation algorithms for the co-clustering problem. Our algorithms are simple and provide constant-factor approximations to the optimum. We also show that co-clustering is NP-hard, thereby complementing our algorithmic result.
L. Becchetti, I. Chatzigiannakis, and Y. Giannakopoulos
Streaming Techniques and Data Aggregation in Networks of Tiny Artefacts [pdf]
Computer Science Review, Volume 5, Number 1, 2011
In emerging pervasive scenarios, data is collected by sensing devices in streams that occur at several distributed points of observation. The size of the data typically far exceeds the storage and computational capabilities of the tiny devices that have to collect and process them. A general and challenging task is to allow (some of) the nodes of a pervasive network to collectively perform monitoring of a neighbourhood of interest by issuing continuous aggregate queries on the streams observed in its vicinity. This class of algorithms is fully decentralized and diffusive in nature: collecting all the data at a few central nodes of the network is unfeasible in networks of low capability devices or in the presence of massive data sets. Two main problems arise in this scenario: (i) the intrinsic complexity of maintaining statistics over a data stream whose size greatly exceeds the capabilities of the device that performs the computation; (ii) composing the partial outcomes computed at different points of observation into an accurate, global statistic over a neighbourhood of interest, which entails coping with several problems, last but not least the receipt of duplicate information along multiple paths of diffusion.
Streaming techniques have emerged as powerful tools to achieve the general goals described above, in the first place because they assume a computational model in which computational and storage resources are assumed to be far exceeded by the amount of data on which computation occurs. In this contribution, we review the main streaming techniques and provide a classification of the computational problems and the applications they effectively address, with an emphasis on decentralized scenarios, which are of particular interest in pervasive networks.
L. Becchetti, P. Boldi, C. Castillo, and A. Gionis
Efficient Algorithms for Large-Scale Local Triangle Counting [pdf]
ACM Transactions on Knowledge Discovery From Data, Volume 4, Number 3, 2010
In this article, we study the problem of approximate local triangle counting in large graphs. Namely, given a large graph G=(V,E) we want to estimate as accurately as possible the number of triangles incident to every node v∈V in the graph. We consider the question both for undirected and directed graphs. The problem of computing the global number of triangles in a graph has been considered before, but to our knowledge this is the first contribution that addresses the problem of approximate local triangle counting with a focus on the efficiency issues arising in massive graphs and that also considers the directed case. The distribution of the local number of triangles and the related local clustering coefficient can be used in many interesting applications. For example, we show that the measures we compute can help detect the presence of spamming activity in large-scale Web graphs, as well as to provide useful features for content quality assessment in social networks.
For computing the local number of triangles (undirected and directed), we propose two approximation algorithms, which are based on the idea of min-wise independent permutations [Broder et al. 1998]. Our algorithms operate in a semi-streaming fashion, using O(|V|) space in main memory and performing O(log|V|) sequential scans over the edges of the graph. The first algorithm we describe in this article also uses O(|E|) space of external memory during computation, while the second algorithm uses only main memory. We present the theoretical analysis as well as experimental results on large graphs, demonstrating the practical efficiency of our approach.
L. Becchetti, A. Marchetti-Spaccamela, A. Vitaletti, P. Korteweg, M. Skutella, and L. Stougie
Latency-Constrained Aggregation in Sensor Networks [pdf]
ACM Transactions on Algorithms, Volume 6, Number 1, 2009
A sensor network consists of sensing devices which may exchange data through wireless communication; sensor networks are highly energy constrained since they are usually battery operated. Data aggregation is a possible way to save energy consumption: nodes may delay data in order to aggregate them into a single packet before forwarding them towards some central node (sink). However, many applications impose constraints on the maximum delay of data; this translates into latency constraints for data arriving at the sink.
We study the problem of data aggregation to minimize maximum energy consumption under latency constraints on sensed data delivery, and we assume unique communication paths that form an intree rooted at the sink. We prove that the offline problem is strongly NP-hard and we design a 2-approximation algorithm. The latter uses a novel rounding technique.
Almost all real-life sensor networks are managed online by simple distributed algorithms in the nodes. In this context we consider both the case in which sensor nodes are synchronized or not. We assess the performance of the algorithm by competitive analysis. We also provide lower bounds for the models we consider, in some cases showing optimality of the algorithms we propose. Most of our results also hold when minimizing the total energy consumption of all nodes.
L. Becchetti, C. Castillo, D. Donato, R. Baeza-Yates, and S. Leonardi
Link Analysis for Web Spam Detection [pdf]
ACM Transactions on the Web, Volume 2, Number 1, 2008
We propose link-based techniques for automating the detection of Web spam, a term referring to pages which use deceptive techniques to obtain undeservedly high scores in search engines. The issue of Web spam is widespread and difficult to solve, mostly due to the large size of the Web which means that, in practice, many algorithms are infeasible.
We perform a statistical analysis of a large collection of Web pages. In particular, we compute statistics of the links in the vicinity of every Web page applying rank propagation and probabilistic counting over the entire Web graph in a scalable way. We build several automatic web spam classifiers using different techniques. This paper presents a study of the performance of each of these classifiers alone, as well as their combined performance.
Based on these results we propose spam detection techniques which only consider the link structure of Web, regardless of page contents. These statistical features are used to build a classifier that is tested over a large collection of Web link spam. After ten-fold cross-validation, our best classifiers have a performance comparable to that of state-of-the-art spam classifiers that use content attributes, and orthogonal to their methods.
D. Donato, S. Leonardi, S. Millozzi, and P. Tsaparas
Mining the Inner Structure of the Web Graph
Journal of Physics A: Mathematical and Theoretical, Volume 41, 2008
Despite being the sum of decentralized and uncoordinated efforts by heterogeneous groups and individuals, the World Wide Web exhibits a well defined structure, characterized by several interesting properties. This structure was clearly revealed by Broder et al. (Computer Networks 33, 2000) who presented the evocative bow-tie picture of the Web. Although, the bow-tie structure is a relatively clear abstraction of the macroscopic picture of the Web, it is quite uninformative with respect to the inner details of the Web graph. In this paper, we mine the inner structure of the Web graph. We present a series of measurements on the Web, which offer a better understanding of the individual components of the bow-tie. In the process, we develop algorithmic techniques for performing these measurements. We discover that the scale-free properties permeate all the components of the bow-tie which exhibit the same macroscopic properties as the Web graph itself. However, close inspection reveals that their inner structure is quite distinct. We show that the Web graph does not exhibit self similarity within its components, and we propose a possible alternative picture for the Web graph, as it emerges from our experiments.
L. Becchetti, P. Boldi, C. Castillo, and A. Gionis
Efficient Semi-Streaming Algorithms for Local Triangle Counting in Massive Graphs [pdf]
Proc. 14th International Conference on Knowledge Discovery and Data Mining (KDD 2008), 2008
In this paper we study the problem of local triangle counting in large graphs. Namely, given a large graph G=(V,E) we want to estimate as accurately as possible the number of triangles incident to every node v∈V in the graph. The problem of computing the global number of triangles in a graph has been considered before, but to our knowledge this is the first paper that addresses the problem of local triangle counting with a focus on the efficiency issues arising in massive graphs. The distribution of the local number of triangles and the related local clustering coefficient can be used in many interesting applications. For example, we show that the measures we compute can help to detect the presence of spamming activity in large-scale Web graphs, as well as to provide useful features to assess content quality in social networks.
For computing the local number of triangles we propose two approximation algorithms, which are based on the idea of min-wise independent permutations (Broder et al. 1998). Our algorithms operate in a semi-streaming fashion, using O(|V|) space in main memory and performing O(log|V|) sequential scans over the edges of the graph. The first algorithm we describe in this paper also uses O(|E|) space in external memory during computation, while the second algorithm uses only main memory. We present the theoretical analysis as well as experimental results in massive graphs demonstrating the practical efficiency of our approach.
I. Bordino, D. Donato, A. Gionis, and S. Leonardi
Mining Large Networks with Subgraph Counting
Proc. 8th IEEE International Conference on Data Mining (ICDM 2008), 2008
The problem of mining frequent patterns in networks has many applications, including analysis of complex networks, clustering of graphs, finding communities in social networks, and indexing of graphical and biological databases. Despite this wealth of applications, the current state of the art lacks algorithmic tools for counting the number of subgraphs contained in a large network.
In this paper we develop data-stream algorithms that approximate the number of all subgraphs of three and four vertices in directed and undirected networks. We use the frequency of occurrence of all subgraphs to prove their significance in order to characterize different kinds of networks: we achieve very good precision in clustering networks with similar structure. The significance of our method is supported by the fact that such high precision cannot be achieved when performing clustering based on simpler topological properties, such as degree, assortativity, and eigenvector distributions. We have also tested our techniques using swap randomization.
D. Donato, S. Leonardi, and M. Paniccia
Combining Transitive Trust and Negative Opinions for Better Reputation Management in Social Networks
Proc. 2nd ACM Workshop on Social Network Mining and Analysis (SNA-KDD 2008), 2008
Reputation management is a crucial task in Peer-to-Peer networks, social networks and other decentralized distributed systems. In this work we investigate the role of users' negative opinions in order to devise fully decentralized reputation management mechanisms. Our study is motivated by the limitations of methods proposed in literature that are based on the idea of propagating positive opinions, most notably EigenTrust , a cornerstone method for reputation management in decentralized systems. EigenTrust makes use of a transitive deÞnition of trust: a peer tends to trust those peers who have a high reputation in the opinion of trustworthy peers. While EigenTrust has been shown to be effective against a number of threat attacks from coalitions of malicious players, it does not address properly more sophisticated threat attacks.
In this paper we propose a new approach to the design of fully decentralized reputation mechanisms that combine negative and positive opinions expressed by peers to reach a global consensus on trust and distrust valuess for each peer of the network. We show how these strategies outperform EigenTrust in terms of number of successful transactions against a large set of sophisticated threat attacks posed by coalitions of malicious peers. We also discuss a clustering method that achieves detecting most of the malicious peers with high precision.
D. Donato, L. Laura, S. Leonardi, and S. Millozzi
The Web as a Graph: How Far We Are
ACM Transactions on Internet Technology, Volume 7, Number 1, 2007
In this article we present an experimental study of the properties of webgraphs. We study a large crawl from 2001 of 200M pages and about 1.4 billion edges, made available by the WebBase project at Stanford, as well as several synthetic ones generated according to various models proposed recently. We investigate several topological properties of such graphs, including the number of bipartite cores and strongly connected components, the distribution of degrees and PageRank values and some correlations; we present a comparison study of the models against these measures. Our findings are that (i) the WebBase sample differs slightly from the (older) samples studied in the literature, and (ii) despite the fact that these models do not catch all of its properties, they do exhibit some peculiar behaviors not found, for example, in the models from classical random graph theory.Moreover we developed a software library able to generate and measure massive graphs in secondary memory; this library is publicy available under the GPL licence. We discuss its implementation and some computational issues related to secondary memory graph algorithms.
D. Donato, M. Paniccia, M. Selis, C. Castillo, G. Cortese, and S. Leonardi
New Metrics for Reputation Management in P2P Networks [pdf]
Proc. 3rd International Workshop on Adversarial Information Retrieval on the Web (AIRWeb 2007), 2007
In this work we study the effectiveness of mechanisms for decentralized reputation management in P2P networks. We depart from EigenTrust, an algorithm designed for reputation management in file sharing applications over p2p networks. EigenTrust has been proved very effective against three different natural attacks from malicious coalitions while it performs poorly on particular attack organized by two different kinds of malicious peers. We propose various metrics of reputation based on ideas recently introduced for detecting and demoting Web spam. We combine these metrics with the original EigenTrust approach. Our mechanisms are more effective than EigenTrust alone for detecting malicious peers and reducing the number of inauthentic downloads not only for all the cases previously addressed but also for more sophisticated attacks.
L. Buriol, G. Frahling, S. Leonardi, and C. Sohler
Estimating Clustering Indexes in Data Streams
Proc. 15th Annual European Symposium on Algorithms (ESA 2007), 2007
We present random sampling algorithms that with probability at least 1-δ compute a (1±ε)-approximation of the clustering coefficient and of the number of bipartite clique subgraphs of a graph given as an incidence stream of edges. The space used by our algorithm to estimate the clustering coefficient is inversely related to the clustering coefficient of the network itself. The space used by our algorithm to compute the number K3,3 of bipartite cliques is proportional to the ratio between the number of K1,3 and K3,3 in the graph.
Since the space complexity depends only on the structure of the input graph and not on the number of nodes, our algorithms scale very well with increasing graph size. Therefore they provide a basic tool to analyze the structure of dense clusters in large graphs and have many applications in the discovery of web communities, the analysis of the structure of large social networks and the probing of frequent patterns in large graphs.
We implemented both algorithms and evaluated their performance on networks from different application domains and of different size; The largests instance is a webgraph consisting of more than 135 million nodes and 1 billion edges. Both algorithms compute accurate results in reasonable time on the tested instances.
A. Capocci, V. Servedio, F. Colaiori, L. Buriol, D. Donato, S. Leonardi, and G. Caldarelli
Preferential Attachment in the Growth of Social Networks: The Internet Encyclopedia Wikipedia
Physical Review E, Volume 74, Number 3, 2006
We present an analysis of the statistical properties and growth of the free on-line encyclopedia Wikipedia. By describing topics by vertices and hyperlinks between them as edges, we can represent this encyclopedia as a directed graph. The topological properties of this graph are in close analogy with those of the World Wide Web, despite the very different growth mechanism. In particular, we measure a scale-invariant distribution of the in and out degree and we are able to reproduce these features by means of a simple statistical model. As a major consequence, Wikipedia growth can be described by local rules such as the preferential attachment mechanism, though users, who are responsible of its evolution, can act globally on the network.
D. Donato, L. Laura, S. Leonardi, U. Meyer, S. Millozzi, and J. Sibeyn
Algorithms and Experiments for the Webgraph
Journal of Graph Algorithms and Applications, Volume 10, Number 2, 2006
In this paper we present an experimental study of the properties of web graphs. We study a large crawl from 2001 of 200M pages and about 1.4 billion edges made available by the WebBase project at Stanford , and synthetic graphs obtained by the large scale simulation of stochastic graph models for the Webgraph. This work has required the development and the use of external and semi-external algorithms for computing properties of massive graphs, and for the large scale simulation of stochastic graph models. We report our experimental findings on the topological properties of such graphs, describe the algorithmic tools developed within this project and report the experiments on their time performance.
D. Donato, S. Leonardi, and P. Tsaparas
Stability and Similarity of Link Analysis Ranking Algorithms
Internet Mathematics, Volume 3, Number 4, 2006
Recently, there has been a surge of research activity in the area of link analysis ranking, where hyperlink structures are used to determine the relative authority of webpages. One of the seminal works in this area is that of Kleinberg [Kleinberg 98], who proposed the HITS algorithm. In this paper, we undertake a theoretical analysis of the properties of the HITS algorithm on a broad class of random graphs. Working within the framework of Borodin et al. [Borodin et al. 05], we prove that, under some assumptions, on this class (a) the HITS algorithm is stable with high probability and (b) the HITS algorithm is similar to the InDegree heuristic that assigns to each node weight proportional to the number of incoming links. We demonstrate that our results go through for the case that the expected in-degrees of the graph follow a power law distribution. We also study experimentally the similarity between HITS and InDegree, and we investigate the general conditions under which the two algorithms are similar.
L. Buriol, C. Castillo, D. Donato, S. Leonardi, and S. Millozzi
Temporal Analysis of the Wikigraph
Proc. 2006 IEEE/WIC/ACM International Conference on Web Intelligence (WI 2006), 2006
L. Becchetti, C. Castillo, D. Donato, S. Leonardi, and R. Baeza-Yates
Using Rank Propagation and Probabilistic Counting for Link-Based Spam Detection
Proc. 2006 Workshop on Web Mining and Web Usage Analysis (WebKDD 2006), 2006
This paper describes a technique for automating the detection of Web link spam, that is, groups of pages that are linked together with the sole purpose of obtaining an undeservedly high score in search engines. The problem of Web spam is widespread and difficult to solve, mostly due to the large size of web collections that makes many algorithms unfeasible in practice.
L. Buriol, G. Frahling, S. Leonardi, A. Marchetti-Spaccamela, and C. Sohler
Counting Triangles in Data Streams
Proc. 25th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (PODS 2006), 2006
We present two space bounded random sampling algorithms that compute an approximation of the number of triangles in an undirected graph given as a stream of edges. Our first algorithm does not make any assumptions on the order of edges in the stream. It uses space that is inversely related to the ratio between the number of triangles and the number of triples with at least one edge in the induced subgraph, and constant expected update time per edge. Our second algorithm is designed for incidence streams (all edges incident to the same vertex appear consecutively). It uses space that is inversely related to the ratio between the number of triangles and length 2 paths in the graph and expected update time O(log|V|(1+s|V|/|E|)), where s is the space requirement of the algorithm. These results significantly improve over previous work [20, 8]. Since the space complexity depends only on the structure of the input graph and not on the number of nodes, our algorithms scale very well with increasing graph size and so they provide a basic tool to analyze the structure of large graphs. They have many applications, for example, in the discovery of Web communities, the computation of clustering and transitivity coefficient, and discovery of frequent patterns in large graphs.
We have implemented both algorithms and evaluated their performance on networks from different application domains. The sizes of the considered graphs varied from about 8,000 nodes and 40,000 edges to 135 million nodes and more than 1 billion edges. For both algorithms we run experiments with parameter s=1,000, 10,000, 100,000, 1,000,000 to evaluate running time and approximation guarantee. Both algorithms appear to be time efficient for these sample sizes. The approximation quality of the first algorithm was varying significantly and even for s=1,000,000 we had more than 10% deviation for more than half of the instances. The second algorithm performed much better and even for s=10,000 we had an average deviation of less than 6% (taken over all but the largest instance for which we could not compute the number of triangles exactly).
L. Becchetti, C. Castillo, D. Donato, and A. Fazzone
A Comparison of Sampling Techniques for Web Graph Characterization
Proc. 2006 ACM Workshop on Link Analysis: Dynamics and Static of Large Networks (LinKDD 2006), 2006
We present a detailed statistical analysis of the characteristics of partial Web graphs obtained by sub-sampling a large collection of Web pages.
We show that in general the macroscopic properties of the Web are better represented by a shallow exploration of a large number of sites than by a deep exploration of a limited set of sites. We also describe and quantify the bias induced by the different sampling strategies, and show that it can be significant even if the sample covers a large fraction of the collection.