Web mining

Web mining is a large, dynamic, and interdisciplinary research area, which can be broadly divided into three main subareas corresponding to three different knowledge discovery domains: (1) web-content mining, which infers knowledge from the content of web pages; (2) web-structure mining, which extracts information from data describing the organization of web content; and (3) web-usage mining, which captures usage patterns by analyzing data stored in web-server logs.

In our group we are involved in all of these areas. We have developed techniques for analyzing web pages and obtain their summary to use them for contextual advertising. We have shown how to analyze the pages that web users visit to discover similar users, so as to recommend pages from early adopters to other similar users.

We have analyzed the web graph to understand and characterize its structure and to design classifiers that recognize spam web pages.

We have shown how to analyze web query-log data to design static caching strategies by storing web-search results that satisfy a large number of queries, taking into account relevance and diversification using the idea of query-covering. We have also shown how we can use user's browsing behavior to recommend queries by exploiting the so-called query-flow graph.




A. Anagnostopoulos, L. Becchetti, I. Bordino, S. Leonardi, I. Mele, and P. Sankowski

Stochastic Query Covering for Fast Approximate Document Retrieval [pdf]

ACM Transactions on Intelligent Systems, Volume 33, Number 3, 2015

Show abstract

Hide abstract

We design algorithms that, given a collection of documents and a distribution over user queries, return a small subset of the document collection in such a way that we can efficiently provide high quality answers to user queries using only the selected subset. This approach has applications when space is a constraint or when the query-processing time increases significantly with the size of the collection. We study our algorithms through the lens of stochastic analysis and we prove that, even though they use only a small fraction of the entire collection, they can provide answers to most user queries, achieving a performance close to the optimal. To complement our theoretical findings, we experimentally show the versatility of our approach by considering two important cases in the context of web search: In the first case, we favor the retrieval of documents that are very relevant to the query, whereas in the second case we aim for document diversification. Both the theoretical and the experimental analysis provide strong evidence of the potential value of query covering in diverse application scenarios.

A. Anagnostopoulos, L. Becchetti, A. Fazzone, I. Mele, and M. Riondato

The Importance of Being Expert: Efficient Max-Finding in Crowdsourcing [pdf]

Proc. 2015 ACM SIGMOD International Conference on Management of Data (SIGMOD 2015), 2015

Show abstract

Hide abstract

Crowdsourcing is a computational paradigm whose distinctive feature is the involvement of human workers in key steps of the computation. It is used successfully to address problems that would be hard or impossible to solve for machines. As we highlight in this work, the exclusive use of nonexpert individuals may prove ineffective in some cases, especially when the task at hand or the need for accurate solutions demand some degree of specialization to avoid excessive uncertainty and inconsistency in the answers. We address this limitation by proposing an approach that combines the wisdom of the crowd with the educated opinion of experts. We present a computational model for crowdsourcing that envisions two classes of workers with different expertise levels. One of its distinctive features is the adoption of the threshold error model, whose roots are in psychometrics and which we extend from previous theoretical work. Our computational model allows to evaluate the performance of crowdsourcing algorithms with respect to accuracy and cost. We use our model to develop and analyze an algorithm for approximating the best, in a broad sense, of a set of elements. The algorithm uses naïve and expert workers to find an element that is a constant-factor approximation to the best. We prove upper and lower bounds on the number of comparisons needed to solve this problem, showing that our algorithm uses expert and naïve workers optimally up to a constant factor. Finally, we evaluate our algorithm on real and synthetic datasets using the CrowdFlower crowdsourcing platform, showing that our approach is also effective in practice.


A. Epasto, J. Feldman, S. Lattanzi, S. Leonardi, and V. Mirrokni

Reduce and Aggregate: Similarity Ranking in Multi-Categorical Bipartite Graphs

Proceedings of the 23rd International Conference on World Wide Web , 2014

Show abstract

Hide abstract

We study the problem of computing similarity rankings in large-scale multi-categorical bipartite graphs, where the two sides of the graph represent actors and items, and the items are partitioned into an arbitrary set of categories. The problem has several real-world applications, including identifying competing advertisers and suggesting related queries in an online advertising system or finding users with similar interests and suggesting content to them. In these settings, we are interested in computing on-the-fly rankings of similar actors, given an actor and an arbitrary subset of categories of interest. Two main challenges arise: First, the bipartite graphs are huge and often lopsided (e.g. the system might receive billions of queries while presenting only millions of advertisers). Second, the sheer number of possible combinations of categories prevents the pre-computation of the results for all of them. We present a novel algorithmic framework that addresses both issues for the computation of several graph-theoretical similarity measures, including # common neighbors, and Personalized PageRank. We show how to tackle the imbalance in the graphs to speed up the computation and provide efficient real-time algorithms for computing rankings for an arbitrary subset of categories. Finally, we show experimentally the accuracy of our approach with real-world data, using both public graphs and a very large dataset from Google AdWords.


I. Mele, F. Bonchi, and A. Gionis

The Early-Adopter Graph and Its Application to Web-Page Recommendation [pdf]

Proc. 21st ACM International Conference on Information and Knowledge Management (CIKM 2012), 2012

Show abstract

Hide abstract

In this paper we present a novel graph-based data abstraction for modeling the browsing behavior of web users. The objective is to identify users who discover interesting pages before others. We call these users early adopters. By tracking the browsing activity of early adopters we can identify new interesting pages early, and recommend these pages to similar users. We focus on news and blog pages, which are more dynamic in nature and more appropriate for recommendation.

Our proposed model is called early-adopter graph. In this graph, nodes represent users and a directed arc between users u and v expresses the fact that u and v visit similar pages and, in particular, that user u tends to visit those pages before user v. The weight of the edge is the degree to which the temporal rule "v visits a page before v" holds.

Based on the early-adopter graph, we build a recommendation system for news and blog pages, which outperforms other out-of-the-shelf recommendation systems based on collaborative filtering.


A. Anagnostopoulos, A. Z. Broder, E. Gabrilovich, V. Josifovski, and L. Riedel

Web Page Summarization for Just-in-Time Contextual Advertising [pdf]

ACM Transactions on Intelligent Systems and Technology, Volume 3, Number 1, 2011

Show abstract

Hide abstract

Contextual advertising is a type of Web advertising, which, given the URL of a Web page, aims to embed into the page the most relevant textual ads available. For static pages that are displayed repeatedly, the matching of ads can be based on prior analysis of their entire content; however, often ads need to be matched to new or dynamically created pages that cannot be processed ahead of time. Analyzing the entire content of such pages on-the-fly entails prohibitive communication and latency costs. To solve the three-horned dilemma of either low relevance or high latency or high load, we propose to use text summarization techniques paired with external knowledge (exogenous to the page) to craft short page summaries in real time.

Empirical evaluation proves that matching ads on the basis of such summaries does not sacrifice relevance, and is competitive with matching based on the entire page content. Specifically, we found that analyzing a carefully selected 6% fraction of the page text can sacrifice only 1%--3% in ad relevance. Furthermore, our summaries are fully compatible with the standard JavaScript mechanisms used for ad placement: they can be produced at ad-display time by simple additions to the usual script, and they only add 500--600 bytes to the usual request. We also compared our summarization approach, which is based on structural properties of the HTML content of the page, with a more principled one based on one of the standard text summarization tools (MEAD), and found their performance to be comparable.

A. Anagnostopoulos, L. Becchetti, I. Mele, S. Leonardi, and P. Sankowski

Stochastic Query Covering [pdf]

Proc. 4th ACM International Conference on Web Search and Data Mining (WSDM 2011), 2011

Show abstract

Hide abstract

In this paper we introduce the problem of query covering as a means to efficiently cache query results. The general idea is to populate the cache with documents that contribute to the result pages of a large number of queries, as opposed to caching the top documents for each query. It turns out that the problem is hard and solving it requires knowledge of the structure of the queries and the results space, as well as knowledge of the input query distribution. We formulate the problem under the framework of stochastic optimization; theoretically it can be seen as a stochastic universal version of set multicover. While the problem is NP-hard to be solved exactly, we show that for any distribution it can be approximated using a simple greedy approach. Our theoretical findings are complemented by experimental activity on real datasets, showing the feasibility and potential interest of query-covering approaches in practice.


A. Anagnostopoulos, L. Becchetti, C. Castillo, and A. Gionis

An Optimization Framework for Query Recommendation [pdf]

Proc. 3rd ACM International Conference on Web Search and Data Mining (WSDM 2010), 2010

Show abstract

Hide abstract

Query recommendation is an integral part of modern search engines. The goal of query recommendation is to facilitate users while searching for information. Query recommendation also allows users to explore concepts related to their information needs.

In this paper, we present a formal treatment of the problem of query recommendation. In our framework we model the querying behavior of users by a probabilistic reformulation graph, or query-flow graph [Boldi et al. CIKM 2008]. A sequence of queries submitted by a user can be seen as a path on this graph. Assigning score values to queries allows us to define suitable utility functions and to consider the expected utility achieved by a reformulation path on the query-flow graph. Providing recommendations can be seen as adding shortcuts in the query-flow graph that "nudge" the reformulation paths of users, in such a way that users are more likely to follow paths with larger expected utility.

We discuss in detail the most important questions that arise in the proposed framework. In particular, we provide examples of meaningful utility functions to optimize, we discuss how to estimate the effect of recommendations on the reformulation probabilities, we address the complexity of the optimization problems that we consider, we suggest efficient algorithmic solutions, and we validate our models and algorithms with extensive experimentation. Our techniques can be applied to other scenarios where user behavior can be modeled as a Markov process.


L. Becchetti, C. Castillo, D. Donato, R. Baeza-Yates, and S. Leonardi

Link Analysis for Web Spam Detection [pdf]

ACM Transactions on the Web, Volume 2, Number 1, 2008

Show abstract

Hide abstract

We propose link-based techniques for automating the detection of Web spam, a term referring to pages which use deceptive techniques to obtain undeservedly high scores in search engines. The issue of Web spam is widespread and difficult to solve, mostly due to the large size of the Web which means that, in practice, many algorithms are infeasible.

We perform a statistical analysis of a large collection of Web pages. In particular, we compute statistics of the links in the vicinity of every Web page applying rank propagation and probabilistic counting over the entire Web graph in a scalable way. We build several automatic web spam classifiers using different techniques. This paper presents a study of the performance of each of these classifiers alone, as well as their combined performance.

Based on these results we propose spam detection techniques which only consider the link structure of Web, regardless of page contents. These statistical features are used to build a classifier that is tested over a large collection of Web link spam. After ten-fold cross-validation, our best classifiers have a performance comparable to that of state-of-the-art spam classifiers that use content attributes, and orthogonal to their methods.

D. Donato, S. Leonardi, S. Millozzi, and P. Tsaparas

Mining the Inner Structure of the Web Graph

Journal of Physics A: Mathematical and Theoretical, Volume 41, 2008

Show abstract

Hide abstract

Despite being the sum of decentralized and uncoordinated efforts by heterogeneous groups and individuals, the World Wide Web exhibits a well defined structure, characterized by several interesting properties. This structure was clearly revealed by Broder et al. (Computer Networks 33, 2000) who presented the evocative bow-tie picture of the Web. Although, the bow-tie structure is a relatively clear abstraction of the macroscopic picture of the Web, it is quite uninformative with respect to the inner details of the Web graph. In this paper, we mine the inner structure of the Web graph. We present a series of measurements on the Web, which offer a better understanding of the individual components of the bow-tie. In the process, we develop algorithmic techniques for performing these measurements. We discover that the scale-free properties permeate all the components of the bow-tie which exhibit the same macroscopic properties as the Web graph itself. However, close inspection reveals that their inner structure is quite distinct. We show that the Web graph does not exhibit self similarity within its components, and we propose a possible alternative picture for the Web graph, as it emerges from our experiments.


D. Donato, L. Laura, S. Leonardi, and S. Millozzi

The Web as a Graph: How Far We Are

ACM Transactions on Internet Technology, Volume 7, Number 1, 2007

Show abstract

Hide abstract

In this article we present an experimental study of the properties of webgraphs. We study a large crawl from 2001 of 200M pages and about 1.4 billion edges, made available by the WebBase project at Stanford, as well as several synthetic ones generated according to various models proposed recently. We investigate several topological properties of such graphs, including the number of bipartite cores and strongly connected components, the distribution of degrees and PageRank values and some correlations; we present a comparison study of the models against these measures. Our findings are that (i) the WebBase sample differs slightly from the (older) samples studied in the literature, and (ii) despite the fact that these models do not catch all of its properties, they do exhibit some peculiar behaviors not found, for example, in the models from classical random graph theory.Moreover we developed a software library able to generate and measure massive graphs in secondary memory; this library is publicy available under the GPL licence. We discuss its implementation and some computational issues related to secondary memory graph algorithms.


C. Castillo, D. Donato, L. Becchetti, P. Boldi, S. Leonardi, M. Santini, and S. Vigna

A Reference Collection for Web Spam

ACM SIGIR Forum, Volume 40, Number 2, 2006

Show abstract

Hide abstract

We describe the WEBSPAM-UK2006 collection, a large set of Web pages that have been manually annotated with labels indicating if the hosts are include Web spam aspects or not. This is the first publicly available Web spam collection that includes page contents and links, and that has been labelled by a large and diverse set of judges.

D. Donato, L. Laura, S. Leonardi, U. Meyer, S. Millozzi, and J. Sibeyn

Algorithms and Experiments for the Webgraph

Journal of Graph Algorithms and Applications, Volume 10, Number 2, 2006

Show abstract

Hide abstract

In this paper we present an experimental study of the properties of web graphs. We study a large crawl from 2001 of 200M pages and about 1.4 billion edges made available by the WebBase project at Stanford [19], and synthetic graphs obtained by the large scale simulation of stochastic graph models for the Webgraph. This work has required the development and the use of external and semi-external algorithms for computing properties of massive graphs, and for the large scale simulation of stochastic graph models. We report our experimental findings on the topological properties of such graphs, describe the algorithmic tools developed within this project and report the experiments on their time performance.

L. Buriol, C. Castillo, D. Donato, S. Leonardi, and S. Millozzi

Temporal Analysis of the Wikigraph

Proc. 2006 IEEE/WIC/ACM International Conference on Web Intelligence (WI 2006), 2006

Show abstract

Hide abstract

Wikipedia is an online encyclopedia, available in more than 100 languages and comprising over 1 million articles in its English version. If we consider each Wikipedia article as a node and each hyperlink between articles as an arc we have a "Wikigraph", a graph that represents the link structure of Wikipedia. The Wikigraph differs from other Web graphs studied in the literature by the fact that there are explicit timestamps associated with each node's events. This allows us to do a detailed analysis of the Wikipedia evolution over time. In the first part of this study we characterize this evolution in terms of users, editions and articles; in the second part, we depict the temporal evolution of several topological properties of the Wikigraph. The insights obtained from the Wikigraphs can be applied to large Web graphs from which the temporal data is usually not available.

L. Becchetti, C. Castillo, D. Donato, S. Leonardi, and R. Baeza-Yates

Using Rank Propagation and Probabilistic Counting for Link-Based Spam Detection

Proc. 2006 Workshop on Web Mining and Web Usage Analysis (WebKDD 2006), 2006

Show abstract

Hide abstract

This paper describes a technique for automating the detection of Web link spam, that is, groups of pages that are linked together with the sole purpose of obtaining an undeservedly high score in search engines. The problem of Web spam is widespread and difficult to solve, mostly due to the large size of web collections that makes many algorithms unfeasible in practice.

L. Becchetti, C. Castillo, D. Donato, and A. Fazzone

A Comparison of Sampling Techniques for Web Graph Characterization

Proc. 2006 ACM Workshop on Link Analysis: Dynamics and Static of Large Networks (LinKDD 2006), 2006

Show abstract

Hide abstract

We present a detailed statistical analysis of the characteristics of partial Web graphs obtained by sub-sampling a large collection of Web pages.

We show that in general the macroscopic properties of the Web are better represented by a shallow exploration of a large number of sites than by a deep exploration of a limited set of sites. We also describe and quantify the bias induced by the different sampling strategies, and show that it can be significant even if the sample covers a large fraction of the collection.