Machinations


Workshop on Modern Massive Data Sets (MMDS)
July 16, 2014, 9:20 pm

[The following blog report on MMDS’14 was written by my student Mahdi Zamani – Ed]

Recently, I attended the MMDS workshop hosted by UC Berkeley. The workshop consisted of 40 talks spread over four days, plus a poster session. It focused on algorithmic and statistical problems in large-scale data analysis and brought together researchers from several areas, including computer science, mathematics, and statistics, as well as Big Data practitioners. On the third day of the workshop, I presented a poster about our recent work on multi-party sorting of large data sets. The following is a brief summary of a few talks that seemed most relevant to my current research projects. The workshop program is available here.

Dan Suciu from the University of Washington talked about running queries on very large databases. He argued that the cost of database queries has traditionally been measured in terms of disk I/O, but in Big Data query processing, where the database is stored on distributed clusters, communication is the bottleneck. The problem then becomes running a query on p servers in the Massively Parallel Communication (MPC) model, where the goal is to minimize the load on each server. In this model, the database records are partitioned among the p servers, each getting M/p records, where M is the total number of records. The database operations (such as SQL joins) are then performed on the partitions. As an example, Dan mentioned a popular query in Big Data analysis called the triangle query, defined over three relations: Q(x,y,z) = R(x,y) join S(y,z) join T(z,x). Using a naive approach, this query can be computed in two communication rounds, doing two separate joins. Surprisingly, though, it can be computed in a single communication round with reasonable load on each server, using a simple technique Dan described in [3].
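
To make the single-round idea concrete, here is a small Python simulation of a HyperCube-style shuffle for the triangle query. This is only a sketch under my own assumptions (toy relations, one shared hash function per attribute, p = d^3 servers arranged in a cube); it is not Dan's implementation, but it illustrates why one round suffices: every potential triangle (x,y,z) lands on exactly one server, which can then join locally with no further communication.

```python
# Single-round "HyperCube"-style shuffle for Q(x,y,z) = R(x,y) join S(y,z) join T(z,x),
# simulated on p = d^3 servers. Illustrative sketch only; relations and hash are made up.
from itertools import product

d = 2                                  # servers form a d x d x d cube, p = d^3
h = lambda v: hash(v) % d              # one shared hash per attribute (assumption)

# Toy relations: R(x,y), S(y,z), T(z,x)
R = {(1, 2), (4, 5)}
S = {(2, 3), (5, 6)}
T = {(3, 1), (6, 4)}

# Each server is addressed by a coordinate (i, j, k) = (h(x), h(y), h(z)).
servers = {c: {"R": set(), "S": set(), "T": set()} for c in product(range(d), repeat=3)}

# One communication round: each tuple is replicated along its missing dimension.
for (x, y) in R:
    for k in range(d):
        servers[(h(x), h(y), k)]["R"].add((x, y))
for (y, z) in S:
    for i in range(d):
        servers[(i, h(y), h(z))]["S"].add((y, z))
for (z, x) in T:
    for j in range(d):
        servers[(h(x), j, h(z))]["T"].add((z, x))

# Local, communication-free join on every server.
triangles = set()
for parts in servers.values():
    for (x, y) in parts["R"]:
        for (y2, z) in parts["S"]:
            if y2 == y and (z, x) in parts["T"]:
                triangles.add((x, y, z))

print(triangles)   # {(1, 2, 3), (4, 5, 6)}
```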

[Photo: Stanley Hall Auditorium]

Bill Howe from the University of Washington talked about Myria, a web-based framework written in Java for Big Data query processing. He was motivated by the fact that we, as scientists, spend more than 90% of our time handling data for our experiments rather than designing new techniques and algorithms (I think by scientists he means practitioners here; I believe theoreticians spend much less time working with data 🙂). Bill explained that although current optimizations of Hadoop are more than 100 times faster than Hadoop itself, we still need faster and simpler frameworks that non-expert scientists can use. The main idea behind Myria is to blur the distinction between relational algebra and linear algebra: every query in the relational domain is translated into simple linear-algebraic operations that can be computed very quickly. Myria provides an imperative-declarative language that runs queries in parallel.
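
As a toy illustration of that relational-algebra / linear-algebra correspondence (my own example in plain Python/SciPy, not Myria's query language), the self-join query "how many length-2 paths connect node i to node j?" is exactly a sparse matrix product on the adjacency matrix:

```python
# Toy illustration: a relational self-join expressed as linear algebra.
# My own example; it does not use Myria or its imperative-declarative language.
from collections import Counter
import numpy as np
from scipy.sparse import csr_matrix

# Relation Edge(src, dst) as a list of tuples.
edges = [(0, 1), (1, 2), (0, 2), (2, 3)]

# Relational view: self-join Edge with itself and count length-2 paths.
paths = Counter((a, c) for (a, b) in edges for (b2, c) in edges if b == b2)

# Linear-algebra view: the same query is A @ A on the adjacency matrix.
n = 4
rows, cols = zip(*edges)
A = csr_matrix((np.ones(len(edges)), (rows, cols)), shape=(n, n))
A2 = (A @ A).toarray()

assert all(A2[a, c] == cnt for (a, c), cnt in paths.items())
print(A2)   # entry (i, j) = number of length-2 paths from i to j
```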

Yiannis Koutis from the University of Puerto Rico talked about spectral algorithms for mining data from large graphs. Spectral analysis is an important tool for graph partitioning, where the vertex set of a large graph is divided into smaller components such that the number of edges running between the components is small. This is often an important subproblem for complexity reduction or parallelization. A cut that minimizes the number of such crossing edges is called the sparsest cut; it splits the graph into its most loosely connected pieces. Given the adjacency matrix A of a graph G, the general idea behind spectral graph partitioning is that the eigenvector associated with the second-smallest eigenvalue of the Laplacian matrix L of G can be thresholded to obtain a sparse cut of G (given A and the degree matrix of A, denoted by D, the Laplacian of G is L = D - A). Since finding the exact eigenvalues of the Laplacian is often computationally intensive, Yiannis argued that one can still find a good (sparse) cut by approximating them. However, it is an open problem whether one can find better cuts using other techniques with similar computational cost. At the end of the talk, I asked Yiannis whether probabilistic methods (such as random walks) are slower than spectral methods. He answered that probabilistic methods are usually local, and global random walks are often much slower than global spectral analysis.
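
Here is a minimal spectral-partitioning sketch on a toy graph of my own (it uses a dense eigensolver, not the fast approximate solvers Yiannis works on): build L = D - A, take the eigenvector of the second-smallest eigenvalue (the Fiedler vector), and split the vertices by its sign.

```python
# Minimal spectral partitioning on a toy graph: two 4-node cliques
# joined by a single bridge edge. Splitting by the sign of the Fiedler
# vector should cut only that bridge.
import numpy as np

edges  = [(i, j) for i in range(4) for j in range(i + 1, 4)]      # clique on {0,1,2,3}
edges += [(i, j) for i in range(4, 8) for j in range(i + 1, 8)]   # clique on {4,5,6,7}
edges += [(3, 4)]                                                 # the bridge edge

n = 8
A = np.zeros((n, n))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0

D = np.diag(A.sum(axis=1))    # degree matrix
L = D - A                     # (unnormalized) graph Laplacian

# Eigenpairs of the symmetric Laplacian, sorted by ascending eigenvalue.
eigvals, eigvecs = np.linalg.eigh(L)
fiedler = eigvecs[:, 1]       # eigenvector of the second-smallest eigenvalue

side_a = [v for v in range(n) if fiedler[v] < 0]
side_b = [v for v in range(n) if fiedler[v] >= 0]
print(side_a, side_b)         # recovers the two cliques; only edge (3, 4) is cut
```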

David Woodruff from IBM Almaden Research talked about an optimal CUR matrix decomposition technique. Given a matrix A, the goal is to find three matrices C, U, and R such that A ≈ CUR, where C consists of a small number of actual columns of A and R of a small number of actual rows. A solution to this problem can be used in many applications, such as recommender systems. For example, Netflix is interested in fast techniques for decomposing its huge users-by-movies rating matrix into a matrix of the most important users and a matrix of the most important movies. Since an exact decomposition is often very computationally expensive, an approximation is computed instead. Although there are fast randomized algorithms for this problem (e.g., [2]), David presented their asymptotically optimal deterministic algorithm, recently published at STOC 2014 [1].
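
To show what a CUR factorization looks like in practice, here is a minimal randomized sketch of my own (squared-norm sampling of columns and rows, then the best-fit middle factor); it is emphatically not the optimal algorithm from [1], just an illustration of the C, U, R roles.

```python
# Minimal randomized CUR sketch: sample columns and rows of A with
# probability proportional to their squared norms, then choose U so that
# C @ U @ R best reconstructs A, i.e. U = pinv(C) @ A @ pinv(R).
import numpy as np

rng = np.random.default_rng(0)

# A low-rank toy "users x movies" rating-like matrix.
A = rng.standard_normal((100, 8)) @ rng.standard_normal((8, 60))

def sample_indices(M, axis, k):
    """Pick k indices along `axis` with probability proportional to squared norms."""
    norms = (M ** 2).sum(axis=1 - axis)
    return rng.choice(M.shape[axis], size=k, replace=False, p=norms / norms.sum())

k = 10
C = A[:, sample_indices(A, axis=1, k=k)]        # k actual columns of A
R = A[sample_indices(A, axis=0, k=k), :]        # k actual rows of A
U = np.linalg.pinv(C) @ A @ np.linalg.pinv(R)   # best middle factor for this C and R

err = np.linalg.norm(A - C @ U @ R) / np.linalg.norm(A)
print(f"relative reconstruction error: {err:.3f}")
```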

Xavier Amatriain from Netflix talked about the current Big Data challenges of his company. By 2007, the company had over 100 million movie ratings, and it now holds about 5 billion ratings. The company currently has 44 million users around the world and holds more than 5 billion hours of movies. Users play about 50 million movies and submit 5 million ratings per day, which accounts for about 32% of daily downstream traffic in the US. From this huge pile of data, Netflix needs to extract ranking and similarity information about the movies. Their main approach to this end is to employ distributed machine learning algorithms. Xavier argued that in today's world more data sometimes beats better algorithms, backing up his claim with a quote from Peter Norvig, Director of Research at Google: "Google does not have better algorithms, only more data" (see this). Xavier continued by arguing that training algorithms can run in parallel because training data are often independent of each other. He also described how they have scaled their algorithms by distributing work at three different levels: across regions or subsets of the population, at the hyperparameter-optimization stage, and at the model-training level.
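
As a tiny sketch of the "parallelize at the hyperparameter level" idea (my own toy example, not Netflix's infrastructure): models trained with different hyperparameter settings are independent of one another, so they can be fit concurrently and only the best kept.

```python
# Embarrassingly parallel hyperparameter search for a toy ridge-regression model.
# Illustration only; the data, model, and penalty grid are made up.
from concurrent.futures import ProcessPoolExecutor
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((500, 20))
y = X @ rng.standard_normal(20) + 0.1 * rng.standard_normal(500)

def train_and_score(lam):
    """Fit ridge regression with penalty lam and return its validation error."""
    Xtr, ytr, Xva, yva = X[:400], y[:400], X[400:], y[400:]
    w = np.linalg.solve(Xtr.T @ Xtr + lam * np.eye(20), Xtr.T @ ytr)
    return lam, float(np.mean((Xva @ w - yva) ** 2))

if __name__ == "__main__":
    lambdas = [0.01, 0.1, 1.0, 10.0, 100.0]
    with ProcessPoolExecutor() as pool:        # one independent training job per setting
        results = list(pool.map(train_and_score, lambdas))
    best_lam, best_err = min(results, key=lambda r: r[1])
    print(f"best lambda: {best_lam}, validation MSE: {best_err:.4f}")
```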

[A polished version of this report is available here – Ed]

References

[1] Christos Boutsidis, David P. Woodruff: “Optimal CUR Matrix Decompositions”, Proceedings of the 46th Annual ACM Symposium on Theory of Computing (STOC), pp. 353–362, 2014. URL: http://doi.acm.org/10.1145/2591796.2591819.

[2] Michael W. Mahoney, Petros Drineas: “CUR Matrix Decompositions for Improved Data Analysis”, Proceedings of the National Academy of Sciences, 106(3), pp. 697–702, 2009.