Data analysis over massive amounts of data have become a major challenge for scientific discovery as various scientific communities embrace computational techniques and, consequently, generate more data. Graphs have re-emerged as important data structures within this context as many of the data generated and used by these domains can be represented as graphs. Examples include computational biology, chemical data analysis, drug discovery, health informatics analysis, social networking, web link analysis and communication networks. As the amount of data has increased, the management and querying of graph data, along with their analysis, have become an important data management concern.
In the last few years, there have been a tremendous interest in building parallel and distributed graph processing systems. Unfortunately, with the exception of RDF stores, every system uses different datasets and queries to assess its scalability and efficiency. This makes it challenging (and sometimes impossible) to conduct a meaningful comparison. Our aim is to close this gap by introducing Waterloo Graph Benchmark (WGB), and by conducting independent studies on the most promising graph systems in the literature. The value of these studies is not limited by performance evaluation, in many cases we could propose new techniques to solve some system deficiencies or discover the optimal configuration settings.
Towards a Universal Graph Benchmark:
We proposed Waterloo Graph Benchmark (WGB), a benchmark for graph processing systems that offers an efficient generator. This generator can efficiently create dynamic large graphs with properties similar to real-life ones. WGB includes the basic graph queries which are used for building graph applications.
Ammar, Khaled, Tamer M. Ozsu. WGB: Towards Universal Graph Benchmark The Fourth Workshop on Big Data Benchmarking, Oct. 2013.
Graph Processing Taxonomy:
This project qualitatively discuss the key characteristics of graph systems. We use these characteristics as dimensions to categorize graph systems. These dimensions are divided in two groups. First, dimensions related to graph processing techniques, such as graph models, workload types, algorithm types, graph dynamicity types, and computation models. The second group has dimensions related to how systems implement these techniques. Implementation details include system architecture, memory model, and partitioning strategies.
Performance Evaluation for Pregel-like Systems:
The introduction of Google’s Pregel generated much interest in the field of large-scale graph data processing, inspiring the development of Pregel-like systems such as Apache Giraph, GPS, Mizan, and GraphLab, all of which have appeared in the past two years. To gain an understanding of how Pregel-like systems perform, we conduct a study to experimentally compare Giraph, GPS, Mizan, and GraphLab on equal ground by considering graph and algorithm agnostic optimizations and by using several metrics. The systems are compared with four different algorithms (PageRank, single source shortest path, weakly connected components, and distributed minimum spanning tree) on up to 128 Amazon EC2 machines. We find that the system optimizations present in Giraph and GraphLab allow them to perform well. Our evaluation also shows Giraph 1.0.0’s considerable improvement since Giraph 0.1 and identifies areas of improvement for all systems.
Minyang Han, Khuzaima Daudjee, Khaled Ammar, M. Tamer Özsu, Xingfang Wang, Tianqi Jin: An Experimental Comparison of Pregel-like Graph Processing Systems. PVLDB 7(12): 1047-1058 (2014)
Performance Evaluation for Map-Reduce Systems:
In this project, we are interested in mapreduce platform and its potential optimization for graph analytics. Most graph analytics algorithms require executing an iterative process until convergence. Although, many graph processing systems were recently proposed, most of them require the input graph and all auxiliary data structure to stay in memory. Therefore, in a shared environment, a workload may wait for very long time before the required resources are available. On the other hand, mapreduce frameworks, such as Hadoop, Haloop, and Spark, can perform the same task with significantly less resources.
Graph Benchmarking As-A-Service:
We would like to make our graph generator and our benchmarking platform available as a service using Amazon AWS machines. This will hopefully simplify the process of evaluating a new system or the process of evaluating a new workload for all systems. We are also planning to build several APIs for our graph benchmark (WGB) to make it accessible for the research community.