graph.tex 4.78 KB
 yorn committed Jul 03, 2014 1 2 3 4 \chapter{Graph processing} \label{chp:graph} In order to process large graphs, a graph processing system is required.  yorn committed Jul 04, 2014 5 A graph processing system uses a data model of vertices which are connected by edges~\cite{sakr2013processing}.  yorn committed Jul 03, 2014 6 Vertices can have an ID and a value, edges can have a direction and a value.  yorn committed Jul 04, 2014 7 The vertex values are modified during computation, and edges can be added and/or removed during computation.  yorn committed Jul 03, 2014 8 9 10 11 12 13 14 15 16 17 18 19 20 A graph system can scale out horizontally (linearly with size), and is therefore able to process graph data at large scale. A large graph system must also be able to handle failover, such as network, storage or hardware failure. There are multiple graph systems available, for example GraphLab~\cite{low2010graphlab}, GraphX~\cite{xin2013graphx}, Pregel~\cite{Malewicz:2010:PSL:1807167.1807184} and Giraph~\cite{avery2011giraph,ching2013scaling}. This thesis will focus on Giraph, but other systems are expected to give similar results. \section{Bulk Synchronous Parallel model} \label{sec:bsp}  yorn committed Jul 04, 2014 21 \begin{figure}[h]  yorn committed Jul 03, 2014 22 23 24 25 26 27 28 29  \caption{BSP programming model~\cite{sakr2013processing}} \label{fig:BSP} \centering \includegraphics[width=1\textwidth]{BSP} \end{figure} By running calculations in parallel, more computations per second are possible. The case for parallel computing is strengthened by the fact that processing power is cheap nowadays,  yorn committed Jul 04, 2014 30 31 32 33  but fast processing is achieved by using many processors in parallel. Parallel computing does, however, pose an important challenge, in that a single processor cannot share memory with another processor, and synchronisation is expensive, as it requires processors to wait, as memory cannot be reliably changed by two different processes.  yorn committed Jul 03, 2014 34   yorn committed Jul 04, 2014 35 The Bulk Synchronous Parallel model~\cite{Valiant:1990:BMP:79173.79181} (figure~\ref{fig:BSP}) is a model for parallel computing.  yorn committed Jul 03, 2014 36 It consists of components that handle processing and/or memory.  yorn committed Jul 04, 2014 37 38 Operations are divided in so-called supersteps. A router allows components to exchange point to point messages at the end of a superstep execution,  yorn committed Jul 03, 2014 39 40 41 42 43  which allows for storage access between components. The model is useful for running large calculations that will take a very long time with standard hardware. \section{Giraph} Giraph is a software based implementation of the the Bulk Synchronous Parallel model,  yorn committed Jul 04, 2014 44 45  and allows for a large amount of physical machines to form a cluster. A Giraph computation program consists of a single \verb"calculate" function, which has two arguments; (1) the vertex and (2) the messages that were sent to the vertex during the last superstep.  yorn committed Jul 03, 2014 46 During every superstep of the computation, a vertex may send a message to another vertex via one of its outgoing edges.  yorn committed Jul 04, 2014 47 A vertex may also vote to halt'', to indicate that it is done calculating.  yorn committed Jul 03, 2014 48 49 50 If a vertex votes to halt, it will not be called during subsequent supersteps, unless it receives a message from another vertex. A vertex that has received a message will always be called, and any previous votes to halt are forgotten; the vertex will have to vote to halt again.  yorn committed Jul 04, 2014 51 52 Before the computation is started, the graph is read by a \verb"VertexInputFormat" or an \verb"EdgeInputFormat". After computation, the graph is written by a \verb"VertexOutputFormat" or \verb"EdgeOutputFormat"\footnote{\url{https://giraph.apache.org/io.html}}.  yorn committed Jul 03, 2014 53 When the program is started, Giraph will start worker processes on the physical machines.  yorn committed Jul 04, 2014 54 The workers will each read a part of the graph, Giraph will distribute the graph, workers will execute supersteps and after the last step is completed, the workers write their part of the graph.  yorn committed Jul 03, 2014 55 56 57 58 59 60 61  Giraph takes care of the communication between the workers, and distributes the workload over the workers. Due to the nature of BSP, it is important for Giraph that every worker will take roughly the same amount of time to complete a superstep. After each superstep, the nodes have to synchronise and start working on the next superstep, which means that if one worker is slow, all other workers have to wait for it to finish. \section{Design choices} \subsection{Optimisations}  yorn committed Jul 04, 2014 62 63 When implementing a computation for a graph system, such as Giraph, it is important to make the run-time of each computation as short as possible. During each superstep, all workers will run computations, so that each vertex is computed once per superstep.  yorn committed Jul 03, 2014 64 During a single computation, all incoming messages to the vertex for that superstep must be handled, and possibly all outgoing edges must be iterated over to send messages to neighbouring vertices.  yorn committed Jul 04, 2014 65 Iteration is a weak spot for parallel systems, as it is not executed in parallel.  yorn committed Jul 03, 2014 66 Therefore, in order to cut on computation time, vertices should have as few edges as possible.  yorn committed Jul 04, 2014 67 %Section~\ref{sec:model_optimisation} describes how SpreadRank optimises the model.  yorn committed Jul 03, 2014 68 69 70 71 72 73  % "*** Explain a bit how you have done and which choices you made in your thesis work."