graph.tex 4.78 KB
Newer Older
yorn's avatar
yorn committed
1 2 3 4
\chapter{Graph processing}
\label{chp:graph}

In order to process large graphs, a graph processing system is required.
yorn's avatar
yorn committed
5
A graph processing system uses a data model of vertices which are connected by edges~\cite{sakr2013processing}.
yorn's avatar
yorn committed
6
Vertices can have an ID and a value, edges can have a direction and a value.
yorn's avatar
yorn committed
7
The vertex values are modified during computation, and edges can be added and/or removed during computation.
yorn's avatar
yorn committed
8 9 10 11 12 13 14 15 16 17 18 19 20
A graph system can scale out horizontally (linearly with size), and is therefore able to process graph data at large scale.
A large graph system must also be able to handle failover, such as network, storage or hardware failure.
There are multiple graph systems available,
 for example GraphLab~\cite{low2010graphlab}, 
 GraphX~\cite{xin2013graphx}, 
 Pregel~\cite{Malewicz:2010:PSL:1807167.1807184} 
 and Giraph~\cite{avery2011giraph,ching2013scaling}.
This thesis will focus on Giraph,
 but other systems are expected to give similar results.

\section{Bulk Synchronous Parallel model}
\label{sec:bsp}

yorn's avatar
yorn committed
21
\begin{figure}[h]
yorn's avatar
yorn committed
22 23 24 25 26 27 28 29
	\caption{BSP programming model~\cite{sakr2013processing}}
	\label{fig:BSP}
	\centering
		\includegraphics[width=1\textwidth]{BSP}
\end{figure}

By running calculations in parallel, more computations per second are possible.
The case for parallel computing is strengthened by the fact that processing power is cheap nowadays,
yorn's avatar
yorn committed
30 31 32 33
 but fast processing is achieved by using many processors in parallel.
Parallel computing does, however, pose an important challenge,
 in that a single processor cannot share memory with another processor, and synchronisation is expensive,
 as it requires processors to wait, as memory cannot be reliably changed by two different processes.
yorn's avatar
yorn committed
34

yorn's avatar
yorn committed
35
The Bulk Synchronous Parallel model~\cite{Valiant:1990:BMP:79173.79181} (figure~\ref{fig:BSP}) is a model for parallel computing.
yorn's avatar
yorn committed
36
It consists of components that handle processing and/or memory.
yorn's avatar
yorn committed
37 38
Operations are divided in so-called supersteps.
A router allows components to exchange point to point messages at the end of a superstep execution,
yorn's avatar
yorn committed
39 40 41 42 43
 which allows for storage access between components.
The model is useful for running large calculations that will take a very long time with standard hardware.

\section{Giraph}
Giraph is a software based implementation of the the Bulk Synchronous Parallel model,
yorn's avatar
yorn committed
44 45
 and allows for a large amount of physical machines to form a cluster.
A Giraph computation program consists of a single \verb"calculate" function, which has two arguments; (1) the vertex and (2) the messages that were sent to the vertex during the last superstep.
yorn's avatar
yorn committed
46
During every superstep of the computation, a vertex may send a message to another vertex via one of its outgoing edges.
yorn's avatar
yorn committed
47
A vertex may also ``vote to halt'', to indicate that it is done calculating.
yorn's avatar
yorn committed
48 49 50
If a vertex votes to halt, it will not be called during subsequent supersteps, unless it receives a message from another vertex.
A vertex that has received a message will always be called, and any previous votes to halt are forgotten; the vertex will have to vote to halt again.

yorn's avatar
yorn committed
51 52
Before the computation is started, the graph is read by a \verb"VertexInputFormat" or an \verb"EdgeInputFormat".
After computation, the graph is written by a \verb"VertexOutputFormat" or \verb"EdgeOutputFormat"\footnote{\url{https://giraph.apache.org/io.html}}.
yorn's avatar
yorn committed
53
When the program is started, Giraph will start worker processes on the physical machines.
yorn's avatar
yorn committed
54
The workers will each read a part of the graph, Giraph will distribute the graph, workers will execute supersteps and after the last step is completed, the workers write their part of the graph.
yorn's avatar
yorn committed
55 56 57 58 59 60 61

Giraph takes care of the communication between the workers, and distributes the workload over the workers.
Due to the nature of BSP, it is important for Giraph that every worker will take roughly the same amount of time to complete a superstep.
After each superstep, the nodes have to synchronise and start working on the next superstep, which means that if one worker is slow, all other workers have to wait for it to finish.

\section{Design choices}
\subsection{Optimisations}
yorn's avatar
yorn committed
62 63
When implementing a computation for a graph system, such as Giraph, it is important to make the run-time of each computation as short as possible.
During each superstep, all workers will run computations, so that each vertex is computed once per superstep.
yorn's avatar
yorn committed
64
During a single computation, all incoming messages to the vertex for that superstep must be handled, and possibly all outgoing edges must be iterated over to send messages to neighbouring vertices.
yorn's avatar
yorn committed
65
Iteration is a weak spot for parallel systems, as it is not executed in parallel.
yorn's avatar
yorn committed
66
Therefore, in order to cut on computation time, vertices should have as few edges as possible.
yorn's avatar
yorn committed
67
%Section~\ref{sec:model_optimisation} describes how SpreadRank optimises the model.
yorn's avatar
yorn committed
68 69 70 71 72 73




% "*** Explain a bit how you have done and which choices you made in your thesis work."