Commit 282c5bab authored by yorn

Turned in

parent 3dad55b6
\pagestyle{empty}
\begin{abstract}
\noindent Anomaly detection in internet traffic is currently largely based on quantifying traffic data.
\noindent Anomaly detection in internet traffic today is largely based on quantifying traffic data.
This thesis proposes a new algorithm SpreadRank,
which detects spreading of traffic on the internet.
Spreading can be used as an additional metric for traffic anomaly detection, and can be observed from large scale traffic logs from core routers.
Studying spreading is a useful tool in determining the role of an end-host and identifying malicious behaviour.
SpreadRank uses large scale graph processing to calculate spreading from multiple gigabytes of traffic data.
which detects spreading of internet traffic as an additional metric for traffic anomaly detection.
SpreadRank uses large scale graph processing to calculate spreading from multiple gigabytes of NetFlow data obtained from core routers.
Studying spreading is a useful tool in determining the role of an end-host and in identifying malicious behaviour.
\end{abstract}
\pagestyle{empty}
\renewcommand{\abstractname}{Sammendrag}
\begin{abstract}
\noindent NORWEGIANABSTRACT
\noindent Anomalideteksjon i nettrafikk i dag er stortsett basert på mengder trafikk.
Denne avhandlingen introduserer en ny algoritme SpreadRank,
som vil detektere spredning i nettrafikk som et ekstra målepunkt for anomalideteksjon.
SpreadRank vil bruke flere gigabytes i NetFlow logg fra core-rutere i storskala grafprosessering for å beregne spredning.
Spredning er et nyttig verktøy for å undersøke hva slags rolle en datamaskin på nettet har og for å finne skadelig oppførsel.
\end{abstract}
\ No newline at end of file
......@@ -2,57 +2,57 @@
\label{chp:anomalies}
This thesis will propose the SpreadRank algorithm,
which will detect traffic spreading on OSI layer 4.
What this means, is that it will detect end-hosts (clients and servers) that initiate the same kind of connections that they receive.
The discriminator used to identify same-kind connections is the \gls{contact port} number associated with a flow.
The assumption is that a service both receiving and initiating connections to the same port is an anomaly.
This thesis will introduce the SpreadRank algorithm,
which will detect traffic \gls{spreading} on the \gls{transport layer}.
This means that it will detect hosts (clients and servers) that initiate the same kind of connections that they receive.
The discriminator used to identify same-kind connections is the \emph{\gls{contact port} number} associated with the flow.
The assumption is that a \gls{service} both receiving and initiating connections to the same port is an anomaly.
For example, a workstation with a web browser and e-mail client will initiate connections to ports 80 (HTTP), 443 (HTTPS), 587 (SMTP submission) and 143 or 993 (IMAP and IMAPS respectively).
However, it is not expected that such a workstation would \emph{receive} connections towards these port numbers.
Neither is it expected that the web server would initiate connections directed towards port 80 or 443.
Neither is it expected that a web server would \emph{initiate} connections directed towards port 80 or 443.
This is not true for all protocols.
For example, a mail server can forward e-mail to another mail server via port 25 (SMTP),
and this second mail server may forward it to a third mail server over port 25 and so forth.
There are several other protocols that exhibit natural spreading, these are described in section~\ref{sec:proto_spreading}.
There are several other protocols that exhibit natural \gls{spreading}; these are described in section~\ref{sec:proto_spreading}.
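The contact-port rule described above can be sketched in a few lines of Python. This is a minimal illustration and not the SpreadRank implementation: the flow layout `(source, destination, contact_port)`, the host names, and the function name `same_port_spreaders` are assumptions made for this example only.

```python
from collections import defaultdict

def same_port_spreaders(flows):
    """Return (host, port) pairs that both receive and initiate flows on that port.

    `flows` is an iterable of (src_host, dst_host, contact_port) tuples;
    this layout is assumed for illustration only.
    """
    initiated = defaultdict(set)  # (host, port) -> hosts it connected to
    received = defaultdict(set)   # (host, port) -> hosts that connected to it
    for src, dst, port in flows:
        initiated[(src, port)].add(dst)
        received[(dst, port)].add(src)
    # A service is flagged only when the same port shows up in both roles.
    return sorted(key for key in initiated if key in received)

flows = [
    ("workstation", "web", 443),  # ordinary client traffic
    ("client", "mail-a", 25),     # mail submitted to a first server
    ("mail-a", "mail-b", 25),     # forwarded to a second server
]
print(same_port_spreaders(flows))
```

In this toy input only the forwarding mail server is reported, which matches the SMTP example in the text: it receives and initiates connections towards port 25.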
\section{Servers undergoing maintenance}
\begin{figure}[h!]
\begin{figure}[h]
\caption{Observed spreading caused by a server under maintenance}
\label{fig:spread_maintenance}
\centering
\includegraphics[width=0.5\textwidth]{spread_maintenance}
\end{figure}
Figure~\ref{fig:spread_maintenance} shows spreading caused by a server undergoing maintenance.
Figure~\ref{fig:spread_maintenance} shows \gls{spreading} caused by a server undergoing maintenance.
A server undergoing maintenance may connect to another server,
for example to get the latest software updates.
Additionally, there exists server software that includes a graphical user interface,
which gives the server the same look and feel as a workstation.
On these servers, the administrator may be inclined to use a web browser to download software or even browse the web.
On these servers, the administrator may be inclined to use a web browser to download software, or even browse the web.
\section{Home connections which run a personal server}
\begin{figure}[h!]
\begin{figure}[h]
\caption{Observed spreading caused by a single home server}
\label{fig:spread_home}
\centering
\includegraphics[width=0.5\textwidth]{spread_home}
\end{figure}
Figure~\ref{fig:spread_home} shows spreading caused by servers run by home users on a shared public IP address.
Figure~\ref{fig:spread_home} shows \gls{spreading} caused by servers run by home users on a shared public IP address.
Users with a home server will often run the home server on the same IP address as the client.
This is done either by simply using the same machine as server and client,
or by using a NAT router, allowing a separate client and server machine to share a public IP address.
When a client and a server share an IP address, this IP address will generally both receive and initiate flows.
When a client and a server share an IP address, this IP address will generally both receive and initiate connections.
Since there is no direct way to distinguish whether incoming and outgoing flows are related,
such a situation might register as a case of spreading.
However, home servers are quite common, but a trail of home users with servers contacting each other can show up as an anomaly.
such a situation might register as a case of \gls{spreading}.
However, although home servers are quite common, a long trail of users with home servers contacting each other can appear to be an anomaly.
\section{VPN servers}
\begin{figure}[h!]
\begin{figure}[h]
\caption{Observed spreading caused by a VPN server}
\label{fig:spread_vpn}
\centering
......@@ -60,39 +60,42 @@ However, home servers are quite common, but a trail of home users with servers c
\end{figure}
Clients can set up a secure connection to a VPN server (figure~\ref{fig:spread_vpn}).
All internet traffic initiated at the client (apart from the VPN traffic itself) will be sent to the VPN server in encapsulated form.
The VPN server removes the encapsulating, and forwards the traffic towards the internet, possibly mimicking a client.
Thus, if the client tunnels another VPN connection through the VPN server, the VPN server may appear to be a VPN client as well.
All internet traffic initiated at the client (apart from the VPN traffic itself) will be sent to the VPN server in encapsulated form.
The VPN server removes the encapsulation and forwards the traffic towards the internet, mimicking the client.
Thus, if the client uses another VPN service through this VPN server, the VPN server may appear to be a VPN client as well.
Some providers allow using SSH, HTTP or HTTPS for VPN.
Some software allows using SSH, HTTP or HTTPS for VPN.
Examples of software that can do this are sshttp~\cite{github:stealth:sshttp} and Microsoft RRAS~\cite{rras}.
When a user connects to a VPN server over HTTPS, and then uses the VPN to connect to a HTTPS host, this can register as spreading.
However, some VPN servers allocate dedicated public IP addresses to their clients,
which means that connections are not forwarded over the same IP, which means that the VPN does not register as spreading.
When a user connects to a VPN server over HTTPS, and then uses the VPN to connect to an HTTPS host, this can register as \gls{spreading}.
However, some VPN providers allocate dedicated public IP addresses to their clients,
so that connections are not forwarded over the same IP address and the VPN does not register as \gls{spreading}.
\section{Protocols with natural spreading}
\label{sec:proto_spreading}
\subsection{BGP}
BGP is a protocol used by routers to exchange routing information.
Even though routers are level three devices themselves, they exchange routing information with each other the same way end-hosts do.
Routers both listen for incoming BGP connections, and send BGP to neighbouring routers at regular intervals.
Since routers both receive and initiate BGP flows, BGP spreading is expected to be high.
\Gls{BGP} is a protocol used by routers to exchange routing information.
Even though routers are layer-three devices themselves, they exchange routing information with each other the same way hosts do.
Routers both listen for incoming \gls{BGP} connections, and send \gls{BGP} messages to neighbouring routers at regular intervals.
Since routers both receive and initiate \gls{BGP} flows, \gls{BGP} \gls{spreading} is expected to be high.
\subsection{DNS}
Clients use DNS to find the IP address associated to a hostname.
If the DNS server does not have the record in its cache,
DNS is used to find the IP address associated with a hostname.
There are two types of DNS servers: \emph{recursive} and \emph{authoritative}.
When a \emph{recursive} DNS server receives a query and does not have the requested record in its cache,
it must forward the query to an upstream server,
or find the authoritative DNS server for that host and forward the query there.
Forwarding DNS is expected to cause high spreading.
or find the authoritative DNS server for that record and forward the query there.
An \emph{authoritative} server has collections of records (zones) configured,
and will only answer queries about these zones.
Recursive DNS servers are expected to have high \gls{spreading}; authoritative DNS servers are expected to have low \gls{spreading}.
\subsection{SMTP}
Most e-mail clients are configured with a so-called ``smart host'',
which is the server that all outgoing mail is sent to.
This smart host will then find out which server handles e-mail for the recipient, and forward the message there.
In some cases, the e-mail will pass multiple servers before finally reaching a mailbox.
Naturally, this means that some spreading in the SMTP protocol is expected.
In some cases, the e-mail will pass through multiple servers before finally reaching a mailbox.
Naturally, this means that some \gls{spreading} in the SMTP protocol is expected.
\section{HTTP/HTTPS}
......@@ -102,7 +105,7 @@ SpreadRank may therefore observe trails of machines connecting over port 80 or 4
without these flows actually being related.
There is currently no direct way to distinguish between the different services provided over HTTP or HTTPS purely on the basis of NetFlow data.
%Possibly less clear cases of spreading are \gls{SSH} and \gls{HTTP}.
%Possibly less clear cases of \gls{spreading} are \gls{SSH} and \gls{HTTP}.
%Although these protocols are meant for client-server communication (\gls{SSH} allows clients to connect to a remote shell, \gls{HTTP} is the protocol used for web pages),
% it is possible that a user will use the SSH session to another SSH server from there.
%A proxy web server can request a page on its clients behalf.
......@@ -110,33 +113,34 @@ There is currently no direct way to view the difference between different servic
\section{Combination}
\begin{figure}[h!]
\begin{figure}[h]
\caption{Observed spreading caused by a combination of benign factors}
\label{fig:spread_combi}
\centering
\includegraphics[width=0.75\textwidth]{spread_combi}
\end{figure}
A combination of different occurrences of spreading may lead to larger spreading, as shown in figure~\ref{fig:spread_combi}.
However, the depth, is still relatively limited.
This means that relatively short paths are not anomalies.
%From which length spreading becomes an anomaly depends on the protocol and the size of the NetFlow dataset,
A combination of different occurrences of \gls{spreading} may lead to larger \gls{spreading}, as shown in figure~\ref{fig:spread_combi};
however, the longest path is still relatively short, which means that relatively short paths are not anomalies.
Longer paths will typically be anomalies, though how long a path must be before it is considered an anomaly depends on the protocol.
For instance, one would expect higher \gls{spreading} in SMTP and DNS than in HTTP.
%From which length \gls{spreading} becomes an anomaly depends on the protocol and the size of the NetFlow dataset,
% but from the results of the experiment, 20-30 steps seem like a reasonable amount.
%Because of this, if only a single host is observed generating the same kind of traffic as it receives, no conclusions can be drawn from that.
%However, if there is a long trail of hosts sending the same kind of traffic, there is a good chance that this is a worm at work.
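The notion of path length used in this section can be made concrete with a small Python sketch. It is an illustration under stated assumptions rather than the SpreadRank computation itself: edges are `(initiator, receiver)` pairs for a single contact port, the graph is assumed to be acyclic (a host that is already on the current path is simply not followed again), and the name `trail_lengths` is hypothetical.

```python
def trail_lengths(edges):
    """Map each host to the number of hops in its longest outgoing trail.

    `edges` holds (initiator, receiver) pairs for one contact port.
    The graph is assumed to be acyclic; the cycle guard below only
    prevents endless recursion on unexpected input.
    """
    successors = {}
    for src, dst in edges:
        successors.setdefault(src, set()).add(dst)
        successors.setdefault(dst, set())

    lengths = {}

    def walk(host, on_path):
        if host in lengths:
            return lengths[host]
        path = on_path | {host}
        best = 0
        for nxt in successors[host]:
            if nxt not in path:  # cycle guard
                best = max(best, 1 + walk(nxt, path))
        lengths[host] = best
        return best

    for host in successors:
        walk(host, frozenset())
    return lengths

edges = [("a", "b"), ("b", "c"), ("c", "d"), ("x", "d")]
lengths = trail_lengths(edges)
print(sorted(lengths.items()))
```

A per-protocol threshold on these lengths would then separate short benign trails from long ones; the value of such a threshold is not fixed by this sketch.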
\section{Worm infection}
A worm is a piece of software that is programmed to infect other hosts in such a way that these hosts will run the worm, too.
It infects a host by looking for systems that run unpatched software,
and using a known security hole to get access to the service and install itself.
After infection, the victim will start doing the same.
A worm is a piece of software that is programmed to infect other hosts and start instances of itself on these hosts.
It infects hosts by looking for systems that run unpatched software,
abusing a known security hole to get access to the service and install itself on the host.
After infection, the victim host will also try to infect other hosts~\cite{pastor2001epidemic}.
This process can go on indefinitely, until a system administrator starts blocking these flows.
Consequently, a worm that successfully spreads over many end-hosts, will show large spreading.
Consequently, a worm that successfully spreads over many hosts will show large \gls{spreading}.
\section{Peer to peer traffic}
Peer to peer traffic is another case of spreading that happens intentionally;
Peer to peer traffic is another case of \gls{spreading} that happens intentionally;
a hallmark example is BitTorrent.
BitTorrent is a protocol for distributed file sharing, where all clients will make chunks they have downloaded available to other clients.
This differs from the traditional client/server model, where clients do not spread information to other clients but get all their information from a central server.
......@@ -148,21 +152,22 @@ before sending it to the final destination.
Tor also operates on random ports, which makes it difficult to match flows together.
\subsection{Stealth worm}
Due to BitTorrent and Tor being able to hide their activity from an algorithm as SpreadRank,
\label{sec:stealth}
Due to BitTorrent and Tor being able to hide their activity from an algorithm such as SpreadRank,
the question arises whether a worm can do the same thing to hide its activity.
The important difference between benign traffic, such as BitTorrent and Tor, and malicious traffic such as worms,
is consent from the owner of the host.
If a host is participating in BitTorrent or Tor, it is because the owner of the host voluntarily installed software to supports these protocols.
If a host is participating in BitTorrent or Tor, it is because the owner of this host voluntarily installed software that supports these protocols.
Because these protocols can rely on software being installed, they can use virtually any connection model imaginable.
This is different for malware;
the initial infection must happen through a known vulnerability in the host.
This vulnerability will typically be part of the operating system or some type software that is installed.
Malware has therefore limited possibilities for infection, and must follow the specification of the software it wants to attack, which will often run on a standard port.
This vulnerability will typically be part of the operating system or part of installed software.
Malware therefore has limited possibilities for infection, and must use the port of the service it wants to attack.
Thus, in order to attack a specific service, malware must target the same port for every attack.
Once a host has been infected, the malware can install itself, and it may communicate with command-and-control servers through stealth communication channels that may not be as easy to detect.
However, it will in all likelihood still try to attack other hosts, to increase the amount of infected hosts, for which it will have to generate observable traffic, for reasons mentioned earlier.
Once a host has been infected, the malware can install itself, and it may communicate with command-and-control servers through stealth communication channels that may not be as easy to detect;
however, it will in all likelihood still try to attack other hosts to increase the number of infected hosts, for which it will have to generate observable traffic, for reasons mentioned earlier.
......@@ -7,13 +7,13 @@ This appendix will show the commands used to convert NetFlow to \gls{csv},
start SpreadRank with these \gls{csv} files,
and filter the results.
\newpage
\section{Convert NetFlow to CSV}
NetFlow can be converted to a \gls{csv} file, with non-UDP and non-TCP flows filtered out, as well as flows between port numbers that are both over 1024.
This conversion is done using FlowConvert.sh, but in order to generate separate \gls{csv} files, FlowConvert must be run for each timeframe that a \gls{csv} file should span.
This conversion is done using FlowConvert.sh, though in order to generate separate \gls{csv} files, FlowConvert must be run for each time frame that a \gls{csv} file should span.
This snippet of \gls{bash} code will run FlowConvert.sh 31 times, once for each day of the month.
The end result will be a directory named \verb"trd_gw1_12_filtered.csv" with 31 files.
\subsection{NetflowCSVEdgeInputFormat.java}
\begin{verbatim}
seq 1 9 | while read nr
do
......@@ -25,10 +25,14 @@ The end result will be a directory named \verb"trd_gw1_12_filtered.csv" with 31
done
\end{verbatim}
\newpage
\section{Execute SpreadRank}
In order to execute SpreadRank on YARN,
YARN must be started with the SpreadRank jarfile,
and the full qualified class-name of the computation class.
the application must be submitted to the YARN manager,
so that it can deploy the SpreadRank jar file to all workers.
The command takes as arguments the path of the jar file,
and the fully qualified class-name of Giraph.
Giraph takes the fully qualified class-name of the SpreadRank computation class.
Additionally, a format for reading the graph at the start, and writing the graph at the end must be provided.
In this case, we use a directory to store the graph data.
The directory contains chunks of the full graph.
......@@ -47,4 +51,42 @@ The directory contains chunks of the full graph.
\end{verbatim}
\newpage
\section{Filter results}
The raw results from SpreadRank, even though outputted by a custom \verb"OutputFormat", contain a lot of information that is not relevant.
This simple \gls{bash} script will remove all vertices with a \gls{depth} of 1 or less, and all vertices with no spreading or no clients.
The program is built using these components:
\begin{itemize}[label={$\bullet$}]
\item \verb"cat": concatenate all parts of the output
\item \verb"grep": filter out all values equal to 0 or 1
\item \verb"sort": make sure that the highest spreading is returned first
\item \verb"sed": anonymize the results by removing IP addresses while keeping port numbers
\item \verb"less": easier reading of the results in a terminal
\end{itemize}
Components can be taken out if desired, to change the results.
Total execution time of this command on one month's worth of NetFlow data was about 10--15~seconds on UNINETT's server cluster.
\begin{verbatim}
cat rank-out/IPSpreadRank_gw1_12/part-m-* | egrep -vw '[01]' | sort -hrk 3 | sed 's/^.*://' | less
\end{verbatim}
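The same filtering steps can be sketched in Python, which may be easier to adapt than the shell pipeline. The input layout is an assumption: each line is taken to look like `address:port depth spreading`, separated by whitespace, and the real output of the custom \verb"OutputFormat" may differ. The function name `filter_results` and the sample addresses are hypothetical.

```python
def filter_results(lines):
    """Drop rows with a 0 or 1 in any numeric field, highest third field first.

    Each line is assumed to look like "address:port depth spreading"
    (whitespace separated); the real output layout may differ.
    """
    rows = []
    for line in lines:
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        values = [int(v) for v in fields[1:]]
        if any(v in (0, 1) for v in values):
            continue  # same intent as egrep -vw '[01]', applied to numbers only
        port = fields[0].rsplit(":", 1)[1]  # drop the IP address, keep the port
        rows.append((port, *values))
    # The third input field (index 2 of each row) decides the order, largest first.
    return sorted(rows, key=lambda row: row[2], reverse=True)

sample = [
    "10.0.0.1:25 3 4",
    "192.0.2.7:80 1 2",
    "198.51.100.9:53 2 7",
]
for row in filter_results(sample):
    print(*row)
```

One design difference is worth noting: because the sketch tests parsed fields instead of whole words on the line, a row is not discarded merely because an octet of its address happens to be 0 or 1, which a word match over the entire line would do.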
\newpage
\section{Aggregate}
This \gls{bash} script can be used to determine how often a protocol engages in spreading.
It will output two numbers per line, the first being the number of services observed and the second being the port number.
The program is built using these components:
\begin{itemize}[label={$\bullet$}]
\item \verb"cat": concatenate all parts of the output
\item \verb"grep": filter out all values equal to 0 or 1
\item \verb"sed": remove IP addresses, otherwise the list would show all IP addresses with count 1
\item \verb"cut": remove all but the first value (port number)
\item \verb"sort": set the port numbers in order, required for \verb"uniq".
\item \verb"uniq": aggregate the port numbers, and show count
\end{itemize}
Total execution time of this command on one month's worth of NetFlow data was about 10--15~seconds on UNINETT's server cluster.
\begin{verbatim}
cat rank-out/IPSpreadRank_gw1_12/part-m-* | egrep -vw '[01]' | sed 's/^.*://' | cut -f1 | sort | uniq -c
\end{verbatim}
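The aggregation can likewise be sketched in Python with a counter keyed on the port number. As before, the line layout `address:port depth spreading` is an assumption about the SpreadRank output, and `services_per_port` is a hypothetical name.

```python
from collections import Counter

def services_per_port(lines):
    """Count, per port, the services that remain after filtering out 0 and 1.

    Each line is assumed to look like "address:port depth spreading".
    """
    counts = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        if any(int(v) in (0, 1) for v in fields[1:]):
            continue  # vertex without meaningful depth or spreading
        port = int(fields[0].rsplit(":", 1)[1])
        counts[port] += 1
    return counts

sample = [
    "10.0.0.1:25 3 4",
    "10.0.0.2:25 2 5",
    "192.0.2.7:80 1 2",
    "198.51.100.9:53 2 7",
]
counts = services_per_port(sample)
for port, count in sorted(counts.items()):
    print(count, port)
```

The printed order (count first, port second) follows the order that \verb"uniq -c" produces in the shell version.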
\end{landscape}
......@@ -5,7 +5,7 @@ This thesis has stated the current practices in traffic anomaly detection.
Current traffic anomaly detection is based on aggregating flows,
or by identifying high-intensity traffic.
The thesis defines the concept of spreading,
which describes the phenomenon of an end-host initiating the same kind of connections it receives,
which describes the phenomenon of a host initiating the same kind of connections it receives,
where same-kind refers to usage of the same TCP/UDP ports.
It is argued that spreading is uncommon for most protocols on the internet today, which makes it an anomaly.
......@@ -21,28 +21,29 @@ Every service is scored on its longest path towards another service (\gls{depth}
and on how far it spreads its traffic (\gls{spreading}).
An analysis of the results shows that only five percent of all \emph{\gls{service}s} participate in \emph{\gls{spreading}},
and that for most end-hosts their role can be determined by simply looking at their spreading.
A spreading of zero typically indicates a server, one typically indicates a client (its traffic reaches to the servers) and a spreading of two often indicates a hybrid, but it can also indicate a simple proxy server.
and that for many hosts, their role can be determined by simply looking at their spreading.
A spreading of zero typically indicates a server, one typically indicates a client (its traffic reaches the servers), and a spreading of two often indicates a combination of both, though it can also indicate a simple proxy server.
Some protocols have natural spreading, for example \gls{BGP}, \gls{DNS} and \gls{SMTP};
implementations of these protocols are often both server and client.
\Gls{SSH} and \gls{HTTP} are popular protocols which do not have natural spreading,
but do exhibit spreading nevertheless.
implementations of these protocols often behave as both server and client.
\Gls{SSH} and \gls{HTTP} are popular protocols which do not have natural spreading;
however, these protocols exhibit spreading nevertheless.
\Gls{HTTP} has a high \gls{spreading} due to it being a protocol that is used for multiple purposes.
An \gls{HTTP} \gls{service} may itself use \gls{HTTP} to connect to another \gls{service}.
\Gls{SSH} allows users to ``hop'' from one SSH server to another.
Additionally, many home servers will listen on these ports.
SpreadRank has been successful in identifying spreading, and in doing so can be successfully used to find DNS resolvers,
BGP routers and mail servers on the network.
It has also been successful in finding hosts that participated in a botnet.
It does so based on NetFlow data from core routers, without sending data into the network itself.
\section{Future work}
\subsection{Automation}
The current implementation of SpreadRank requires the analyst to provide many manual steps.
A better EdgeInputReader would reduce the amount of conversions needed,
The current implementation of SpreadRank requires the analyst to execute many manual steps.
A better \verb"EdgeInputFormat" would reduce the number of conversions needed,
and speed up the overall process.
The output could be parsed to automatically find outliers.
The output could then be parsed to automatically find outliers.
\subsection{Test-data}
It was not known beforehand which attacks were present in the test-data.
......@@ -56,7 +57,7 @@ For a real-life application, real-time monitoring is required,
In order to work with real-time data, a sliding window is required,
in which flows are added to the graph as they are observed,
and old flows are removed.
Giraph does not currently support this, but systems that support this do exist, for example GraphX.
Giraph does not currently support this; however, systems that support this do exist, for example GraphX.
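The sliding window described above can be illustrated with a short Python sketch. It is independent of Giraph and GraphX and makes several assumptions: flows arrive in non-decreasing timestamp order, the window length `span` is given in seconds, and the class name `SlidingFlowWindow` is hypothetical.

```python
from collections import Counter, deque

class SlidingFlowWindow:
    """Keep only the flows observed during the last `span` seconds."""

    def __init__(self, span):
        self.span = span
        self.flows = deque()    # (timestamp, edge) in arrival order
        self.edges = Counter()  # edge -> number of flows inside the window

    def add(self, timestamp, src, dst, port):
        """Add a newly observed flow and drop flows that have aged out."""
        edge = (src, dst, port)
        self.flows.append((timestamp, edge))
        self.edges[edge] += 1
        self._expire(timestamp)

    def _expire(self, now):
        while self.flows and self.flows[0][0] <= now - self.span:
            _, old = self.flows.popleft()
            self.edges[old] -= 1
            if self.edges[old] == 0:
                del self.edges[old]  # the edge leaves the graph

window = SlidingFlowWindow(span=60)
window.add(0, "a", "b", 25)
window.add(30, "b", "c", 25)
window.add(70, "c", "d", 25)  # the flow observed at t=0 has now expired
print(sorted(window.edges))
```

Counting flows per edge means an edge is removed from the graph only when its last flow leaves the window, so repeated flows between the same pair of services do not cause premature removal.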
\subsection{IPv6}
The NetFlow data provided by UNINETT contains only flows between IPv4 hosts.
......@@ -67,6 +68,6 @@ Where many home servers currently share their public IPv4 address with clients d
IPv6 makes it possible to run the server and client on different IPv6 addresses.
Additionally, the use of privacy extensions in IPv6 (a technology that lets clients randomise their IP address for anonymity) will prove to be both a challenge and an advantage for SpreadRank.
It is a challenge because it will lead to more vertices; in the current model every observed IP address + port number is a vertex, which means that a new vertex is created every time a client switches IP address.
It is a challenge because it will lead to more vertices; in the current model, every observed IP address + port number pair is a vertex, which means that a new vertex is created every time a client switches IP address.
Randomised IP addresses are also an advantage, because it is not feasible to run a server on a randomised IP address.
This may make it easier to designate vertices as clients at a very early stage, which reduces computation time.
\begin{landscape}
\chapter{Diagrams}
\section{SpreadRank results}
\begin{figure}[h!]
\chapter{SpreadRank results}
\label{chp:diagrams}
This appendix contains diagrams of SpreadRank being executed on NetFlow logs from one of UNINETT's core routers.
These diagrams are made using different periods of NetFlow logs from the same core router.
\begin{figure}[h]
\caption{Vertices scored with SpreadRank using one day of flows}
\label{fig:1day}
\centering
\includegraphics[width=1.3\textwidth]{1day}
\includegraphics[width=1.4\textwidth]{1day}
\end{figure}
\begin{figure}[h!]
\begin{figure}[h]
\caption{Vertices scored with SpreadRank using one week of flows}
\label{fig:7day}
\centering
\includegraphics[width=1.3\textwidth]{7day}
\includegraphics[width=1.4\textwidth]{7day}
\end{figure}
\begin{figure}[h!]
\begin{figure}[h]
\caption{Vertices scored with SpreadRank using two weeks of flows}
\label{fig:14day}
\centering
\includegraphics[width=1.3\textwidth]{14day}
\includegraphics[width=1.4\textwidth]{14day}
\end{figure}
\begin{figure}[h!]
\begin{figure}[h]
\caption{Vertices scored with SpreadRank using one month of flows}
\label{fig:31day}
\centering
\includegraphics[width=1.3\textwidth]{31day}
\includegraphics[width=1.4\textwidth]{31day}
\end{figure}
\end{landscape}
......@@ -3,11 +3,11 @@
To test Giraph, two simple algorithms, DOSRank and ReverseDOSRank, are created, which count the number of outgoing and incoming edges of a vertex, respectively.
This proof of concept will count the number of flows per \gls{service}.
It can be used to detect if a machine is executing or suffering from a DoS attack,
It can be used to detect if a machine is executing or suffering from a \gls{DoS} attack,
as these are typically identified by a large number of flows.
These two algorithms will count the number of outgoing and the number of incoming edges, respectively.
DOSRank (equation~\ref{eq:dosrank}, algorithm~\ref{alg:dosrank}), which will calculate outgoing edges (incoming connections) is trivial; it will set its value during the first superstep and vote to halt.
ReverseDOSRank (equation~\ref{eq:reversedosrank}, algorithm~\ref{alg:reversedosrank}) is a bit more complex;
DOSRank~(\ref{eq:dosrank}), which will count outgoing edges (incoming connections), is trivial; it will set its value during the first superstep and then vote to halt.
ReverseDOSRank~(\ref{eq:reversedosrank}) is a bit more complex;
since incoming edges are not visible in Giraph, the first superstep is used to send a message over every edge.
During the second superstep, every vertex will receive a number of messages equal to its number of incoming edges.
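The two supersteps can be simulated in plain Python to show why message counting recovers the number of incoming edges. The sketch does not use the Giraph API; the adjacency dictionary and the function name `count_incoming` are assumptions made for illustration.

```python
def count_incoming(out_edges):
    """Simulate the two supersteps on an adjacency dict {vertex: [targets]}."""
    # Superstep 0: every vertex sends one message along each outgoing edge.
    inbox = {vertex: 0 for vertex in out_edges}
    for vertex, targets in out_edges.items():
        for target in targets:
            inbox[target] = inbox.get(target, 0) + 1
    # Superstep 1: each vertex stores the number of messages it received.
    return inbox

graph = {"a": ["c"], "b": ["c"], "c": []}
print(count_incoming(graph))
```

Each vertex only ever looks at its own outgoing edges and its own inbox, which is the restriction that makes the extra superstep necessary in a vertex-centric system.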
......@@ -21,17 +21,17 @@ During the second superstep, every vertex will receive an amount of messages tha
f(y) = N^{+}(y).
\end{equation}
\begin{algorithm}[H]
\label{alg:dosrank}
\begin{algorithm}[h]
\caption{DOSRank}
\label{alg:dosrank}
\begin{verbatim}
result = vertex.getNumEdges();
vertex.voteToHalt();
\end{verbatim}
\end{algorithm}
\begin{algorithm}[H]
\label{alg:reversedosrank}
\begin{algorithm}[h]
\caption{ReverseDOSRank}
\label{alg:reversedosrank}
\begin{verbatim}
if superstep = 0 then
begin
......@@ -55,15 +55,15 @@ The algorithm will find IP addresses that are associated with large amounts of f
Opening a large number of flows is different from sending large amounts of data,
as sending a large amount of data is often a completely legitimate thing to do;
for example, sharing files via peer-to-peer protocols or serving large files over HTTP.
A large amount of flows, however, may be an indication that something is amiss.
It can indicate port scanning, or even targeted attacks like SYN flooding.
However, a large amount of flows can also simply indicate a very active or popular host.
A large number of flows, on the other hand, may be an indication that something is amiss.
It can indicate port scanning, or even targeted attacks like SYN flooding;
however, a large number of flows can also simply indicate a very active or popular host.
\section{Purpose}
These algorithms are just a proof-of-concept to show that graph systems can be used for traffic analysis.
The algorithm itself is most likely easier to implement using other systems.
A simple MapReduce algorithm could probably yield the same results a smaller amount of time~\cite{Morken352472}.
However, this algorithm does show that Giraph can be used for traffic analysis,
A simple MapReduce algorithm could probably yield the same results in a smaller amount of time~\cite{Morken352472};
however, this algorithm does show that Giraph can be used for traffic analysis,
and it opens the door for more complex algorithms, like SpreadRank.
No preview for this file type
......@@ -138,11 +138,11 @@
borderlayer="false"
showborder="true"
objecttolerance="20"
inkscape:window-width="1855"
inkscape:window-width="1853"
inkscape:window-height="1156"
inkscape:window-x="65"
inkscape:window-x="67"
inkscape:window-y="0"
inkscape:window-maximized="1">
inkscape:window-maximized="0">
<inkscape:grid
type="xygrid"
id="grid3128"
......@@ -168,6 +168,17 @@
inkscape:groupmode="layer"
id="layer1"
transform="translate(0,-308.2677)">
<path
style="opacity:1;fill:none;stroke:#aaaaaa;stroke-width:32;stroke-linecap:round;stroke-linejoin:round;stroke-miterlimit:4;stroke-opacity:1;stroke-dasharray:none"
d="m 80,452.36218 0,40 280,0 0,200 280,0 0,80 280,0"
id="path3015"
inkscape:connector-curvature="0" />
<path
style="opacity:1;fill:none;stroke:#bfbfbf;stroke-width:6;stroke-linecap:butt;stroke-linejoin:miter;stroke-miterlimit:4;stroke-opacity:1;stroke-dasharray:none;stroke-dashoffset:0;marker-end:url(#Arrow2Mend)"
d="m 80,612.36218 270,0"
id="path6784"
inkscape:connector-curvature="0"
sodipodi:nodetypes="cc" />
<path
sodipodi:type="arc"
id="path3130"
......@@ -180,7 +191,7 @@
style="fill:#ffffff;fill-opacity:1;stroke:#000000;stroke-width:6;stroke-miterlimit:4;stroke-opacity:1;stroke-dasharray:none" />
<path
style="fill:none;stroke:#000000;stroke-width:4;stroke-linecap:butt;stroke-linejoin:miter;stroke-miterlimit:4;stroke-opacity:1;stroke-dasharray:none"
d="m 80,1012.3622 0,-560.00004 0,0 0,0"
d="m 80,832.36218 0,-380.00002 0,0 0,0"
id="path3136"
inkscape:connector-curvature="0"
sodipodi:nodetypes="cccc" />
......@@ -197,7 +208,7 @@
<path
inkscape:connector-curvature="0"
id="path4020"
d="m 360,1012.3622 0,-560.00002 0,0 0,0"
d="m 360,832.36218 0,-380 0,0 0,0"
style="fill:none;stroke:#000000;stroke-width:4;stroke-linecap:butt;stroke-linejoin:miter;stroke-miterlimit:4;stroke-opacity:1;stroke-dasharray:none"
sodipodi:nodetypes="cccc" />
<path
......@@ -213,7 +224,7 @@
<path
sodipodi:nodetypes="cccc"
style="fill:none;stroke:#000000;stroke-width:4;stroke-linecap:butt;stroke-linejoin:miter;stroke-miterlimit:4;stroke-opacity:1;stroke-dasharray:none"
d="m 640,1012.3622 0,-560.00002 0,0 0,0"
d="m 640,832.36218 0,-380 0,0 0,0"
id="path4024"
inkscape:connector-curvature="0" />
<path
......@@ -222,12 +233,6 @@
id="path4026"
inkscape:connector-curvature="0"
sodipodi:nodetypes="cc" />
<path
style="opacity:0.25;fill:none;stroke:#000000;stroke-width:6;stroke-linecap:butt;stroke-linejoin:miter;stroke-miterlimit:4;stroke-opacity:1;stroke-dasharray:none;stroke-dashoffset:0;marker-end:url(#Arrow2Mend)"
d="m 80,612.36218 270,0"
id="path6784"
inkscape:connector-curvature="0"
sodipodi:nodetypes="cc" />
<path
sodipodi:nodetypes="cc"
inkscape:connector-curvature="0"
......@@ -247,7 +252,7 @@
<path
inkscape:connector-curvature="0"
id="path6857"
d="m 920,1012.3622 0,-560.00002 0,0 0,0"
d="m 920,832.36218 0,-380 0,0 0,0"
style="fill:none;stroke:#000000;stroke-width:4;stroke-linecap:butt;stroke-linejoin:miter;stroke-miterlimit:4;stroke-opacity:1;stroke-dasharray:none"
sodipodi:nodetypes="cccc" />
<path
......@@ -256,10 +261,5 @@
id="path6859"
inkscape:connector-curvature="0"
sodipodi:nodetypes="cc" />
<path
style="opacity:0.33000003999999999;fill:none;stroke:#000000;stroke-width:32;stroke-linecap:round;stroke-linejoin:round;stroke-miterlimit:4;stroke-opacity:1;stroke-dasharray:none;"
d="m 80,452.36218 0,40 280,0 0,200 280,0 0,80 280,0"
id="path3015"
inkscape:connector-curvature="0" />
</g>
</svg>
......@@ -168,6 +168,23 @@
inkscape:groupmode="layer"
id="layer1"
transform="translate(0,-308.2677)">
<path
style="opacity:1;fill:none;stroke:#aaaaaa;stroke-width:32;stroke-linecap:round;stroke-linejoin:round;stroke-miterlimit:4;stroke-opacity:1;stroke-dasharray:none"
d="m 80,452.36218 0,40 280,0 0,200 280,0"
id="path3057"
inkscape:connector-curvature="0" />
<path
style="opacity:1;fill:none;stroke:#bfbfbf;stroke-width:6;stroke-linecap:butt;stroke-linejoin:miter;stroke-miterlimit:4;stroke-opacity:1;stroke-dasharray:none;stroke-dashoffset:0;marker-end:url(#Arrow2Mend)"
d="m 80,612.36218 270,0"
id="path6784"
inkscape:connector-curvature="0"
sodipodi:nodetypes="cc" />
<path
sodipodi:nodetypes="cc"
inkscape:connector-curvature="0"
id="path3055"
d="m 640,772.36218 270,0"
style="opacity:1;fill:none;stroke:#bfbfbf;stroke-width:6;stroke-linecap:butt;stroke-linejoin:miter;stroke-miterlimit:4;stroke-opacity:1;stroke-dasharray:none;stroke-dashoffset:0;marker-end:url(#Arrow2Mend)" />
<path
sodipodi:type="arc"
id="path3130"
......@@ -180,7 +197,7 @@
style="fill:#ffffff;fill-opacity:1;stroke:#000000;stroke-width:6;stroke-miterlimit:4;stroke-opacity:1;stroke-dasharray:none" />
<path
style="fill:none;stroke:#000000;stroke-width:4;stroke-linecap:butt;stroke-linejoin:miter;stroke-miterlimit:4;stroke-opacity:1;stroke-dasharray:none"
d="m 80,1012.3622 0,-560.00004 0,0 0,0"
d="m 80,832.36218 0,-380.00002 0,0 0,0"
id="path3136"
inkscape:connector-curvature="0"
sodipodi:nodetypes="cccc" />
......@@ -197,7 +214,7 @@
<path
inkscape:connector-curvature="0"
id="path4020"
d="m 360,1012.3622 0,-560.00002 0,0 0,0"
d="m 360,832.36218 0,-380 0,0 0,0"
style="fill:none;stroke:#000000;stroke-width:4;stroke-linecap:butt;stroke-linejoin:miter;stroke-miterlimit:4;stroke-opacity:1;stroke-dasharray:none"
sodipodi:nodetypes="cccc" />
<path
......@@ -213,7 +230,7 @@