conclusion.tex 4.76 KB
Newer Older
yorn's avatar
yorn committed
1 2 3 4 5 6 7
\chapter{Conclusion}
\label{chp:conclusion}

This thesis has stated the current practices in traffic anomaly detection.
Current traffic anomaly detection is based on aggregating flows,
 or by identifying high-intensity traffic.
The thesis defines the concept of spreading,
yorn's avatar
yorn committed
8
 which describes the phenomenon of a host initiating the same kind of connections it receives,
yorn's avatar
yorn committed
9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
 where same-kind refers to usage of the same TCP/UDP ports.
It is argued that spreading is uncommon for most protocols on the internet today, which makes it an anomaly.

The usage of graph systems is proposed as a means to measure spreading.
Multiple graph systems are available today, this thesis uses the Giraph system.
Using Giraph, NetFlow information provided by UNINETT is converted to a graph,
 where spreading is calculated using SpreadRank, an algorithm introduced in this thesis.

SpreadRank works by making a graph of flow data.
In this graph, vertices represent IP address and port number pairs (\gls{service}s),
 and edges represent flows, and have the flow start time as value.
Every service is scored on its longest path towards another service (\gls{depth}),
 and on how far it spreads its traffic (\gls{spreading}).

An analysis of the results shows that only five percent of all \emph{\gls{service}s} participate in \emph{\gls{spreading}},
yorn's avatar
yorn committed
24 25
 and that for many hosts, their role can be determined by simply looking at their spreading.
A spreading of zero typically indicates a server, one typically indicates a client (its traffic reaches to the servers) and a spreading of two often indicates a combination of both, though it can also indicate a simple proxy server.
yorn's avatar
yorn committed
26 27

Some protocols have natural spreading, for example \gls{BGP}, \gls{DNS} and \gls{SMTP};
yorn's avatar
yorn committed
28 29 30
 implementations of these protocols behave often both as server and client.
\Gls{SSH} and \gls{HTTP} are popular protocols which do not have natural spreading;
 however, these protocols exhibit spreading nevertheless.
yorn's avatar
yorn committed
31 32 33
\Gls{HTTP} has a high \gls{spreading} due to it being a protocol that is used for multiple purposes.
An \gls{HTTP} \gls{service} may itself use \gls{HTTP} to connect to another \gls{service}.
\Gls{SSH} allows users to ``hop'' from one SSH server to another.
yorn's avatar
yorn committed
34

yorn's avatar
yorn committed
35 36 37

SpreadRank has been successful in identifying spreading, and in doing so can be successfully used to find DNS resolvers,
 BGP routers and mail servers on the network.
yorn's avatar
yorn committed
38
It has also been successful in finding hosts that participated in a botnet.
yorn's avatar
yorn committed
39 40 41 42
It does so based on NetFlow data from core routers, without sending data into the network itself.

\section{Future work}
\subsection{Automation}
yorn's avatar
yorn committed
43 44
The current implementation of SpreadRank requires the analyst to execute many manual steps.
A better \verb"EdgeInputFormat" would reduce the amount of conversions needed,
yorn's avatar
yorn committed
45
 and speed up the overall process.
yorn's avatar
yorn committed
46
The output could then be parsed to automatically find outliers.
yorn's avatar
yorn committed
47 48 49 50 51 52 53 54 55 56 57 58 59

\subsection{Test-data}
It was not known beforehand which attacks were present in the test-data.
The experiment should be repeated with test-data with known attacks.
The best time to do this, is most likely after a large worm outbreak.

\subsection{Real-time monitoring}
The experiment was conducted on static test data.
For a real-life application, real-time monitoring is required,
 as this makes it possible to set automatic alarms when something is amiss.
In order to work with real-time data, a sliding window is required,
 in which flows are added to the graph as they are observed,
and old flows are removed.
yorn's avatar
yorn committed
60
Giraph does not currently support this; however, systems that support this do exist, for example GraphX.
yorn's avatar
yorn committed
61 62 63 64 65 66 67 68 69 70

\subsection{IPv6}
The NetFlow data provided by UNINETT contains only flows between IPv4 hosts.
The SpreadRank algorithm may need some modifications to be able to be used on IPv6.
The two most important differences between IPv4 and IPv6 for SpreadRank are that IPv6 addresses are a lot longer and will therefore not fit in the current data types.
Additionally, another traffic pattern will be observed regarding home servers.
Where many home servers currently share their public IPv4 address with clients due to the use of technologies such as NAT,
 IPv6 makes it possible to run the server and client on different IPv6 addresses.

Additionally, the use of privacy extensions in IPv6 (a technology that lets clients randomise their IP address for anonymity) will prove to be both a challenge and an advantage for SpreadRank.
yorn's avatar
yorn committed
71
It is a challenge because it will lead to more vertices; in the current model, every observed IP address + port number pair is a vertex, which means that a new vertex is created every time a client switches IP address.
yorn's avatar
yorn committed
72 73
Randomised IP addresses are also an advantage, because it is not feasible to run a server on a randomised IP address.
This may make it easier to designate vertices as clients at a very early stage, which reduces computation time.