conclusion.tex 4.76 KB
 yorn committed Jul 03, 2014 1 2 3 4 5 6 7 \chapter{Conclusion} \label{chp:conclusion} This thesis has stated the current practices in traffic anomaly detection. Current traffic anomaly detection is based on aggregating flows, or by identifying high-intensity traffic. The thesis defines the concept of spreading,  yorn committed Jul 04, 2014 8  which describes the phenomenon of a host initiating the same kind of connections it receives,  yorn committed Jul 03, 2014 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23  where same-kind refers to usage of the same TCP/UDP ports. It is argued that spreading is uncommon for most protocols on the internet today, which makes it an anomaly. The usage of graph systems is proposed as a means to measure spreading. Multiple graph systems are available today, this thesis uses the Giraph system. Using Giraph, NetFlow information provided by UNINETT is converted to a graph, where spreading is calculated using SpreadRank, an algorithm introduced in this thesis. SpreadRank works by making a graph of flow data. In this graph, vertices represent IP address and port number pairs (\gls{service}s), and edges represent flows, and have the flow start time as value. Every service is scored on its longest path towards another service (\gls{depth}), and on how far it spreads its traffic (\gls{spreading}). An analysis of the results shows that only five percent of all \emph{\gls{service}s} participate in \emph{\gls{spreading}},  yorn committed Jul 04, 2014 24 25  and that for many hosts, their role can be determined by simply looking at their spreading. A spreading of zero typically indicates a server, one typically indicates a client (its traffic reaches to the servers) and a spreading of two often indicates a combination of both, though it can also indicate a simple proxy server.  yorn committed Jul 03, 2014 26 27  Some protocols have natural spreading, for example \gls{BGP}, \gls{DNS} and \gls{SMTP};  yorn committed Jul 04, 2014 28 29 30  implementations of these protocols behave often both as server and client. \Gls{SSH} and \gls{HTTP} are popular protocols which do not have natural spreading; however, these protocols exhibit spreading nevertheless.  yorn committed Jul 03, 2014 31 32 33 \Gls{HTTP} has a high \gls{spreading} due to it being a protocol that is used for multiple purposes. An \gls{HTTP} \gls{service} may itself use \gls{HTTP} to connect to another \gls{service}. \Gls{SSH} allows users to hop'' from one SSH server to another.  yorn committed Jul 04, 2014 34   yorn committed Jul 03, 2014 35 36 37  SpreadRank has been successful in identifying spreading, and in doing so can be successfully used to find DNS resolvers, BGP routers and mail servers on the network.  yorn committed Jul 04, 2014 38 It has also been successful in finding hosts that participated in a botnet.  yorn committed Jul 03, 2014 39 40 41 42 It does so based on NetFlow data from core routers, without sending data into the network itself. \section{Future work} \subsection{Automation}  yorn committed Jul 04, 2014 43 44 The current implementation of SpreadRank requires the analyst to execute many manual steps. A better \verb"EdgeInputFormat" would reduce the amount of conversions needed,  yorn committed Jul 03, 2014 45  and speed up the overall process.  yorn committed Jul 04, 2014 46 The output could then be parsed to automatically find outliers.  yorn committed Jul 03, 2014 47 48 49 50 51 52 53 54 55 56 57 58 59  \subsection{Test-data} It was not known beforehand which attacks were present in the test-data. The experiment should be repeated with test-data with known attacks. The best time to do this, is most likely after a large worm outbreak. \subsection{Real-time monitoring} The experiment was conducted on static test data. For a real-life application, real-time monitoring is required, as this makes it possible to set automatic alarms when something is amiss. In order to work with real-time data, a sliding window is required, in which flows are added to the graph as they are observed, and old flows are removed.  yorn committed Jul 04, 2014 60 Giraph does not currently support this; however, systems that support this do exist, for example GraphX.  yorn committed Jul 03, 2014 61 62 63 64 65 66 67 68 69 70  \subsection{IPv6} The NetFlow data provided by UNINETT contains only flows between IPv4 hosts. The SpreadRank algorithm may need some modifications to be able to be used on IPv6. The two most important differences between IPv4 and IPv6 for SpreadRank are that IPv6 addresses are a lot longer and will therefore not fit in the current data types. Additionally, another traffic pattern will be observed regarding home servers. Where many home servers currently share their public IPv4 address with clients due to the use of technologies such as NAT, IPv6 makes it possible to run the server and client on different IPv6 addresses. Additionally, the use of privacy extensions in IPv6 (a technology that lets clients randomise their IP address for anonymity) will prove to be both a challenge and an advantage for SpreadRank.  yorn committed Jul 04, 2014 71 It is a challenge because it will lead to more vertices; in the current model, every observed IP address + port number pair is a vertex, which means that a new vertex is created every time a client switches IP address.  yorn committed Jul 03, 2014 72 73 Randomised IP addresses are also an advantage, because it is not feasible to run a server on a randomised IP address. This may make it easier to designate vertices as clients at a very early stage, which reduces computation time.