Commit 3dad55b6 authored by yorn

first commit

# Gitignore from
# Observed files
# Output files
% Document class for project reports and master's theses
% Author: Laurent Paquereau - Department of Telematics, NTNU
\ProvidesClass{MScthesisITEM} [2012/11/10 v1.0 LaTeX2e ITEM MSc thesis document class]
% Base class
% Fixes
% Colors
% Text and encoding
% Layout
% Hyphenation, kerning
% Spacing
% Paragraphs
\clubpenalty = 5000
\widowpenalty = 5000
% Divisors
% Float positioning
% Lists
% Columns
% Tables
% Links
% Graphics
%% In case you need to have subfigures; note that there will be a warning for the caption package
% \let\subcaption\undefined
% \let\subfloat\undefined
% \RequirePackage[bf]{caption}
% \RequirePackage{subcaption}
% \setlength{\abovecaptionskip}{5.5pt}
% \setlength{\belowcaptionskip}{4.2pt}
% Mathematics
\newtheoremstyle{atheorem}% % Name
{}% % Space above
{}% % Space below
{\itshape}% % Body font
{}% % Indent amount
{\bfseries}% % Theorem head font
{}% % Punctuation after theorem head
{ }% % Space after theorem head, ' ', or \newline
{\thmname{#1}\thmnumber{ #2. }\thmnote{ #3}}% % Theorem head spec (can be left empty, meaning `normal')
\newtheoremstyle{adefinition}% % Name
{}% % Space above
{}% % Space below
{}% % Body font
{}% % Indent amount
{\bfseries}% % Theorem head font
{}% % Punctuation after theorem head
{ }% % Space after theorem head, ' ', or \newline
{\thmname{#1}\thmnumber{ #2. }\thmnote{ #3}}% % Theorem head spec (can be left empty, meaning `normal')
\newtheorem{finalremark}[theorem]{Final Remark}
\leavevmode\unskip\penalty9999 \hbox{}\nobreak\hfill
% Code
% Page style
\makeevenhead{ruled}{{\small\thepage \ \ \ \leftmark}}{}{}
\makeoddhead{ruled}{}{}{{\small\rightmark} \ \ \ \thepage}
\createmark{chapter}{left}{shownumber}{}{. }
\createmark{section}{right}{shownumber}{}{. }
% Pages with chapter headings
% Chapter style - based on hansen style,
% helper macros
\parbox{\textwidth}{\chaptitlefont \strut bg\\bg\strut}}
\chaptitlefont\strut ##1\strut}%
% Abstract and acknowledgments
% Table of contents
\setcounter{tocdepth}{2} % also show subsections in the table of contents
% Bibliography
\setlength{\bibitemsep}{-\parsep - 0.5em}
% Glossary; undefine some commands from memoir.cls to avoid generating warnings
\renewcommand\@pnumwidth{2em} % adjust page number column width
% Language
% Title page and project description
\newdateformat{dateITEM}{\monthname[\THEMONTH] \THEYEAR}
\newcommand{\newlinetitle}{\par\noindent }
Submission date: & \dateITEM\@date \\
Responsible professor: & \@professor \\
Supervisor: & \@supervisor \\
{\noindent Norwegian University of Science and Technology\\Department of Telematics}
# The main latex-file
TEXFILE = main
# Fix reference file and compile source
default: full

full:
	pdflatex $(TEXFILE); \
	bibtex $(TEXFILE); \
	makeglossaries $(TEXFILE); \
	pdflatex $(TEXFILE); \
	pdflatex $(TEXFILE)
# Removes TeX-output files
clean:
	rm -f *.aux $(TEXFILE).bbl $(TEXFILE).blg *.log *.out $(TEXFILE).toc $(TEXFILE).lot $(TEXFILE).lof $(TEXFILE).glg $(TEXFILE).glo $(TEXFILE).gls $(TEXFILE).ist $(TEXFILE).acn $(TEXFILE).acr $(TEXFILE).alg $(TEXFILE).xdy $(TEXFILE).loa
\noindent Anomaly detection in internet traffic is currently largely based on quantifying traffic data.
This thesis proposes a new algorithm, SpreadRank,
which detects spreading of traffic on the internet.
Spreading can be used as an additional metric for traffic anomaly detection, and can be observed from large scale traffic logs from core routers.
Studying spreading is a useful tool in determining the role of an end-host and identifying malicious behaviour.
SpreadRank uses large scale graph processing to calculate spreading from multiple gigabytes of traffic data.
During this study,
SpreadRank was run on YARN, and results were written to text files.
This appendix shows the commands used to convert NetFlow to \gls{csv},
start SpreadRank with these \gls{csv} files,
and filter the results.
\section{Convert NetFlow to CSV}
NetFlow can be converted to a \gls{csv} file, with non-UDP and non-TCP flows filtered out, as well as flows between port numbers that are both over 1024.
This conversion is done using FlowConvert; to generate separate \gls{csv} files, it must be run once for each timeframe that a \gls{csv} file should span.
The following snippet of \gls{bash} code runs the conversion 31 times, once for every day of the month.
The end result will be a directory named \verb"trd_gw1_12_filtered.csv" with 31 files.
seq 1 9 | while read nr
do
    sh -R trd_gw1/12/0$nr | cat > trd_gw1_12_filtered.csv/part-0000$nr &
done
seq 10 31 | while read nr
do
    sh -R trd_gw1/12/$nr | cat > trd_gw1_12_filtered.csv/part-000$nr &
done
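The filtering rule described at the start of this section can be sketched in Python as follows.
This is only an illustration under stated assumptions, not the FlowConvert implementation: the function \verb"keep_flow" and its parameters are hypothetical names.
The protocol numbers 6 (TCP) and 17 (UDP) are the values assigned by the IANA.

```python
# Illustrative sketch; keep_flow and its parameters are hypothetical names
# and are not part of FlowConvert.
TCP = 6   # IANA protocol number for TCP
UDP = 17  # IANA protocol number for UDP


def keep_flow(protocol: int, src_port: int, dst_port: int) -> bool:
    """Return True if a flow survives the filter described in the text."""
    # Drop every flow that is neither TCP nor UDP.
    if protocol not in (TCP, UDP):
        return False
    # Drop flows between port numbers that are both over 1024.
    if src_port > 1024 and dst_port > 1024:
        return False
    return True
```

A flow from an ephemeral port to port 80 is kept, whereas a flow between two ports above 1024 is dropped.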
\section{Execute SpreadRank}
In order to execute SpreadRank on YARN,
YARN must be started with the SpreadRank jarfile,
and the fully qualified class name of the computation class.
Additionally, a format for reading the graph at the start, and writing the graph at the end must be provided.
In this case, we use a directory to store the graph data.
The directory contains chunks of the full graph.
yarn \
jar giraph-rank-1.1.0-SNAPSHOT-for-hadoop-2.3.0-cdh5.0.1-jar-with-dependencies.jar \
org.apache.giraph.GiraphRunner no.uninett.yorn.giraph.computation.SpreadRank \
-eif \
-eip /user/hdfs/trd_gw1_12_filtered.csv \
-vof \
-op /user/hdfs/rank-out/IPSpreadRank_gw1_12 \
-wc org.apache.giraph.worker.DefaultWorkerContext \
-w 16 \
-yj giraph-rank-1.1.0-SNAPSHOT-for-hadoop-2.3.0-cdh5.0.1-jar-with-dependencies.jar
This thesis has described current practice in traffic anomaly detection.
Current traffic anomaly detection is based on aggregating flows,
or on identifying high-intensity traffic.
The thesis defines the concept of spreading,
which describes the phenomenon of an end-host initiating the same kind of connections it receives,
where same-kind refers to usage of the same TCP/UDP ports.
It is argued that spreading is uncommon for most protocols on the internet today, which makes it an anomaly.
The usage of graph systems is proposed as a means to measure spreading.
Multiple graph systems are available today; this thesis uses the Giraph system.
Using Giraph, NetFlow information provided by UNINETT is converted to a graph,
where spreading is calculated using SpreadRank, an algorithm introduced in this thesis.
SpreadRank works by making a graph of flow data.
In this graph, vertices represent IP address and port number pairs (\gls{service}s),
and edges represent flows, and have the flow start time as value.
Every service is scored on its longest path towards another service (\gls{depth}),
and on how far it spreads its traffic (\gls{spreading}).
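The graph model described above can be sketched in Python as follows.
This is a minimal sketch, not the Giraph implementation used in this thesis: the \verb"Flow" record, its field names and the function \verb"build_graph" are hypothetical, and the direction of an edge (from the source of a flow to its destination) is an assumption.
The scoring of \gls{depth} and \gls{spreading} is not shown.

```python
from typing import NamedTuple


class Flow(NamedTuple):
    # Hypothetical flow record; the field names are assumptions.
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int
    start_time: float


def build_graph(flows):
    """Map every service (IP, port) to a list of (neighbour, start_time) edges."""
    graph = {}
    for flow in flows:
        src = (flow.src_ip, flow.src_port)
        dst = (flow.dst_ip, flow.dst_port)
        # Assumption: the edge points from the flow's source to its destination
        # and carries the flow start time as its value.
        graph.setdefault(src, []).append((dst, flow.start_time))
        # The destination is a vertex as well, even if it has no outgoing edges.
        graph.setdefault(dst, [])
    return graph
```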
An analysis of the results shows that only five percent of all \emph{\gls{service}s} participate in \emph{\gls{spreading}},
and that for most end-hosts their role can be determined by simply looking at their spreading.
A spreading of zero typically indicates a server, one typically indicates a client (its traffic reaches to the servers) and a spreading of two often indicates a hybrid, but it can also indicate a simple proxy server.
Some protocols have natural spreading, for example \gls{BGP}, \gls{DNS} and \gls{SMTP};
implementations of these protocols are often both server and client.
\Gls{SSH} and \gls{HTTP} are popular protocols which do not have natural spreading,
but do exhibit spreading nevertheless.
\Gls{HTTP} has a high \gls{spreading} due to it being a protocol that is used for multiple purposes.
An \gls{HTTP} \gls{service} may itself use \gls{HTTP} to connect to another \gls{service}.
\Gls{SSH} allows users to ``hop'' from one SSH server to another.
Additionally, many home servers will listen on these ports.
SpreadRank has been successful in identifying spreading, and can therefore be used to find DNS resolvers,
BGP routers and mail servers on the network.
It does so based on NetFlow data from core routers, without sending data into the network itself.
\section{Future work}
The current implementation of SpreadRank requires the analyst to perform many manual steps.
A better EdgeInputReader would reduce the number of conversions needed,
and speed up the overall process.
The output could be parsed to automatically find outliers.
It was not known beforehand which attacks were present in the test-data.
The experiment should be repeated with test-data with known attacks.
The best time to do this is most likely after a large worm outbreak.
\subsection{Real-time monitoring}
The experiment was conducted on static test data.
For a real-life application, real-time monitoring is required,
as this makes it possible to set automatic alarms when something is amiss.
In order to work with real-time data, a sliding window is required,
in which flows are added to the graph as they are observed,
and old flows are removed.
Giraph does not currently support this, but systems that support this do exist, for example GraphX.
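The sliding window idea can be sketched in Python as follows.
This is only an illustration of the bookkeeping involved and uses neither Giraph nor GraphX: the class \verb"SlidingWindow" and its methods are hypothetical, and it assumes that flows arrive ordered by start time.

```python
from collections import deque


class SlidingWindow:
    """Keep only the flows whose start time lies within the last `width` seconds."""

    def __init__(self, width: float):
        self.width = width
        self.flows = deque()  # (start_time, flow) pairs, oldest first

    def add(self, start_time: float, flow) -> list:
        """Add a flow and return the flows that fell out of the window."""
        self.flows.append((start_time, flow))
        expired = []
        # Assumption: flows arrive in order of start time, so the oldest
        # flow is always at the left end of the deque.
        while self.flows and self.flows[0][0] <= start_time - self.width:
            expired.append(self.flows.popleft()[1])
        return expired
```

The flows returned by \verb"add" are the ones whose edges would have to be removed from the graph, while the newly added flow would become a new edge.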
The NetFlow data provided by UNINETT contains only flows between IPv4 hosts.
The SpreadRank algorithm may need some modifications to be able to be used on IPv6.
The two most important differences between IPv4 and IPv6 for SpreadRank are the address length and the traffic pattern of home servers.
IPv6 addresses are a lot longer and will therefore not fit in the current data types.
Additionally, a different traffic pattern will be observed for home servers.
Where many home servers currently share their public IPv4 address with clients due to the use of technologies such as NAT,
IPv6 makes it possible to run the server and client on different IPv6 addresses.
Additionally, the use of privacy extensions in IPv6 (a technology that lets clients randomise their IP address for anonymity) will prove to be both a challenge and an advantage for SpreadRank.
It is a challenge because it will lead to more vertices; in the current model every observed IP address + port number is a vertex, which means that a new vertex is created every time a client switches IP address.
Randomised IP addresses are also an advantage, because it is not feasible to run a server on a randomised IP address.
This may make it easier to designate vertices as clients at a very early stage, which reduces computation time.
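The difference in address length can be illustrated with a short Python sketch using the standard \verb"ipaddress" module.
The function names are hypothetical, and storing an IPv6 address as two 64-bit halves is only one possible representation, given here as an assumption and not as the data type SpreadRank would use.

```python
import ipaddress


def address_bits(address: str) -> int:
    """Return the width in bits of an IPv4 or IPv6 address."""
    return ipaddress.ip_address(address).max_prefixlen


def split_ipv6(address: str) -> tuple:
    """Split an IPv6 address into two unsigned 64-bit halves (high, low)."""
    value = int(ipaddress.IPv6Address(address))
    # Assumption: two 64-bit values are one possible way to store the address.
    return value >> 64, value & 0xFFFFFFFFFFFFFFFF
```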
\section{SpreadRank results}
\caption{Vertices scored with SpreadRank using one day of flows}
\caption{Vertices scored with SpreadRank using one week of flows}
\caption{Vertices scored with SpreadRank using two weeks of flows}
\caption{Vertices scored with SpreadRank using one month of flows}
To test Giraph, two simple algorithms, DOSRank and ReverseDOSRank, were created; they count the number of outgoing and incoming edges of a vertex, respectively.
This proof of concept counts the number of flows per \gls{service}.
It can be used to detect if a machine is executing or suffering from a DoS attack,
as these are typically identified by a large number of flows.
DOSRank (equation~\ref{eq:dosrank}, algorithm~\ref{alg:dosrank}), which counts outgoing edges (incoming connections), is trivial: it sets its value during the first superstep and votes to halt.
ReverseDOSRank (equation~\ref{eq:reversedosrank}, algorithm~\ref{alg:reversedosrank}) is a bit more complex;
since incoming edges are not visible in Giraph, the first superstep is used to send a message over every edge.
During the second superstep, every vertex receives a number of messages equal to its number of incoming edges.
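\begin{equation}
\label{eq:dosrank}
f(x) = N^{-}(x).
\end{equation}
\begin{equation}
\label{eq:reversedosrank}
f(y) = N^{+}(y).
\end{equation}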
result = vertex.getNumEdges();
if superstep = 0 then
result = 0;
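The two supersteps described above can be simulated in plain Python as follows.
This is a sketch under stated assumptions, not the Giraph code used in this thesis: the graph is a hypothetical dictionary that maps every vertex to the list of vertices its outgoing edges point to, and every vertex is assumed to occur as a key.

```python
def dos_rank(out_edges):
    """Superstep 0: every vertex takes its number of outgoing edges, then halts."""
    return {vertex: len(targets) for vertex, targets in out_edges.items()}


def reverse_dos_rank(out_edges):
    """Two supersteps: send one message per edge, then count received messages."""
    # Superstep 0: the result is 0 and one message is sent over every edge.
    # Assumption: every edge target is itself a key of out_edges.
    result = {vertex: 0 for vertex in out_edges}
    inbox = {vertex: [] for vertex in out_edges}
    for vertex, targets in out_edges.items():
        for target in targets:
            inbox[target].append(vertex)
    # Superstep 1: the number of received messages equals the number of
    # incoming edges.
    for vertex, messages in inbox.items():
        result[vertex] = len(messages)
    return result
```

For a graph with edges from \verb"a" to \verb"b" and \verb"c", and from \verb"b" to \verb"c", the first function counts two outgoing edges for \verb"a", while the second counts two incoming edges for \verb"c".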
\section{Expected results}
The algorithm will find IP addresses that are associated with large numbers of flows.
Opening a large number of flows is different from sending large amounts of data,
as sending a large amount of data is often a completely legitimate thing to do,
for example when sharing files via peer-to-peer protocols or serving large files over HTTP.
A large amount of flows, however, may be an indication that something is amiss.
It can indicate port scanning, or even targeted attacks like SYN flooding.
However, a large amount of flows can also simply indicate a very active or popular host.
These algorithms are just a proof-of-concept to show that graph systems can be used for traffic analysis.
The algorithm itself is most likely easier to implement using other systems.
A simple MapReduce algorithm could probably yield the same results in less time~\cite{Morken352472}.
However, this algorithm does show that Giraph can be used for traffic analysis,
and it opens the door for more complex algorithms, like SpreadRank.
%description={A dummy symbol }}
\newglossaryentry{depth}{type=main,name={depth},description={The longest path between end-hosts that copy each other's behaviour}}
\newglossaryentry{spreading}{type=main,name={spreading},description={Score for an end-host indicating how far traffic sent from said host spreads}}
\newglossaryentry{natural spreading}{type=main,name={natural spreading},description={Spreading is the intended behaviour of the protocol}}
\newglossaryentry{service}{type=main,name={service},description={An IP address and port number pair}}
\newglossaryentry{client}{type=main,name={client},description={An entity on the network that initiates connections}}
\newglossaryentry{server}{type=main,name={server},description={An entity on the network that answers connections}}
\newglossaryentry{end-host}{type=main,name={end-host},description={An entity on the network, indicated by an IP address}}
\newglossaryentry{nfdump}{type=main,name={nfdump},description={Open source utility to convert NetFlow to text}}
\newglossaryentry{Giraph}{type=main,name={Giraph},description={Open source graph processing system}}
\newglossaryentry{graph}{type=main,name={graph},description={Mathematical structure consisting of vertices and edges}}
\newglossaryentry{3rd layer}{type=main,name={3rd layer},description={3rd layer in the OSI model; on this layer, the internet is a network where end-hosts are interconnected via routers}}
\newglossaryentry{4th layer}{type=main,name={4th layer},description={4th layer in the OSI model; on this layer, the internet is a network where end-hosts can communicate directly with each other}}
\newglossaryentry{denial of service}{type=main,name={denial of service},description={A method for making a service unavailable to its users}}
\newglossaryentry{NetFlow}{type=main,name={NetFlow},description={Proprietary data format from Cisco describing traffic flows}}
\newglossaryentry{ephemeral port}{type=main,name={ephemeral port},description={Short-lived automatically allocated temporary port}}
\newglossaryentry{contact port}{type=main,name={contact port},description={Permanent port used to contact a service}}
\newglossaryentry{well-known port}{type=main,name={well-known port},description={Permanent port that has been standardised by the IANA for a specific service}}
\newacronym{ntnu}{NTNU}{the Norwegian University of Science and Technology}
\newacronym{csv}{CSV}{Comma-Separated Values}
\newacronym{bash}{Bash}{Bourne Again SHell}