\begin{landscape}
\chapter{Commands}

During this study,
SpreadRank was run on YARN, and results were written to text files.
This appendix shows the commands used to convert NetFlow to \gls{csv},
 start SpreadRank with these \gls{csv} files,
 and filter the results.

\newpage
\section{Convert NetFlow to CSV}
NetFlow can be converted to a \gls{csv} file, with non-UDP and non-TCP flows filtered out, as well as flows whose source and destination port numbers are both above 1024.
The conversion is done with FlowConvert.sh; to generate separate \gls{csv} files, FlowConvert.sh must be run once for each time frame that a \gls{csv} file should span.
This snippet of \gls{bash} code runs FlowConvert.sh 31 times, once for every day of the month.
The end result is a directory named \verb"trd_gw1_12_filtered.csv" containing 31 files.

\begin{verbatim}
	seq 1 9 | while read nr
	do
	    sh FlowConvert.sh -R trd_gw1/12/0$nr > trd_gw1_12_filtered.csv/part-0000$nr &
	done
	seq 10 31 | while read nr
	do
	    sh FlowConvert.sh -R trd_gw1/12/$nr > trd_gw1_12_filtered.csv/part-000$nr &
	done
\end{verbatim}
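The two loops above exist only to zero-pad single-digit day numbers. As a sketch, the same padding can be produced in a single loop with \verb"printf" (the directory layout is taken from the snippet above; FlowConvert.sh itself is assumed to be available):

\begin{verbatim}
	# printf pads the day number to two digits, so both
	# trd_gw1/12/01..09 and trd_gw1/12/10..31 fit in one loop.
	mkdir -p trd_gw1_12_filtered.csv
	for nr in $(seq 1 31)
	do
	    day=$(printf '%02d' "$nr")
	    sh FlowConvert.sh -R "trd_gw1/12/$day" \
	        > "trd_gw1_12_filtered.csv/part-000$day" &
	done
	wait    # let all background conversions finish
\end{verbatim}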

\newpage
\section{Execute SpreadRank}
In order to execute SpreadRank on YARN,
 the application must be submitted to the YARN manager,
 so that it can deploy the SpreadRank jar file to all workers.
The command takes as arguments the path of the jar file
 and the fully qualified class name of the Giraph runner.
The Giraph runner in turn takes the fully qualified class name of the SpreadRank computation class.
Additionally, formats for reading the graph at the start and for writing the graph at the end must be provided.
In this case, a directory is used to store the graph data; the directory contains chunks of the full graph.

\begin{verbatim}
	yarn \
	 jar giraph-rank-1.1.0-SNAPSHOT-for-hadoop-2.3.0-cdh5.0.1-jar-with-dependencies.jar \
	 org.apache.giraph.GiraphRunner no.uninett.yorn.giraph.computation.SpreadRank \
	 -eif no.uninett.yorn.giraph.format.io.NetflowCSVEdgeInputFormat \
	 -eip /user/hdfs/trd_gw1_12_filtered.csv \
	 -vof no.uninett.yorn.giraph.format.io.RankVertexOutputFormat \
	 -op /user/hdfs/rank-out/IPSpreadRank_gw1_12 \
	 -wc org.apache.giraph.worker.DefaultWorkerContext \
	 -w 16 \
	 -yj giraph-rank-1.1.0-SNAPSHOT-for-hadoop-2.3.0-cdh5.0.1-jar-with-dependencies.jar
\end{verbatim}


\newpage
\section{Filter results}
The raw results from SpreadRank, even though they are produced by a custom \verb"OutputFormat", contain a lot of information that is not relevant.
This simple \gls{bash} script will remove all vertices with a \gls{depth} of 1 or less, and all vertices with no spreading or no clients.
The program is built using these components:
\begin{itemize}[label={$\bullet$}]
\item \verb"cat": concatenate all parts of the output
\item \verb"grep": filter out all values equal to 0 or 1
\item \verb"sort": make sure that the highest spreading is returned first
\item \verb"sed": anonymize the results by removing IP addresses while keeping port numbers
\item \verb"less": easier reading of the results in a terminal
\end{itemize}
Components can be taken out, if desired, to change the results.
On one month's worth of NetFlow data, this command took about 10--15~seconds on UNINETT's server cluster.

\begin{verbatim}
cat rank-out/IPSpreadRank_gw1_12/part-m-* | egrep -vw '[01]' | sort -hrk 3 | sed 's/^.*://' | less
\end{verbatim}
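The anonymization relies on \verb"sed" deleting everything up to and including the last colon on each line. A small illustration (the record layout here, an IP:port identifier followed by tab-separated values, is an assumption made for the sake of the example):

\begin{verbatim}
	# Hypothetical record: "IP:port<tab>depth<tab>spreading"
	printf '192.0.2.17:443\t4\t1234\n' | sed 's/^.*://'
	# the greedy ^.*: strips the IP prefix; the port (443),
	# depth and spreading values remain
\end{verbatim}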

\newpage
\section{Aggregate}
This \gls{bash} script can be used to determine how often a given service port engages in spreading.
It outputs two numbers per line: the second is the port number, and the first is the number of services observed on that port.
The program is built using these components:
\begin{itemize}[label={$\bullet$}]
\item \verb"cat": concatenate all parts of the output
\item \verb"grep": filter out all values equal to 0 or 1
\item \verb"sed": remove IP addresses, otherwise the list would show all IP addresses with count 1
\item \verb"cut": remove all but the first value (port number)
\item \verb"sort": set the port numbers in order, as required by \verb"uniq"
\item \verb"uniq": aggregate the port numbers, and show count
\end{itemize}
On one month's worth of NetFlow data, this command took about 10--15~seconds on UNINETT's server cluster.

\begin{verbatim}
cat rank-out/IPSpreadRank_gw1_12/part-m-* | egrep -vw '[01]' | sed 's/^.*://' | cut -f1 | sort | uniq -c
\end{verbatim}
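The counting at the end of the pipeline is a plain \verb"sort" followed by \verb"uniq -c"; \verb"uniq" only merges adjacent duplicate lines, which is why the \verb"sort" is required. On made-up toy input:

\begin{verbatim}
	# two occurrences of port 443 and three of port 80 (toy data)
	printf '443\n80\n443\n80\n80\n' | sort | uniq -c
	# prints a count before each distinct port: 2 for 443, 3 for 80
\end{verbatim}

Note that the plain \verb"sort" orders the ports lexicographically rather than numerically, so \verb"443" precedes \verb"80" in the output.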

\end{landscape}