I've always considered statistical analysis a dry subject, that is until I had to make sense of my syslog. I wasn't too keen when I first started out on the task because of the sheer numbers involved, several hundred thousand lines of logged messages to be precise. Since there's more than one way to interpret numbers I decided that the best way to organize the information was to process it through a database and generate graphs for a number of different relationships.
This article shows how to data mine a Unix based server's syslogs using PostgreSQL's implementation of R (pl/r).
Here's what you need to know to walk through the steps of gathering data and generating relevant statistics:
These are the rights and privileges that you'll need:
The objective is to collect, collate and analyze the firewall activity from one machine using graphs. This server is assumed to be using a Linux based distribution (Debian) and the default shell is bash. The firewall is iptables. The R data analysis and graphics language is version 1.8.1 (build 2003-11-21). The PostgreSQL server is version 7.4RC2 using the PL/R procedural language binding for R (version plr-0.5.4-alpha).
R is the fast food of graphical development packages. It is a fantastically powerful interpretive language. It's been designed for finding patterns among large gobs of numbers, that's what statistical analysis is all about. There's a lot of overlap between what Postgres can do with numbers and that of R.
The R language can be installed using Debian's advanced package tool using the command:
apt-get install r-base-core r-base-html r-recommended
You can download the source code from the Debian site using the
following command line instruction:
apt-get source postgresql
The PostgreSQL source code is also available as a tar ball and deb
and rpm packages. It can be downloaded from http://postgres.org.
Compiling the pl/r module requires the Postgres source code. Further instructions are available at http://www.joeconway.com/plr/. Compile the pl/r module and then install it into the Postgres library directory, which is normally /usr/local/pgsql/lib. You need to set the environment variable R_HOME (refer to the plr documentation).
The psql client was used to invoke these SQL commands to define the firewall table:
CREATE DATABASE firewall;
CREATE TABLE master (hit text);
CREATE TABLE firewall (
d_stamp time stamp,
The pl/pgsql language is now installed by using:
createlang -d firewall -U postgres plpgsql
The pl/r language is installed using these SQL commands:
[ed: See also contrib/plr/plr.sql to define plr]
CREATE FUNCTION plr_call_handler()
AS '$libdir/plr' LANGUAGE C;
CREATE LANGUAGE plr HANDLER plr_call_handler;
Verify the correct installation of the language:
$~ createlang -l -d firewall -U postgres;
The Postgres server startup script is edited to include the path to the R package using the environment variable R_HOME. For example, this is what I use on my machines in the postgresql server script located in /etc/init.d/:
A final test is made by running the following function:
Open a terminal console and invoke the R language by typing 'R'. You should get a '>' prompt. Now type the following:
> x <- 1:10
> y <- 1+x+rnorm(10,0,1)
You've just created two vectors. There are 10 variables 'x' vector numbering from 1 to 10. The y variable has a set of ten vectors where each variable is the solution of the equation '1+x+rnorm(10,0,1)' and rnorm is a statistical function.
Typing this next line will plot x vs y on a graph:
Graphs can be as easy as '123'. Here's an example where two columns in a table are plotted against each other.
Create and populate the table using the following commands:
The function f_graph() generates the graph as a pdf document:
#/sbin/iptables -A INPUT -p all -s 0/0 -j LOG
I used zcat, grep and sed to process the message logs prior to uploading the data into the database.
Getting the desired data from the syslog into the database is a two stage process: the first step is to upload the desired data in its raw form to the table master. The second step is to export only the desired fields into the table firewall.
This screen shot shows an example of commands that parse and then uploads the data into the database. The objective here was to filter out what I didn't want, in my case only those messages that came from the firewall and didn't include the ip address 18.104.22.168.
Here is a typical record from the master table:
Feb 6 07:40:27 SRC=22.214.171.124 DST=126.96.36.199 LEN=78 TOS=0x00 PREC=0x00 TTL=105 ID=52727 PROTO=UDP SPT=1027 DPT=137 LEN=58
Each line is a complete log that needs to be broken down further into individual fields before it can be copied into the table firewall. A time stamp will be generated for each record.
Two user-defined functions are used in this script:
The invocation that copies the desired information from master
to firewall is:
Graph 1: Number of hits by protocol.
Here's the src.
Graph 2: Top 30 ip address Offenders.
Here's the src.
Graph 3: Accumulated Number of hits in a 24 hour period.
Here's the src.
Remember that PL/R is young and is undergoing constant development.
Try avoiding running complex SQL queries from a plr based function. Instead, run them in a plpgsql function.
PL/R is an untrusted language so it can do anything on the server with the same permissions of the PostgreSQL process. Beware! This includes such dangerous actions as erasing the database cluster.
Joe Conway has done a great job porting R to the PostgreSQL environment. Frankly, I don't think I've done justice in showing the neat ways you can create graphs with PL/R. Read the R documentation and you'll fall in love with this language like I did.