Hit me Please!
Ok, so you got a web site and you want to know if anybody is looking at
it, and if so, what they are looking at and how many times. Lucky for
you, (most) every web server keeps a log of what it's doing, so you can
just go look and see. The logs are just plain ASCII text files, so any
text editor or viewer would work just fine. Each time someone (using a
web browser) asks for one of your web pages, or any component thereof
(known as URLs, or Uniform Resource Locators), the web server will
write a line to the end of the log representing that request.
Unfortunately, the raw logs are rather cryptic for everyday humans to
read. While you might be able to determine if anybody was
looking at your web site, any other information would require some sort
of processing to determine. A typical log entry might look something
like the following:
192.168.45.13 - - [24/May/2005:11:20:39 -0400] "GET /mypage.html HTTP/1.1" 200 117

This represents a request from a computer with the IP address
192.168.45.13 for the URL /mypage.html on the web server. It also
gives the time and date the request was made, the type of request, the
result code for that request and how many bytes were sent to the
remote browser. There will be a line similar to this one for each and
every request made to the web server over the period covered by the
log. A 'Hit' is another way to say 'request made to the server', so as
you may have noticed, each line in the log represents a 'Hit'. If you
want to know how many Hits your server received, just count the number
of lines in the log. And since each log line represents a request for
a specific URL, from a specific IP address, you can easily figure out
how many hits you got for each of your web pages or how many hits you
received from a particular IP address by just counting the lines in
the log that contain them. Yes, it really is that simple. And while
you could do this manually with a text editor or other simple text
processing tools, it is much more practical and easier to use a
program specifically designed to analyze the logs for you, such as the
Webalizer. Such programs take the work out of it for you, provide
totals for many other aspects of your server, and allow you to
visualize the data in a way not possible by just looking at the raw
logs.
For example, suppose a visitor requests a page on your site that
contains two images. The log might record the exchange like this:

192.168.45.13 - - [24/May/2005:11:20:39 -0400] "GET /mypage.html HTTP/1.1" 200 117
192.168.45.13 - - [24/May/2005:11:20:40 -0400] "GET /myimage1.jpg HTTP/1.1" 200 231
192.168.45.13 - - [24/May/2005:11:20:41 -0400] "GET /myimage2.jpg HTTP/1.1" 200 432

So what can we gather from this exchange? Well, based on what we
learned above, we can count the number of lines in the log file and
determine that the server received 3 hits during the period that this
log file covers. We can also calculate the number of hits each URL
received (in this case, 1 hit each). Along the same lines, we can see
that the server received 3 hits from the IP address 192.168.45.13, and
when those requests were received. The two numbers at the end of each
line represent the response code and the number of bytes sent back to
the requestor. The response code is how the web server indicates how
it handled the request, and the codes are defined as part of the HTTP
protocol. In this example, they are all 200, which means everything
went OK. One response code you may be very familiar with is the all
too common '404 - Not Found', which means that the requested URL could
not be found on the server. There are several other response codes
defined, however these two are the most common.
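If you wanted to do this sort of counting yourself rather than by
hand, a few lines of script would do it. The sketch below is only an
illustration of the idea, not anything the Webalizer actually does;
the file name 'access.log' and the regular expression are my own
assumptions. It reads CLF lines and tallies total hits, hits per URL,
hits per IP address and response codes:

# A minimal sketch showing how hits can be counted straight from a CLF
# log.  The file name and regex are assumptions for illustration only.
import re
from collections import Counter

# Fields in a CLF line: host ident authuser [date] "request" status bytes
clf = re.compile(r'^(\S+) \S+ \S+ \[([^\]]+)\] "(\S+) (\S+) [^"]*" (\d{3}) (\S+)')

hits = 0
per_url = Counter()
per_ip = Counter()
per_status = Counter()

with open('access.log') as log:
    for line in log:
        m = clf.match(line)
        if not m:
            continue                     # skip malformed lines
        ip, when, method, url, status, size = m.groups()
        hits += 1                        # every log line is one hit
        per_url[url] += 1                # hits per URL
        per_ip[ip] += 1                  # hits per IP address
        per_status[status] += 1          # response codes (200, 404, ...)

print("Total hits:", hits)
print("Hits per URL:", dict(per_url))
print("Hits per IP:", dict(per_ip))
print("Response codes:", dict(per_status))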
And that, in a nutshell, is about all you can accurately determine
from the logs. "But wait!" you might be screaming, "most analysis
programs have lots of other numbers displayed!", and you would be
right. Some more obscure numbers can be calculated, like the number
of different response codes, number of hits within a given time
period, total number of bytes sent to remote browsers, etc.. Other
numbers can be implied based on certain assumptions, however those
cannot be considered entirely accurate, and some can even be wildly
inaccurate. Other log formats might be used by a web server as well,
which provide additional information beyond what the CLF (Common Log
Format) entries shown above do, and those will be discussed shortly.
For now, just realize that the
only thing you can really, accurately determine is what IP address
requested which URL, and when it requested that URL, as shown in the
example above.
The Good, the Bad and the Ugly
So now you have a good grasp of how your web server works and what
information can be obtained from its logs, like number of hits (to
the server and to individual URLs), number of IP addresses making
the requests (and how many hits each IP address made), and when
those requests were made. Given just that information, you can
answer questions such as "What is the most popular URL on my site?",
"What was the next most popular URL?", "What IP address made the
most requests to my server?", and "How busy was my server during
this time period?". Most analysis programs will also make it easy
to answer such questions as "What time of day is my web server the
most active?", or "What day of the week is the busiest?". They
can give you an insight into usage patterns that may not be apparent
by just looking at the raw logs. All of these questions can be
answered with complete accuracy, based on nothing more than a simple
analysis of your web server logs. That's the good news!
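To make that concrete, here is a minimal sketch of the kind of
hour-of-day and day-of-week totals described above. It assumes CLF
input in a file called 'access.log'; this is just an illustration of
the idea, not how any particular analysis package is implemented:

# A rough sketch: group hits by hour of day and day of week, the kind
# of completely accurate totals the logs can support.  File name and
# regex are assumptions.
import re
from collections import Counter
from datetime import datetime

clf = re.compile(r'^\S+ \S+ \S+ \[([^\]]+)\]')

by_hour = Counter()
by_weekday = Counter()

with open('access.log') as log:
    for line in log:
        m = clf.match(line)
        if not m:
            continue
        # A CLF timestamp looks like 24/May/2005:11:20:39 -0400
        when = datetime.strptime(m.group(1), '%d/%b/%Y:%H:%M:%S %z')
        by_hour[when.hour] += 1
        by_weekday[when.strftime('%A')] += 1

print("Busiest hour of day:", by_hour.most_common(1))
print("Busiest day of week:", by_weekday.most_common(1))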
The bad news? Well, with all the things you can determine by
looking at your logs, there are a lot of things you can't accurately
calculate. Unfortunately, some analysis programs (particularly
commercial packages) lead you to believe otherwise, and forget to
mention that these are not much more than assumptions and cannot be
considered at all accurate. Like what? you ask.. well, how about those
things that
some programs call 'user trails' or 'paths', that are supposed to
tell you what pages and in what order a user travelled through your
site. Or how about the length of time a user spends on your site.
Another less than accurate metric would be that of 'visits', or how
many users 'visited' your site during a given time period. All of
these cannot be accurately calculated, for a couple of different
reasons.. let's look at some of them:
In a typical computer program that you run on your own machine, you can
always determine what the user is doing. They log in, do some stuff,
and when finished, they log out. The HTTP protocol however is different.
Your web server only sees requests from some remote IP address. The
remote address connects, sends a request, receives a response and then
disconnects. The web server has no idea what the remote side is doing
between these requests, or even what it did with the response sent to
it. This makes it impossible to determine things like how long a user
spends on your site. For example, if an IP address makes a request to
your server for your home page, then 15 minutes later makes a request
for some other page on your site, can you determine how long the user
had been at your site? The answer is of course No! Fifteen minutes may
have passed between the two requests, but you have no idea what the
remote address was doing during that time. They could have hit your
site, then immediately gone somewhere else on the web, only to come back
15 minutes later to request another page. Some analysis packages will
say that the user stayed on your site for at least 15 minutes plus some
'fudge' time for viewing the last page requested (like 5 minutes or so).
This is actually just a guess, and nothing more.
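To see just how much guesswork is involved, here is a sketch of that
'time on site' calculation. The 5-minute fudge value and the sample
timestamps come from the example above, but the whole approach is an
assumption; the number it prints is a guess, not a measurement:

# Sketch of the guess described above: gap between first and last
# request from one IP, plus an arbitrary 'fudge' for the final page.
from datetime import datetime, timedelta

requests = [                      # (ip, timestamp) pairs pulled from a log
    ('192.168.45.13', datetime(2005, 5, 24, 11, 20, 39)),
    ('192.168.45.13', datetime(2005, 5, 24, 11, 35, 39)),   # 15 minutes later
]

FUDGE = timedelta(minutes=5)      # arbitrary allowance for viewing the last page

first = min(t for _, t in requests)
last = max(t for _, t in requests)
guessed_duration = (last - first) + FUDGE

# Prints 0:20:00 -- but the user may have spent most of that time elsewhere.
print("Guessed time on site:", guessed_duration)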
Web servers see requests and send results to IP addresses only. There
is no way to determine what is at that address, only that some
request came from it. It could be a real person, it could be some
program running on a machine, or it could be lots of people all using
the same IP address (more on that below). Some of you will note that
the HTTP protocol does provide a mechanism for user authentication,
where a username and password are required to gain access to a web site
or individual pages. And while that is true, it isn't something that
a normal, public web site uses (otherwise it wouldn't be public!). As
an example, say that one IP address makes a request to your server, and
then a minute later, some other IP address makes a request. Can you
say how many people visited your site? Again, the answer is No!
One of those requests may have come from a search engine 'spider', a
program designed to scour the web looking for links and such. Both
requests could have been from the same user, but at different addresses.
Some analysis programs will try to determine the number of users based
on things like IP address plus browser type, but even so, these are
nothing more than guesses made on some rather faulty assumptions.
In the good old days, every machine that wanted to talk on the
internet had its own unique IP address. However, as the internet grew,
so did the demand for addresses. As a result, several methods of
connecting to the internet were developed to ease the addressing
problem.

Take, for example, a normal dial-up user sitting at home. They call
their service provider, the machines negotiate the connection, and an
IP address is assigned from a re-usable 'pool' of IP addresses that
have been assigned to the provider. Once the user disconnects, that IP
address is made available to other users dialing in. The home user
will typically get a different IP address each time they connect,
meaning that if for some reason they are disconnected, they will
re-connect and get a new IP address. Given this situation, a single
user can appear to be at many different IP addresses over a given
time.

Another typical situation is in a corporate environment, where all the
PCs in the organization use private IP addresses to talk on the
network, and they connect to the internet through a gateway or
firewall machine that translates their private address to the public
one the gateway/firewall uses. This can make all the users within the
organization appear as if they were all using the same IP address.
Proxy servers are similar, where there can be thousands of users, all
appearing to come from the same address. Then there are reverse-proxy
servers, typical of many large providers such as AOL, that can make a
single machine appear to use many different IP addresses while it is
connected (the reverse-proxy keeps track of the addresses and
translates them back to the user).

Given this situation, can you say how many users visited your site if
your logs show 10 requests from the same IP address over an hour?
Again, the answer is No! It could have been the same user, or it could
have been multiple users sitting behind a firewall. Or how about if
your logs show 10 requests from 10 different IP addresses? Think it
was from 10 different users? Of course not. It could have been 10
different users, could have been a couple of users sitting behind a
reverse proxy, could have been one or more users along with a search
engine 'spider', or it could be any combination of them all.
Ok, so what have we learned here? Well, in short, you don't know who
or what is making requests to your server, and you can't assume that
a single IP address is really a single user. Sure, you can make all
kinds of assumptions and guesses, but that is all they really are, and
you should not consider them at all accurate. Take the following
example: IP address A makes a request to your server, 1 minute later,
IP address B makes a request, and then 10 minutes later, address A
makes another request. What can we determine from that sequence?
Well, we can assume that two users visited. But what if address A
was that of a firewall? Those two requests from address A could have
been two different users. What if the user at address A got disconnected
and dialed back in, getting a different address (address B) and someone
else dialed in at the same time and got the now free address A? Or
maybe the user was sitting behind a reverse-proxy, and all three requests
were really from the same user. And can we tell what 'path' or 'trail'
these users took while at the web site or how long they remained?
Hopefully, you should now see that the answer to all these things is a
big, resounding "No, we can't!" Without being able to identify
individual unique users, there is no way to tell what an individual
unique user does.
All is not lost however. Over time, people have come up with ways
to get around these limitations. Systems have been written to get
around the stateless nature of the HTTP protocol. Cookies and other
unique identifiers have been used to track individuals, as have various
dynamic pages with back-end databases. However, these things are all,
for the most part, external to the protocol, not logged in a standard
web server log, and require specialized tools to analyze. In all other
cases, any programs that claim to analyze these types of metrics should
just be considered guesses based on certain assumptions. One such
example can be found within the Webalizer itself. The concept of a
'visit' is a metric that cannot be accurately reported, yet that is
one of the things that the Webalizer does show. It was added because
of the huge number of requests received from individuals using the
program. It is based on the assumption that a single IP address
represents a single user. You have already seen how this assumption
falls flat in the real world, and if you read through the documentation
provided with the program, you will see that it clearly says the 'visit'
numbers (along with 'entry' and 'exit' pages) are not to be considered
accurate, but more of a rough guess. We haven't touched on entry and
exit pages yet, but they are based on the concept of a 'visit', which
we have already seen isn't accurate. These are supposed to be the
first and last page a user sees while at the web site. If a request
comes in that is considered a new 'visit', then the URL of that request
would be, in theory, the 'Entry' page to the site. Likewise, the last
URL requested in a visit would be the 'Exit' page. Similar to user
'paths' or 'trails', and being based on the 'visit' concept, they are
to be treated with the same caution. One of the funniest metrics I
have seen in one particular analysis program was supposed to tell you
where the user was geographically, based on where the domain name of
the requesting remote address was registered. Clever idea, but completely
worthless. Take for example AOL, which is registered in Virginia.
The program considered all AOL users as living in Virginia, which we
know is not the case for a provider with access points all over the
globe.
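For illustration, here is roughly how a 'visit' and its 'Entry' and
'Exit' pages end up being guessed at, using the flawed
one-IP-equals-one-user assumption and an arbitrary timeout (30 minutes
here). This is only a sketch of the general idea, not the Webalizer's
actual code, and the sample requests are made up:

# Sketch of naive 'visit' detection: one IP is assumed to be one user,
# and a gap longer than TIMEOUT starts a new visit.
from datetime import datetime, timedelta

TIMEOUT = timedelta(minutes=30)   # arbitrary; a different timeout gives different 'visits'

# (ip, timestamp, url) tuples, as would be parsed out of the log
requests = [
    ('192.168.45.13', datetime(2005, 5, 24, 11, 20, 39), '/mypage.html'),
    ('192.168.45.13', datetime(2005, 5, 24, 11, 25, 10), '/other.html'),
    ('192.168.45.13', datetime(2005, 5, 24, 14, 2, 5),  '/mypage.html'),
]

visits = []        # each guessed visit: {'ip', 'entry', 'exit'}
last_seen = {}     # ip -> timestamp of that address's previous request
current = {}       # ip -> index of that address's open visit in 'visits'

for ip, when, url in sorted(requests, key=lambda r: r[1]):
    if ip not in last_seen or when - last_seen[ip] > TIMEOUT:
        current[ip] = len(visits)                       # start a new 'visit'
        visits.append({'ip': ip, 'entry': url, 'exit': url})
    else:
        visits[current[ip]]['exit'] = url               # extend the open 'visit'
    last_seen[ip] = when

# With this sample data the guess is 2 visits, even though it could just
# as easily have been one user, or several users behind the same firewall.
print(len(visits), "visits (by this guess):", visits)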
Other metrics you CAN determine
Now that you have seen what is possible, you may be thinking that there
are some other things these programs display, and wondering about how
accurate they might be. Hopefully, based on what you have already seen
thus far, you should be able to figure them out on your own. One such
metric is that of a 'page' or 'page view'. As we already know, a web
page is made up of an HTML text document and usually other elements
such as graphic images, audio or other multimedia objects, style sheets,
etc.. One request for a web page might generate dozens of requests for
these other elements, but a lot of people just want to know how many
web pages were requested without counting all the stuff that makes them up.
You can get this number if you decide what types of files you consider
a 'page'. On a normal server, these would be just the URLs that end with
a .htm or .html extension. Perhaps you have a dynamic site, and your web
pages use an .asp, .pl or .php extension instead. You obviously would
not want to count .gif or .jpg images as pages, nor would you want to
count style sheets, Flash graphics and other elements. You could go
through the logs and just count up the requests for whatever URLs meet
your criteria for a 'page', but most analysis programs (including the
Webalizer) allow you to specify what you consider a page and will
count them up for you.
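As a rough sketch of that counting, the snippet below tallies 'pages'
from a CLF log using a user-supplied list of extensions. The extension
list and the file name are assumptions; they have to match your own
site or the totals will be wrong:

# Sketch of page-view counting under a user-chosen definition of 'page'.
import re
from collections import Counter

PAGE_EXTENSIONS = ('.htm', '.html', '.php')   # whatever *you* consider a page

clf = re.compile(r'^\S+ \S+ \S+ \[[^\]]+\] "\S+ (\S+) [^"]*"')

pages = Counter()
with open('access.log') as log:
    for line in log:
        m = clf.match(line)
        if not m:
            continue
        url = m.group(1).split('?', 1)[0]     # strip any query string
        if url.lower().endswith(PAGE_EXTENSIONS):
            pages[url] += 1                   # a page view; images etc. are skipped

print("Total page views:", sum(pages.values()))
print("Per-page totals:", dict(pages))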
Other information
Up to now, we have just discussed the CLF (Common Log Format) log
format. There are others. The most common is called 'combined',
and takes the basic CLF format and adds two new pieces of information.
Tacked on the end are the 'user agent' and the 'referrer'. A user agent
is just the name of the browser or program being used to generate the
request to the web server. The 'referrer' is supposed to be the page
that referred the user to your web server. Unfortunately, both
of these can be completely misleading. The user agent string can be
set to anything in some modern browsers. One common trick for Opera
users is to set their user agent string to that of MS Internet Explorer
so they can view sites that only allow MSIE visitors. And the referrer
string, according to the standards document (RFC) for the HTTP protocol,
may or may not be used at the browser's choosing, and if used, does not
have to be accurate or even informative. The Apache web server (which
is the most common on the internet) allows other things to be logged,
such as cookie information, length of time to handle the request and
lots of other stuff. Unfortunately, the inclusion and placement of
this information in the server logs are not standard. Another format,
developed by the W3C (World Wide Web Consortium), allows log records
to be made up of many different pieces of information, and their location
can be anywhere in the log entry with a header record needed to map them.
Some analysis programs handle these and other formats better than others.
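To show what those two extra fields look like in practice, here is a
small sketch that parses a 'combined' format line and pulls out the
referrer and user agent. The sample line is made up for illustration;
remember that both values are supplied by the client and may be
missing ("-") or simply untrue:

# Sketch of extracting the two extra 'combined' fields from a log line.
import re

combined = re.compile(
    r'^(\S+) \S+ \S+ \[([^\]]+)\] "([^"]*)" (\d{3}) (\S+) "([^"]*)" "([^"]*)"'
)

line = ('192.168.45.13 - - [24/May/2005:11:20:39 -0400] '
        '"GET /mypage.html HTTP/1.1" 200 117 '
        '"http://www.example.com/links.html" '
        '"Mozilla/5.0 (X11; Linux i686)"')

m = combined.match(line)
if m:
    ip, when, request, status, size, referrer, agent = m.groups()
    print("Referrer:  ", referrer)    # page that (supposedly) linked here
    print("User agent:", agent)       # browser name the client chose to send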
Analysis techniques
The only true way to get an accurate picture of what your web server is
doing is to look at its logs. This is how most of the analysis packages
out there get their information, and is the most accurate. Other methods
can be used, with different results. One common method, which was widely
popular for a while, was the use of a 'page counter'. Basically, it was
a dynamic bit included in a web page that incremented a counter and
displayed its value each time the page was requested. Normally, it was
included in the page as if it were a standard image file. One problem
with this method was that you had to include a different 'image' file
for each page you wanted to track. Another problem occurred if the remote
user had image display turned off in their browser, or could not display
images at all (such as in a text based web browser). You could also
easily inflate the number by just hitting the 'reload' button on your
browser over and over again. Similar methods were developed using java
and javascript, in an attempt to get even more information about the
visiting browser, such as screen resolution and operating system type.
Of course, these can easily be circumvented as well. Some companies
set up systems that claim to track your server usage remotely, by
including an image or javascript element on your site which would then
contact the company's system each time the image or javascript element
was requested. These all have the same problems and limitations. In
all of these, you can simply turn off images and java/javascript and
then browse the web site completely uncounted and unseen (except in
the web server logs). Beware of these types of counters and remote
usage sites, they are not quite as accurate as they may lead you to
believe.
Conclusion
It should now be obvious that there are only certain things you can
determine from a web server log. There are some completely accurate
numbers you can generate without question. And then, there are some
wildly inaccurate and misleading numbers you can garner depending on
what assumptions you make. Want to know how many requests generated
a 404 (not found) result? Go right ahead and count them up and be
completely confident with the number you get. Want to know how many
'users' visited your web site? Good luck with that one.. unless you
go 'outside the logs', it will be a hit or miss stab in the dark. But
now you should have a good idea of what is and isn't possible, so
when you look at your usage report, you will be able to determine
what the numbers mean and how much to trust them. You should also
now see that a lot can depend on how the program is configured, and
that the wrong configuration can lead to wrong results. Take the
example of 'pages'.. if your analysis software thinks that only URLs
with a .htm or .html extension are pages, and all you have are .php
pages on your site, that number will be completely wrong. Not
because the program is wrong, but because someone told it the wrong
information to base its calculations on. Remember, knowledge is
power, so now you have the power to ask the proper questions and
get the proper results. The next time you look at a server analysis
report, hopefully you will see it in a different light given your
new-found knowledge.
Copyright (C)2002-2012 by Bradford L. Barrett -
Last modified: 20-Apr-2012