This little article tries to find the "real user" browsing hits on my Homepage. All logs are gathered in Splunk. In what follows, "crawlers" is used as a synonym for robots, crawlers and spiders - in other words, anything that is not a live person behind a browser.
A log sample:
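(The line below is made up, but representative of the Apache combined-log format; Splunk extracts the clientip, uri and useragent fields used in the searches that follow.)
192.0.2.10 - - [12/Mar/2019:10:15:32 +0100] "GET /display/public/SomePage HTTP/1.1" 200 5123 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36"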
Step - Filter page views
First, we need to filter out what actually is a "view", as opposed to REST calls, page saves, uploads and so on. For Confluence, all views match one of two URI forms:
(uri="*/display/*" OR uri="*/viewpage.action*")
All other URIs are not relevant.
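Put together with the index (assuming the access logs live in index=apache, as in the lookup-building search further down), the base search becomes:
index=apache (uri="*/display/*" OR uri="*/viewpage.action*")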
Step - Eliminate all "bots"
Looking into the log files, the User Agent string often carries a bot-like name, but nowadays many crawlers present themselves as a normal browser.
So, we try to eliminate them:
useragent!="*bot*" useragent!="*spider*" useragent!="*facebookexternalhit*" useragent!="*crawler*" useragent!="*Datadog Agent*"
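Filters like these never catch everything, so it is worth checking what remains. Splunk matches field values case-insensitively, so *bot* also catches Googlebot and friends. A quick sketch for eyeballing the surviving user agents:
index=apache (uri="*/display/*" OR uri="*/viewpage.action*") useragent!="*bot*" useragent!="*spider*" useragent!="*facebookexternalhit*" useragent!="*crawler*" useragent!="*Datadog Agent*" | stats count by useragent | sort -count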
Step - Eliminate Monitoring
Monitoring tools can fill the logs with a lot of noise. To keep their hits easy to identify, I have pointed the monitoring tool at a dedicated page: /display/public/HealthCheckPage.
Hence, to filter it out:
uri!="/display/public/HealthCheckPage"
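To gauge how much traffic the monitoring tool alone generates, a quick sketch using the same page:
index=apache uri="/display/public/HealthCheckPage" | stats count by clientip, useragent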
Step - Eliminate hosts that have looked at robots.txt
To remove hits from IP addresses that have requested robots.txt, I have created a lookup from a CSV file, generated with this search:
index=apache robots.txt clientip="*" | table clientip
Stored in the file robot_spiders.csv:
root@splunkserver:/splunk/etc/apps/moseisleymonitoring/lookups# head robot_spiders.csv
clientip
"216.244.66.237"
"77.75.76.163"
"77.75.77.62"
"216.244.66.237"
"77.75.78.162"
"216.244.66.237"
"77.75.76.165"
"37.9.113.190"
"106.120.173.75"
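The CSV above was saved by hand; a sketch of how the same list could be built and applied directly from Splunk (assuming the header is clientip and the search runs in the app that owns the lookups directory):
index=apache robots.txt clientip="*" | dedup clientip | table clientip | outputlookup robot_spiders.csv
The exclusion in the main search then becomes:
NOT [ | inputlookup robot_spiders.csv | fields clientip ]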
Step - Eliminate all "hard-hitting hosts"
As many crawlers do not identify themselves at all, a host that requests far more pages than any live person could is most likely a machine, and should be eliminated as well.
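A sketch for spotting such hosts (the one-day window and the threshold of 1000 page views are assumptions; tune both to the site's real traffic):
index=apache earliest=-24h (uri="*/display/*" OR uri="*/viewpage.action*") | stats count by clientip | where count > 1000 | sort -count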