This little article tries to find the "real user" browsing hits on my homepage. All logs are gathered in Splunk. In the following, "crawlers" is a synonym for robots, crawlers and spiders, i.e. non-human clients.

...

Step - Filter page views

First, we need to filter down to what actually is a "view", excluding REST calls, saving pages, uploads, etc. For Confluence, all views take one of two forms:

...
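
As a minimal sketch, a search that keeps only the page views could look along these lines (the host name moserver and the URI patterns are taken from the report further down):

Code block
index=apache host=moserver (uri="*/display/*" OR uri="*/viewpage.action/*")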

Looking into the log files, inspect the User Agent string: crawlers often have a bot-like name, but nowadays many of them act like a normal browser and are not identifiable via the User Agent.

So, we try to eliminate them:

...
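
A hedged sketch of such an elimination; the useragent field name depends on your field extraction, and the keyword list is only a sample:

Code block
index=apache host=moserver (uri="*/display/*" OR uri="*/viewpage.action/*") | regex useragent!="(?i)(bot|crawler|spider)"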

Step - Eliminate Monitoring

Monitoring tools can fill up the logs. To identify and control these hits, I have ensured the monitoring tool only requests a special URL: /display/public/HealthCheckPage.

...
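
Excluding that URL from the main search is then a single condition; a sketch:

Code block
index=apache host=moserver (uri="*/display/*" OR uri="*/viewpage.action/*") uri!="/display/public/HealthCheckPage"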

To remove hits from IP addresses that have looked at robots.txt, I have created a lookup to a CSV file.

So a scheduled report runs hourly:

Code block
index=apache robots.txt clientip="*" | table clientip

...
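
The hourly report would typically maintain the CSV via outputlookup, merging new client IPs into the existing lookup; a sketch, assuming the lookup file is named robots_clients.csv (the file name is my own, not from the setup above):

Code block
index=apache robots.txt clientip="*" | table clientip | inputlookup append=true robots_clients.csv | dedup clientip | outputlookup robots_clients.csv

The main search can then exclude those clients with a subsearch:

Code block
index=apache host=moserver (uri="*/display/*" OR uri="*/viewpage.action/*") NOT [| inputlookup robots_clients.csv | fields clientip ]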

As many crawlers use browser-like User Agents and act like real browsers, I see a large number of hits from them in my logs. So I work from the assumption that more than 100 hits on the same URI within 30 days means it is not a person using a browser.

So a scheduled report runs daily:

Code block
index=apache AND host=moserver AND (uri="*/display/*" OR uri="*/viewpage.action/*") | stats count by uri clientip | where count>100

...
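
Putting the pieces together, a rough end-to-end sketch of the "real user" search, run over the relevant time range, with the more-than-100-hits rule applied inline via eventstats (the useragent field and the robots_clients.csv lookup are assumptions carried over from the sketches above):

Code block
index=apache host=moserver (uri="*/display/*" OR uri="*/viewpage.action/*") uri!="/display/public/HealthCheckPage" NOT [| inputlookup robots_clients.csv | fields clientip ] | regex useragent!="(?i)(bot|crawler|spider)" | eventstats count as hits by uri clientip | where hits<=100 | stats count as real_user_views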