This little article tries to identify the "real user" browsing hits on my homepage. All logs are gathered in Splunk. In the following, "crawlers" is used as a synonym for robots, crawlers and spiders - in other words, non-human clients.
Step - Filter page views
First, we need to filter out what actually is a "view", as opposed to REST calls, saving pages, uploading attachments etc. For Confluence, all views match one of two URI patterns:
(uri="*/display/*" OR uri="*/viewpage.action/*")
All other URIs are not relevant...
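To get a feel for how much this pattern filters away, a quick sanity check can classify every request as "view" or "other". This is a sketch on top of my setup (index=apache, host=moserver); the eval/match classification is my addition, not part of the search used later:

index=apache host=moserver
| eval type=if(match(uri, "/display/") OR match(uri, "viewpage\.action"), "view", "other")
| stats count by type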
Step - Eliminate all "bots"
Looking into the log files, check the User Agent string: crawlers often have a bot-like name, but nowadays many of them present themselves as a normal browser.
So, we try to eliminate them:
useragent!="*bot*" useragent!="*spider*" useragent!="*facebookexternalhit*" useragent!="*crawler*" useragent!="*Datadog Agent*"
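After applying the User Agent filters, it is worth checking what survives; listing the most frequent remaining User Agents often reveals crawlers the patterns missed. A sketch combining the filters above with my index and host names:

index=apache host=moserver (uri="*/display/*" OR uri="*/viewpage.action/*") useragent!="*bot*" useragent!="*spider*" useragent!="*facebookexternalhit*" useragent!="*crawler*" useragent!="*Datadog Agent*"
| top limit=20 useragent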
Step - Eliminate Monitoring
Monitoring tools can fill the logs with a lot of noise. To keep these hits identifiable, I have made sure the monitoring tool polls /display/public/HealthCheckPage.
Hence, to filter it out:
uri!="/display/public/HealthCheckPage"
Step - Eliminate hosts that have looked at robots.txt
To remove hits from IP addresses that have requested robots.txt, I have created a lookup from a csv file, populated with this search:
index=apache robots.txt clientip="*" | table clientip
Stored in the file robot_spiders.csv:
root@splunkserver:/splunk/etc/apps/moseisleymonitoring/lookups# head robot_spiders.csv
clientip
"216.244.66.237"
"77.75.76.163"
"77.75.77.62"
"216.244.66.237"
"77.75.78.162"
"216.244.66.237"
"77.75.76.165"
"37.9.113.190"
"106.120.173.75"
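Instead of copying the result out to the file by hand, it can be refreshed directly from a (scheduled) search with outputlookup. A sketch; the dedup is my addition to keep the file free of duplicate IPs:

index=apache robots.txt clientip="*"
| dedup clientip
| table clientip
| outputlookup robot_spiders.csv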
Step - Eliminate all "hard hitting hosts"
As many crawlers use browser-like User Agents and behave like real browsers, the logs still show a large number of hits from them. So I work from the assumption that more than 100 hits on the same URI within 30 days means it is not a person using a browser:
index=apache AND host=moserver AND (uri="*/display/*" OR uri="*/viewpage.action/*") | stats count by uri clientip | where count>100
Stored in the file hard_hitting_hosts.csv
root@splunkserver:/splunk/etc/apps/moseisleymonitoring/lookups# head hard_hitting_hosts.csv
uri,clientip,count
"/display/ATLASSIAN/JIRA+as+CMDB/","188.163.74.19",125
"/display/ATLASSIAN/JIRA+as+CMDB/","37.115.189.113",138
"/display/ATLASSIAN/JIRA+as+CMDB/","37.115.191.27",121
"/display/ATLASSIAN/JIRA+as+CMDB/","46.118.159.224",101
"/display/public/HealthCheckPage","77.243.52.139",5732
"/display/slangereden/","5.9.155.37",118
"/display/slangereden/","66.249.64.19",140
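This file too can be regenerated from Splunk itself with outputlookup. A sketch of the same search; earliest=-30d is my addition to pin it to the 30-day window mentioned above:

index=apache AND host=moserver AND (uri="*/display/*" OR uri="*/viewpage.action/*") earliest=-30d
| stats count by uri clientip
| where count>100
| outputlookup hard_hitting_hosts.csv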
To sum up - Conclusion
The best search to get the "human" hits:
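With all the filters in place, the remaining events can be charted, for example as human page views per day. A sketch on top of the final search; index and host are from my setup, and timechart replaces the plain event list:

index=apache host=moserver (uri="*/display/*" OR uri="*/viewpage.action/*") uri!="/display/public/HealthCheckPage" useragent!="*bot*" useragent!="*spider*" useragent!="*facebookexternalhit*" useragent!="*crawler*" useragent!="*Datadog Agent*" NOT [| inputlookup robot_spiders.csv | fields clientip] NOT [| inputlookup hard_hitting_hosts.csv | fields clientip]
| timechart span=1d count AS human_views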
(uri="*/display/*" OR uri="*/viewpage.action/*") uri!="/display/public/HealthCheckPage" useragent!="*bot*" useragent!="*spider*" useragent!="*facebookexternalhit*" useragent!="*crawler*" useragent!="*Datadog Agent*" NOT [| inputlookup robot_spiders.csv | fields clientip] NOT [| inputlookup hard_hitting_hosts.csv | fields clientip]