Bad Behavior, a PHP tool for blocking spam and malicious bots, has recently caused me quite a headache.
It started as a simple issue. Our in-house RSS feed polling component could not pull a feed from one specific site, returning a “403 Bad Behavior” response. I’d never seen that reason phrase paired with the 403 status code before; it’s non-standard (see https://en.wikipedia.org/wiki/HTTP_403).
Fetching the RSS feed (http://blogs.pb.com/pbsoftware/feed/) from Chrome or Firefox worked just fine, which led me to suspect some kind of bug in our component triggered by the format of the source RSS feed. I validated the feed using the tool at http://www.validome.org/rss-atom/validate, and everything checked out as valid.
With the knowledge that the feed was not the cause of the problem, I decided to try emulating our custom polling agent through cURL and see if I could reproduce the issue we were experiencing:
curl -I -H "User-Agent: Mozilla/5.0 (Windows NT 6.2; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/32.0.1667.0 Safari/537.36" "http://blogs.pb.com/pbsoftware/feed/"
In the response headers I received:
HTTP/1.1 200 OK
Date: Wed, 25 Feb 2015 20:00:35 GMT
Server: Apache/2.4.7 (Ubuntu)
X-Powered-By: PHP/5.5.9-1ubuntu4.5
Set-Cookie: bb2_screener_=1424894435+54.92.202.5; expires=Thu, 01-Jan-1970 00:00:01 GMT; Max-Age=-1424894434; path=/pbsoftware/
Set-Cookie: wfvt_4241898385=54ee29e3d3b4b; expires=Wed, 25-Feb-2015 20:30:35 GMT; Max-Age=1800; path=/; httponly
X-Pingback: http://blogs.pb.com/pbsoftware/xmlrpc.php
Last-Modified: Tue, 24 Feb 2015 18:41:21 GMT
X-Robots-Tag: noindex,follow
Vary: User-Agent
Content-Type: text/html
I wasn’t getting a 403 response, even though the user agent I specified in my cURL command was the same one our component used when polling the feed. Our component also sent Accept-Encoding: gzip and Connection: Keep-Alive, so to get as close as possible to what our custom polling agent was explicitly doing, I re-ran the cURL command as:
curl -I -H "Accept-Encoding: gzip" -H "Connection: Keep-Alive" -H "User-Agent: Mozilla/5.0 (Windows NT 6.2; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/32.0.1667.0 Safari/537.36" "http://blogs.pb.com/pbsoftware/feed/"
Once again, I received a 200 response with no issue. At this point I was confused and figured that maybe our custom polling agent was being blocked on the server side by some sort of IP blacklist. To test this, I re-ran the above cURL command from the machine hosting the custom polling agent. Still a 200 response. With frustration setting in, I began thinking about the HTTP exchange as a whole and whether I was missing something, so I turned on cURL’s verbose option (-v) to see exactly what it was sending. That got me the answer:
curl -v -I -H "Accept-Encoding: gzip" -H "Connection: Keep-Alive" -H "User-Agent: Mozilla/5.0 (Windows NT 6.2; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/32.0.1667.0 Safari/537.36" "http://blogs.pb.com/pbsoftware/feed/"
* Hostname was NOT found in DNS cache
* Trying 166.78.238.221...
* Connected to blogs.pb.com (166.78.238.221) port 80 (#0)
> HEAD /pbsoftware/feed/ HTTP/1.1
> Host: blogs.pb.com
> Accept: */*
> Accept-Encoding: gzip
> Connection: Keep-Alive
> User-Agent: Mozilla/5.0 (Windows NT 6.2; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/32.0.1667.0 Safari/537.36
Note the Accept: */* header. cURL adds it implicitly, and it was the only difference between what cURL was sending in this test and what our custom polling agent was sending: our agent sent no Accept header at all. Omitting the header is perfectly valid under RFC 2616 (http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html), which says that a client sending no Accept header is assumed to accept all media types. But something on the source server was checking for this header and blocking traffic that did not explicitly include it. That something turned out to be the Bad Behavior component. We had never butted heads with this piece of software before, so finding out that it was blocking us from polling a site’s RSS feed based on its need to see an Accept header was very enlightening.
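To illustrate the rule we tripped over, the check can be approximated as a simple predicate over the request headers. This is a hypothetical Python sketch, not Bad Behavior's actual code; the real plugin applies many more rules than this one:

```python
def screen_request(headers):
    """Approximate the rule that bit us: reject any request
    that does not carry an Accept header at all.
    (Hypothetical sketch; the real Bad Behavior plugin does far more.)
    """
    # Header names are case-insensitive per RFC 2616, so normalize first.
    names = {name.lower() for name in headers}
    if "accept" not in names:
        return 403  # the "403 Bad Behavior" we were seeing
    return 200

# Our custom polling agent's original headers: no Accept header.
agent_headers = {
    "User-Agent": "Mozilla/5.0 ...",
    "Accept-Encoding": "gzip",
    "Connection": "Keep-Alive",
}
print(screen_request(agent_headers))  # 403

# cURL sends "Accept: */*" implicitly, so the same request passes.
curl_headers = dict(agent_headers, Accept="*/*")
print(screen_request(curl_headers))  # 200
```

The point of the sketch is how small the trigger is: identical requests in every other respect, and the mere presence or absence of one header flips the result.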
To solve the issue we were experiencing with this site, I modified our custom polling agent to always send an explicit Accept header: Accept: */*. With this change, I expect we will cleanly pick up more RSS feeds, since Bad Behavior appears to have a sizable installed base. Anything that helps us find more quality expert content is a big win.
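Our polling agent is its own codebase, but the equivalent fix in Python's standard library looks like this (a sketch; the URL and header values are the ones from this post):

```python
import urllib.request

def build_feed_request(url):
    """Build a feed request that always carries an explicit Accept
    header, so Bad Behavior-protected sites won't reject it."""
    return urllib.request.Request(url, headers={
        "User-Agent": "Mozilla/5.0 (Windows NT 6.2; Win64; x64) "
                      "AppleWebKit/537.36 (KHTML, like Gecko) "
                      "Chrome/32.0.1667.0 Safari/537.36",
        "Accept-Encoding": "gzip",
        "Accept": "*/*",  # the one-line fix
    })

req = build_feed_request("http://blogs.pb.com/pbsoftware/feed/")
print(req.get_header("Accept"))  # */*
```

Sending Accept: */* asserts exactly what omitting the header already implies under RFC 2616, so it changes nothing for well-behaved servers while satisfying the ones that insist on seeing it.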