Cookies, Redirects, and Transcripts - Supercharging urlfetch
LibraryHippo's main function is fetching current library account status for patrons. Since I have no special relationship with any of the libraries involved, LibraryHippo web scrapes the libraries' web interfaces.
The library websites issue cookies and redirects, so I needed to do something to augment the URL Fetch Python API. I wrote a utility class that worked with the urllib2 interface, but that didn't allow me to set the `deadline` argument, and I wanted to increase its value to 10 seconds. I had resigned myself to wiring up a version that used urlfetch directly when I found Scott Hillman's URLOpener, which uses cookielib to follow redirects and handle any cookies met along the way.
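For reference, a bare urlfetch call looks like this - it takes the `deadline` directly, and out of the box it can follow redirects itself, but it carries no cookies from one request to the next. (The URL here is a made-up placeholder.)

```python
from google.appengine.api import urlfetch

# A plain fetch with the 10-second deadline I wanted. urlfetch can chase
# redirects on its own, but it keeps no cookies between calls.
response = urlfetch.fetch('http://www.library.example/patron/account',
                          deadline=10)
```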
URLOpener looked like it would work for me, with a few tweaks - it didn't support relative URLs in redirects, it didn't allow one to specify headers in requests, and it lacked one feature that I really wanted - a transcript.
Why a transcript?
The libraries don't provide a spec for their output, so I built the web scraper by trial and error, sometimes putting books on hold or taking them out just to get test data. Every once in a while something comes up that I haven't coded for and the application breaks. In these cases, I can't rely on the problem being reproducible, since the patron could've returned (or picked up) the item whose record was troublesome or some other library state might've changed. I need to know what the web site looked like when the problem occurred, and since the ultimate cause might be several pages back, I need a history.
I started adding a transcript feature to the URLOpener - recording every request and response including headers. As I worked, I worried about two things:
- the `fetch` logic was becoming convoluted, and
- the approach was inflexible - what if later I didn't want to follow redirects, or to keep a transcript?
Decorators to the rescue
I decided to separate each bit of functionality - following redirects, tracking cookies, and keeping a transcript - into its own decorator, to be applied as needed. First I teased out the code that followed redirects, with my change to allow relative URLs:
```python
import urlparse


class RedirectFollower():
    def __init__(self, fetcher):
        self.fetcher = fetcher

    def __call__(self, url, payload=None, method='GET', headers={},
                 allow_truncated=False, follow_redirects=False, deadline=None):
        while True:
            # Tell the underlying fetcher not to follow redirects itself,
            # so we can resolve each Location header ourselves
            response = self.fetcher(url, payload, method, headers,
                                    allow_truncated, False, deadline)
            new_url = response.headers.get('location')
            if new_url:
                # Join the URLs in case the new location is relative
                url = urlparse.urljoin(url, new_url)

                # Next request should be a GET, with no payload
                method = 'GET'
                payload = None
            else:
                break
        return response
```
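The `urlparse.urljoin` call is the tweak that makes relative redirects work: it resolves a relative location against the URL just fetched, and leaves absolute locations alone. A quick illustration, with made-up library URLs:

```python
import urlparse

# A relative Location header is resolved against the original URL...
urlparse.urljoin('http://www.library.example/patron/account', 'login')
# -> 'http://www.library.example/patron/login'

# ...while an absolute one passes through unchanged.
urlparse.urljoin('http://www.library.example/patron/account',
                 'http://auth.library.example/signin')
# -> 'http://auth.library.example/signin'
```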
After that, the cookie-handling code was easy to put in its own class:
```python
import Cookie


class CookieHandler():
    def __init__(self, fetcher):
        self.fetcher = fetcher
        self.cookie_jar = Cookie.SimpleCookie()

    def __call__(self, url, payload=None, method='GET', headers={},
                 allow_truncated=False, follow_redirects=True, deadline=None):
        # Copy the headers before adding the cookie, so we never mutate
        # a dict the caller passed in (or the shared default argument)
        headers = dict(headers)
        headers['Cookie'] = self._make_cookie_header()
        response = self.fetcher(url, payload, method, headers,
                                allow_truncated, follow_redirects, deadline)
        self.cookie_jar.load(response.headers.get('set-cookie', ''))
        return response

    def _make_cookie_header(self):
        cookieHeader = ""
        for value in self.cookie_jar.values():
            cookieHeader += "%s=%s; " % (value.key, value.value)
        return cookieHeader
```
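`Cookie.SimpleCookie` does the real work here: `load` parses a Set-Cookie header value, and each stored morsel carries the `key` and `value` that `_make_cookie_header` stitches back together. A tiny sketch, with a made-up session cookie:

```python
import Cookie

jar = Cookie.SimpleCookie()
jar.load('SESSION=abc123; Path=/')   # parse a Set-Cookie header value
print jar['SESSION'].value           # prints: abc123
```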
Now I had the `URLOpener` functionality back, just by creating an object like so:
```python
fetch = RedirectFollower(CookieHandler(urlfetch.fetch))
```
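The composed object is called just like `urlfetch.fetch`, so nothing else has to change (again, the URL is a placeholder):

```python
response = fetch('http://www.library.example/patron/account', deadline=10)
```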
Implementing transcripts
I still needed one more decorator - the transcriber.
```python
import datetime


class Transcriber():
    def __init__(self, fetcher):
        self.fetcher = fetcher
        self.transactions = []

    def __call__(self, url, payload=None, method='GET', headers={},
                 allow_truncated=False, follow_redirects=True, deadline=None):
        # vars() grabs the local arguments; _Request picks out the ones
        # worth recording
        self.transactions.append(Transcriber._Request(vars()))
        response = self.fetcher(url, payload, method, headers,
                                allow_truncated, follow_redirects, deadline)
        self.transactions.append(Transcriber._Response(response))
        return response

    class _Request:
        def __init__(self, values):
            self.values = dict((key, values[key])
                               for key in ('url', 'method', 'payload', 'headers'))
            self.values['time'] = datetime.datetime.now()

        def __str__(self):
            return '''Request at %(time)s:
url = %(url)s
method = %(method)s
payload = %(payload)s
headers = %(headers)s''' % self.values

    class _Response:
        def __init__(self, values):
            self.values = dict(status_code=values.status_code,
                               headers=values.headers,
                               content=values.content,
                               time=datetime.datetime.now())

        def __str__(self):
            return '''Response at %(time)s:
status_code = %(status_code)d
headers = %(headers)s
content = %(content)s''' % self.values
```
To record every transaction, all I have to do is wrap my fetcher one more time. When something goes wrong, I can examine the whole chain of calls and have a better shot at fixing the scraper.
```python
fetch = Transcriber(RedirectFollower(CookieHandler(urlfetch.fetch)))
response = fetch(patron_account_url)
try:
    process(response)
except:
    logging.error('error checking account for ' + patron, exc_info=True)
    for action in fetch.transactions:
        logging.debug(action)
```
Extra-fine logging without rewriting fetch
The exercise of transforming `URLOpener` into a series of decorators may seem like just that, an exercise that doesn't provide real value, but it yields a powerful debugging tool for the other decorators. By moving the `Transcriber` to the inside of the chain of decorators, you can see each fetch that's made due to a redirect, and which cookies are set when:
```python
fetch = RedirectFollower(CookieHandler(Transcriber(urlfetch.fetch)))
```
The only trick is that the `Transcriber.transactions` attribute isn't available from the outermost decorator. This is easily solved by extracting a base class and having it delegate attribute lookups to the wrapped fetcher.
```python
class _BaseWrapper:
    def __init__(self, fetcher):
        self.fetcher = fetcher

    def __getattr__(self, name):
        # __getattr__ is only consulted for attributes not found on the
        # wrapper itself, so lookups fall through to the wrapped fetcher
        # (and on down the chain) until something defines the name
        return getattr(self.fetcher, name)
```
Then the other decorators extend `_BaseWrapper`, either losing their `__init__` methods entirely or having them modified to call the base class's. For example, `CookieHandler` becomes:
```python
class CookieHandler(_BaseWrapper):
    def __init__(self, fetcher):
        _BaseWrapper.__init__(self, fetcher)
        self.cookie_jar = Cookie.SimpleCookie()

    ...
```
And then the following code works, and helped me diagnose a small bug I'd originally had in my `RedirectFollower`. As a bonus, if I ever need to get at `CookieHandler.cookie_jar`, it's right there too.
```python
fetch = RedirectFollower(CookieHandler(Transcriber(urlfetch.fetch)))
fetch(patron_account_url)
for action in fetch.transactions:
    logging.debug(action)
```
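For instance, continuing the snippet above: `cookie_jar` lives on the `CookieHandler`, but the delegating base class makes it reachable straight from `fetch`:

```python
# Reachable thanks to _BaseWrapper.__getattr__ delegation
for morsel in fetch.cookie_jar.values():
    logging.debug('cookie %s=%s', morsel.key, morsel.value)
```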