Now that the toy LibraryHippo on Heroku is sending periodic e-mails, it's time to provide it with meaningful content to send, by having it scrape a library's website. This should be relatively straightforward, but there's some risk as it's not an operation covered in The Flask Mega-Tutorial.
The production application gathers information from libraries using a combination of App Engine's custom URL Fetch service, an older version of the Beautiful Soup HTML parser (which had to be copied into the application source), and some glue that I wrote. Today I'll try to replicate that using modern, commodity components.
I've heard that the Requests library is a feature-rich and popular library for sending HTTP requests. Knowing no more than that, I'll try it. Beautiful Soup has worked very well for me in the past, but has had a number of significant releases since the version I've used, so I'll try the latest, 4.8.2 as I write.
    pip install requests
    pip install BeautifulSoup4
    inv freeze
Library Credential Management
Library websites don't allow unfettered access to their patrons' records, thank goodness, so it'll be necessary to use at least one patron's credentials during the "scraping" test. I don't want to embed my credentials in the application source code, so they have to be stored somewhere else. Had I already integrated a permanent datastore and implemented user management, the credentials would be tied to the LibraryHippo user's account, but for now I'll read them from environment variables that I'll save in the secrets file established in Sending Email from Heroku, with the corresponding change to the application configuration.
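As a sketch of how the credentials might be read (the environment variable names match the Heroku settings shown later; the helper function is mine, not necessarily LibraryHippo's actual code):

```python
import os

def load_card_credentials():
    """Read the test patron's credentials from the environment.

    Returns a dict with the patron's name, card number ("code"), and PIN,
    matching the field names the library's login form expects.
    """
    return {
        "name": os.environ["PATRON_NAME"],
        "code": os.environ["CARD_NUMBER"],
        "pin": os.environ["PIN"],
    }
```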
When a patron wants to check their library card status at the Waterloo Public Library, they have to visit the login page and provide their credentials, after which they are taken to a summary page that basically just shows the number of checkouts and held items, with links to complete lists of their checkouts and holds. That's four pages visited, possibly some hidden redirects, plus one more if they manually log out. I planned to have the automated components follow the same path.
My initial thought was that LibraryHippo could log in directly, just by posting credentials to the website, but that failed miserably. The reason: hidden fields. The login page looks like it has 4 inputs, counting the "Submit" button as an input:
But there are actually 6: the four we see and two hidden ones. When I submit the form without them, the login fails. The `lt` field's value is different every time I visit the login page, so I assume the server is using it as a session identifier or some such. The only way to have a successful login is to harvest that value from the login page. So the card-checking flow must be:
- request the login page
- find the hidden form fields
- submit the login form, including the configured credentials as well as the hidden field values
- read the response and find the links to the checkouts and holds pages
- request the checkouts page and read the results
- request the holds page and read the results
- request the logout page, to logout
The first 3 steps are covered by the `login` method on my new class:
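The original listing isn't reproduced here, but a minimal sketch of those first three steps, assuming a placeholder login URL and a `get_form_fields` helper along the lines described below, might look like this:

```python
import requests
from bs4 import BeautifulSoup

# Placeholder; the real WPL login URL is hard-coded in the actual application
LOGIN_URL = "https://library.example.org/patroninfo"

def get_form_fields(soup):
    """Collect the hidden and "submit" input values so they can be posted back."""
    fields = {}
    for input_tag in soup.find_all("input"):
        if input_tag.get("type") in ("hidden", "submit"):
            fields[input_tag["name"]] = input_tag.get("value", "")
    return fields

class LibraryAccount:
    def __init__(self, name, code, pin):
        self.name, self.code, self.pin = name, code, pin

    def login(self, session):
        # Step 1: request the login page
        response = session.get(LOGIN_URL)
        # Step 2: find the hidden form fields
        soup = BeautifulSoup(response.text, "html.parser")
        fields = get_form_fields(soup)
        # Step 3: submit the form, credentials plus harvested hidden values
        fields.update(name=self.name, code=self.code, pin=self.pin)
        return session.post(LOGIN_URL, data=fields)

    def check_card(self):
        # The Session tracks cookies so later requests stay logged in
        with requests.Session() as session:
            summary_page = self.login(session)
            ...  # next: find and fetch the holds and checkouts pages
```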
Starting from the top, there are a few things to note:
- I import key classes from the Beautiful Soup package (`bs4`) and the Requests library
- The login URL is hard-coded here. You have to start somewhere.
- I'm passing the patron's card details to the `check_card` method, which is the principal entry point into this class
- `check_card` immediately creates a `requests.Session` object to communicate with the outside world. It's possible to call methods like `post` directly on the `requests` module, but the `Session` class tracks cookies, pools connections, and can persist parameters across requests. Using the `Session` class is the key to having the library website grant access on subsequent requests. Under Google App Engine, I had to write request decorators to handle the cookies and session management
- `login` first requests the login page as discussed, and parses it using Beautiful Soup before passing it to `get_form_fields` to find all the hidden and special (e.g. "submit") fields, ensuring the right values are posted. Note how easy the `find_all` method makes it to locate all the input fields.
- `login` then fills in the values specific to this card: patron name, card number ("code"), and PIN, before posting the result and returning it to `check_card` for further processing
Finding the checkouts and holds pages
Once a user logs into the WPL, they see a summary page that contains rather a lot of personal information that they probably don't care to see every day, as well as some links that would allow them to update their account and, most importantly, a link to their holds and one to their checkouts, right at the bottom of the box above the "Log Out" button.
It's those last two that we'll be going after. Unfortunately, the links aren't clearly marked with an ID or even a class:
The URLs vary from patron to patron, so we can't hard-code them. I'll cheat a little and look for links that end in "/holds" or "/items":
Beautiful Soup searches the summary page for `<a>` tags whose `href` attributes match the supplied regular expressions ("ends with /holds" or "ends with /items") and returns the results. Indexing a tag by `"href"` returns that attribute's value. Since the URLs are relative, I join them to the original login URL to get absolute ones.
Loading the Holds
The hold page repeats the same personal information from the summary page, and then lists all the patron's holds in a table.
And the HTML behind the table starts like this:
Each significant cell in the table has a
class indicator that can be used to
interpret the contents. Note that since the table is actually part of a form,
where the patron can choose to cancel, freeze, or change the pickup location of
an item, some
td elements contain input controls, slightly complicating the
parsing. Still, it's not that difficult to extract the information:
I load the page, find the table, and iterate over the entry rows, identified by their class, extracting values to shove into a hold object, which is just a dictionary. The default action is to store the contents of the `td`, but the "Mark" column is just used to cancel holds and conveys no information, so I drop it. The "Pickup" column always contains a number of selections, so I'm careful to grab the `option` element that is "selected". Finally, the "Freeze" column is effectively a boolean: if the `input` has a "checked" attribute, the hold is frozen.
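As a sketch of that parsing (the `patFuncEntry` row class is an assumption, inferred from the "patFuncTitle" and "patFuncHeaders" class names this site uses elsewhere; the column handling follows the description above):

```python
from bs4 import BeautifulSoup

def parse_holds(holds_html):
    """Extract each hold row from the holds table as a dictionary."""
    soup = BeautifulSoup(holds_html, "html.parser")
    table = soup.find("table")
    headers = [th.get_text(strip=True) for th in table.find_all("th")]
    holds = []
    for row in table.find_all("tr", class_="patFuncEntry"):  # class name assumed
        hold = {}
        for header, cell in zip(headers, row.find_all("td")):
            if header == "Mark":       # cancel checkbox -- conveys no information
                continue
            if header == "Pickup":     # grab the selected pickup location
                option = cell.find("option", selected=True)
                hold[header] = option.get_text(strip=True) if option else ""
            elif header == "Freeze":   # effectively a boolean
                checkbox = cell.find("input")
                hold[header] = checkbox is not None and checkbox.has_attr("checked")
            else:                      # default: store the cell's text
                hold[header] = cell.get_text(strip=True)
        holds.append(hold)
    return holds
```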
Back in `check_card`, I just loop over the holds, printing a `dl` for each one, listing the attributes. It's not pretty, and would be better as a Jinja template, but it's good enough for a proof of concept.
Loading the Checkouts
Parsing the checkouts was to have been the same as parsing the holds, so I was going to omit it, but that plan fell through when I found that the checkouts page's HTML is malformed in a way that defeated the parser: the `tr` tags aren't closed in the table:
As a result, Beautiful Soup saw only the "patFuncTitle" and "patFuncHeaders" rows. The workaround is to install the lxml XML and HTML parser and have Beautiful Soup use it:
    pip install lxml
    inv freeze
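With lxml installed, the only change is the parser name passed to the `BeautifulSoup` constructor. A contrived fragment of my own illustrates the difference:

```python
from bs4 import BeautifulSoup

# Unclosed <tr> tags, as on the checkouts page
html = """<table>
<tr class="patFuncTitle"><th>Checked-out items</th>
<tr class="patFuncHeaders"><th>Title</th>
<tr class="patFuncEntry"><td>Dune</td>
</table>"""

# Passing "lxml" as the second argument selects the lxml parser,
# which closes the dangling rows and recovers all three
soup = BeautifulSoup(html, "lxml")
rows = soup.find_all("tr")
```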
Deploying to Heroku
Deploying is straightforward. I use my fancy
inv deploy command and set the new secret environment variables:
    heroku config:set "PATRON_NAME=Blair Conrad"
    heroku config:set CARD_NUMBER=123456789
    heroku config:set PIN=9876
Four of nine requirements have been met!
|done||web app hosting|
|done||scheduled jobs (run in UTC)|
|done||scraping library websites on users' behalf|
|next||small persistent datastore|
| ||custom domain name|