LibraryHippo 2020 - Scraping Library Websites
Now that the toy LibraryHippo on Heroku is sending periodic e-mails, it's time to provide it with meaningful content to send, by having it scrape a library's website. This should be relatively straightforward, but there's some risk as it's not an operation covered in The Flask Mega-Tutorial.
The production application gathers information from libraries using a combination of App Engine's custom URL Fetch service, an older version of the Beautiful Soup HTML parser (which had to be copied into the application source), and some glue that I wrote. Today I'll try to replicate that using modern, commodity components.
Gathering requirements
I've heard that the Requests library is a feature-rich and popular library for sending HTTP requests. Knowing no more than that, I'll try it. Beautiful Soup has worked very well for me in the past, but has had a number of significant releases since the version I've used, so I'll try the latest, 4.8.2 as I write.
```
pip install requests
pip install BeautifulSoup4
inv freeze
```
Library Credential Management
Library websites don't allow unfettered access to their patrons' records, thank goodness, so it'll be necessary to use at least one patron's credentials during the "scraping" test. I don't want to embed my credentials in the application source code, so they have to be stored somewhere else. Had I already integrated a permanent datastore and implemented user management, the credentials would be tied to the LibraryHippo user's account, but for now I'll read them from environment variables that I'll save in the secrets file established in Sending Email from Heroku, with the corresponding change to the `Config` class.
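The `Config` change is just a few more environment-variable reads. Here's a sketch, assuming the plain-class `Config` pattern from the Flask Mega-Tutorial; the attribute names match the Heroku config vars set later in this post:

```python
# config.py - a sketch; the attribute names mirror the PATRON_NAME,
# CARD_NUMBER, and PIN environment variables set on Heroku below
import os


class Config:
    PATRON_NAME = os.environ.get("PATRON_NAME")
    CARD_NUMBER = os.environ.get("CARD_NUMBER")
    PIN = os.environ.get("PIN")
```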
Implementation
When a patron wants to check their library card status at the Waterloo Public Library, they have to visit the login page and provide their credentials, after which they are taken to a summary page that basically just shows the number of checkouts and held items, with links to complete lists of their checkouts and holds. That's 4 pages visited, possibly some hidden redirects, plus one more if they manually log out. I planned to have the automated components follow the same path.
Login
My initial thought was that LibraryHippo could log in directly, just by posting credentials to the website, but that failed miserably. The reason: hidden fields. The login page looks like it has 4 inputs, counting the "Submit" button as an input:
But there are actually 6: the four we see and two hidden ones, called `lt` and `_eventId`:
When I submit the form without them, the login fails. The `lt` field's value is different every time I visit the login page, so I assume the server is using it as a session identifier or some such. The only way to have a successful login is to harvest that value from the login page. So the card-checking flow must be:
- request the login page
- find the hidden form fields
- submit the login form, including the configured credentials as well as the hidden field values
- read the response and find the links to the checkouts and holds pages
- request the checkouts page and read the results
- request the holds page and read the results
- request the logout page, to log out
The first 3 steps are covered by the `login` method on my new `WPL` class:
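The listing boils down to something like this sketch; the login URL and the form field names ("name", "code", "pin") are stand-ins, not necessarily what the WPL site actually uses:

```python
# A sketch of the WPL class. The login URL and form field names are assumptions.
import requests
from bs4 import BeautifulSoup


class WPL:
    LOGIN_URL = "https://example-library.example/login"  # placeholder; hard-coded for now

    def check_card(self, patron, number, pin):
        # Principal entry point. The Session tracks the cookies that keep the
        # library website willing to talk to us on subsequent requests.
        session = requests.Session()
        summary_response = self.login(session, patron, number, pin)
        # ... find the holds and checkouts links and scrape each page ...
        return summary_response

    def login(self, session, patron, number, pin):
        # Fetch the login page and harvest every input, including the hidden
        # lt and _eventId fields, so their server-generated values come back
        login_page = BeautifulSoup(session.get(self.LOGIN_URL).text, "html.parser")
        form_fields = self.get_form_fields(login_page)

        # Fill in the values specific to this card and post the form
        form_fields.update({"name": patron, "code": number, "pin": pin})
        return session.post(self.LOGIN_URL, data=form_fields)

    def get_form_fields(self, login_page):
        # find_all makes it easy to locate every input field on the page
        return {
            field["name"]: field.get("value", "")
            for field in login_page.find_all("input")
            if field.get("name")
        }
```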
Starting from the top, there are a few things to note:
- I import key classes from Beautiful Soup (`bs4`) and the Requests library
- The login URL is hard-coded here. You have to start somewhere.
- I'm passing `patron`, `number`, and `pin` into the `check_card` method, which is the principal entry point into this class
- `check_card` immediately creates a `requests.Session` object to communicate with the outside world. It's possible to call methods like `get` and `post` directly on the `requests` module, but the `Session` class provides session management: it tracks cookies, pools connections, and can persist parameters across requests. Use of the `Session` class is the key to having the library website grant access on subsequent requests. Under Google App Engine, I had to write request decorators to handle the cookies and session management.
- `login` first requests the login page as discussed, and parses it using Beautiful Soup before passing it to `get_form_fields` to find all the hidden and special (e.g. "submit") fields to ensure the right values are posted. Note how easy the `find_all` method makes it to locate all the input fields.
- `login` then fills in values specific to this card: patron name, card number ("code"), and PIN, before `post`ing the result and returning it to `check_card` for future processing
Finding the checkouts and holds pages
Once a user logs into the WPL, they see a summary page that contains rather a lot of personal information that they probably don't care to see every day, as well as some links that would allow them to update their account and, most importantly, a link to their holds and one to their checkouts, right at the bottom of the box above the "Log Out" button.
It's those last two that we'll be going after. Unfortunately, the links aren't clearly marked with an ID or even a class:
The URLs vary from patron to patron, so we can't hard-code them. I'll cheat a little and look for links that end in "/holds" or "/items":
Beautiful Soup looks on the summary page for `<a>` tags whose `href` matches one of the supplied regular expressions ("ends with /holds" or "ends with /items") and returns the results. Indexing by `"href"` returns that attribute's value. Since the URLs are relative, I join them to the original login URL to get absolute URLs.
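In sketch form, with a helper name (`find_account_urls`) invented for illustration:

```python
import re
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def find_account_urls(summary_html, login_url):
    """Return absolute URLs for the holds and checkouts pages."""
    summary_page = BeautifulSoup(summary_html, "html.parser")

    # Find the anchors whose href ends with "/holds" or "/items"
    holds_link = summary_page.find_all("a", href=re.compile("/holds$"))[0]
    items_link = summary_page.find_all("a", href=re.compile("/items$"))[0]

    # The hrefs are relative, so resolve them against the login URL
    return (
        urljoin(login_url, holds_link["href"]),
        urljoin(login_url, items_link["href"]),
    )
```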
Loading the Holds
The hold page repeats the same personal information from the summary page, and then lists all the patron's holds in a table.
And the HTML behind the table starts like this:
Each significant cell in the table has a `class` indicator that can be used to interpret the contents. Note that since the table is actually part of a form, where the patron can choose to cancel, freeze, or change the pickup location of an item, some `td` elements contain input controls, slightly complicating the parsing. Still, it's not that difficult to extract the information:
I load the page, find the table, and iterate over rows with the `patFuncEntry` class, extracting values to shove in a hold object, which is just a dictionary. The default action is to store the contents of the `td`, but the "Mark" column is just used to cancel holds, and conveys no information, so I drop it. The "Pickup" column always contains a number of selections, so I'm careful to grab the `option` element that is "selected". Finally, the "Freeze" column is effectively a boolean: if the `input` has a "checked" attribute, the hold is frozen.
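A sketch of that parsing, with the caveat that the header-matching details below are illustrative guesses rather than the exact column names:

```python
from bs4 import BeautifulSoup


def parse_holds(holds_html):
    """Extract each hold as a dictionary keyed by column header."""
    page = BeautifulSoup(holds_html, "html.parser")
    headers = [
        th.get_text(strip=True)
        for th in page.find("tr", class_="patFuncHeaders").find_all("th")
    ]

    holds = []
    for row in page.find_all("tr", class_="patFuncEntry"):
        hold = {}
        for header, cell in zip(headers, row.find_all("td")):
            name = header.lower()
            if "mark" in name:
                # just the cancel checkbox; conveys no information
                continue
            if "pickup" in name:
                # several <option>s; keep only the selected one
                hold[header] = cell.find("option", selected=True).get_text(strip=True)
            elif "freeze" in name:
                # effectively a boolean: frozen if the checkbox is checked
                checkbox = cell.find("input")
                hold[header] = bool(checkbox and checkbox.has_attr("checked"))
            else:
                # default: store the cell's text contents
                hold[header] = cell.get_text(strip=True)
        holds.append(hold)
    return holds
```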
Back in `check_card`, I just loop over the holds, printing a `dl` for each one, listing the attributes. It's not pretty, and would be better as a Jinja template, but it's good enough for a proof of concept.
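Something like this, give or take the exact markup:

```python
def render_holds(holds):
    # Quick-and-dirty output: one definition list per hold
    parts = []
    for hold in holds:
        items = "".join(f"<dt>{name}</dt><dd>{value}</dd>" for name, value in hold.items())
        parts.append(f"<dl>{items}</dl>")
    return "\n".join(parts)
```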
Loading the Checkouts
Parsing the checkouts was to have been the same as parsing the holds, so I was going to omit it, but that plan fell through when I found that the checkouts page's HTML is malformed in a way that defeated the `html.parser` library. Some `tr` tags aren't closed in the table:
As a result, Beautiful Soup saw only the "patFuncTitle" and "patFuncHeaders" rows. The workaround is to install the lxml XML and HTML parser and have Beautiful Soup use it:
```
pip install lxml
inv freeze
```
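With lxml installed, pointing Beautiful Soup at it is a one-argument change wherever the checkouts page gets parsed:

```python
# "lxml" copes with the unclosed <tr> tags that html.parser gives up on
page = BeautifulSoup(checkouts_html, "lxml")
```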
Deploying to Heroku
Deploying is straightforward. I use my fancy `inv deploy` command and set the new secret environment variables:

```
heroku config:set "PATRON_NAME=Blair Conrad"
heroku config:set CARD_NUMBER=123456789
heroku config:set PIN=9876
```
And voila:
Progress
Four of nine requirements have been met!
| status | requirement |
|--------|-------------|
| done | web app hosting |
| done | scheduled jobs (run in UTC) |
| done | scraping library websites on users' behalf |
| next | small persistent datastore |
|      | social authentication |
| done | sending e-mail |
|      | nearly free |
|      | job queues |
|      | custom domain name |