The biggest difference is that the old Feedsearch library has now been replaced with the new Feedsearch Crawler library, which is available for use in other projects as a PyPI package. The crawler is now able to make HTTP requests asynchronously, drastically increasing the number of URLs that can be checked for feeds in the same amount of time, which in turn increases the feed detection rate in cases where the feed URL is not properly advertised.
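To illustrate why asynchronous requests help, here is a minimal sketch using Python's asyncio. It is not the crawler's actual code: `check_url` is a hypothetical stand-in that sleeps instead of making a real HTTP request, and the suffix test is an invented placeholder for real feed detection. The point is only that the waits overlap, so the total time is close to that of the slowest single check rather than the sum of all of them.

```python
import asyncio


async def check_url(url: str, delay: float = 0.1) -> tuple[str, bool]:
    """Hypothetical stand-in for fetching a URL and testing whether it is a feed."""
    # Simulate network latency; a real crawler would await an HTTP response here.
    await asyncio.sleep(delay)
    # Placeholder heuristic (an assumption for this sketch, not real detection).
    return url, url.endswith((".xml", ".rss", "/feed"))


async def check_all(urls: list[str]) -> list[str]:
    """Check every URL concurrently and return those that look like feeds."""
    # gather() runs all checks at once and returns results in input order.
    results = await asyncio.gather(*(check_url(url) for url in urls))
    return [url for url, is_feed in results if is_feed]


if __name__ == "__main__":
    candidates = [
        "https://example.com/",
        "https://example.com/feed",
        "https://example.com/about",
        "https://example.com/rss.xml",
    ]
    print(asyncio.run(check_all(candidates)))
```

With a synchronous HTTP client, the four checks above would take roughly four times the latency of one request; here they complete in about the time of one.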
The second new feature is that all discovered feeds are now saved to a database, along with the website and the URL paths where they were discovered. In this way we can now return a list of all feeds that have been discovered on a site, and remove the need to crawl a site or URL every time a feed search is requested.
In addition to these, I've extended the results provided by the API to add a few new fields:
- content_length: The length of the feed in bytes at the time it was last seen.
- is_podcast: Whether or not the feed contains valid podcast elements and enclosures.
- last_seen: The date that the feed was last crawled by Feedsearch.
- last_updated: The date that the feed was last updated by the publisher.
- velocity: A calculation of the mean number of entries per day in the feed, at the time that the feed was last crawled.
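As a rough illustration of the velocity field, the sketch below computes a mean number of entries per day from a list of entry dates. The formula (entry count divided by the span between the oldest and newest entry) and the function name `entry_velocity` are assumptions made for this example; the crawler may calculate the value differently.

```python
from datetime import datetime, timezone


def entry_velocity(entry_dates: list[datetime]) -> float:
    """Mean entries per day over the period covered by the given entry dates."""
    # With fewer than two dates there is no time span to average over.
    if len(entry_dates) < 2:
        return 0.0
    span = max(entry_dates) - min(entry_dates)
    days = span.total_seconds() / 86400  # seconds in a day
    if days <= 0:
        return 0.0
    return round(len(entry_dates) / days, 3)


# Four entries spread over ten days.
dates = [
    datetime(2020, 1, 1, tzinfo=timezone.utc),
    datetime(2020, 1, 3, tzinfo=timezone.utc),
    datetime(2020, 1, 5, tzinfo=timezone.utc),
    datetime(2020, 1, 11, tzinfo=timezone.utc),
]
velocity = entry_velocity(dates)
```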
The Feedsearch Crawler is built as a separate library from the API, so that it's available for others to use, fork, and extend as they see fit.
I had been thinking for a few years now that what was really needed for feed detection was a proper Web crawler. As I kept upgrading the original Feedsearch library to improve detection rates and add features such as favicon fetching, the time needed to perform a search kept increasing, due to the synchronous design of the Requests HTTP library. It was when I started adding asynchronous HTTP requests to Feedsearch, and therefore had to keep track of which URLs had already been requested, that I realised I was just building a crappier Web crawler, without a proper architecture. So I built my own, because why not?
The design and features of the base crawler itself are nothing special, but the FeedsearchSpider built on top of it is meant to be as discriminating as possible in what it crawls, while still finding as many feeds as possible. In order to do this the spider uses a number of regular expressions to filter page links that may lead to feeds, and then prioritises them for the crawl queue.
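The filter-then-prioritise idea can be sketched as follows. The regular expressions, priority values, and function names here are invented for illustration and are much simpler than whatever the FeedsearchSpider actually uses; the sketch only shows the general shape of discarding unlikely links and queueing the rest by how feed-like they look.

```python
import heapq
import re

# Illustrative patterns only (assumptions for this sketch).
FEED_LIKE = re.compile(r"rss|atom|feed|\.xml", re.IGNORECASE)
MAYBE_FEED_PAGE = re.compile(r"blog|news|podcast|archive", re.IGNORECASE)
SKIP = re.compile(r"\.(jpe?g|png|gif|css|js|pdf|zip)$|^mailto:", re.IGNORECASE)


def link_priority(url: str) -> "int | None":
    """Return a crawl priority (lower is sooner), or None to drop the link."""
    if SKIP.search(url):
        return None
    if FEED_LIKE.search(url):
        return 1  # probably a feed itself
    if MAYBE_FEED_PAGE.search(url):
        return 2  # a page that may link to feeds
    return None  # not worth crawling


def build_queue(links: list[str]) -> list[tuple[int, int, str]]:
    """Push the surviving links onto a heap ordered by priority."""
    queue: list[tuple[int, int, str]] = []
    for order, url in enumerate(links):
        priority = link_priority(url)
        if priority is not None:
            # 'order' breaks ties so equal-priority links keep page order.
            heapq.heappush(queue, (priority, order, url))
    return queue


def crawl_order(links: list[str]) -> list[str]:
    """Return the filtered links in the order they would be crawled."""
    queue = build_queue(links)
    return [heapq.heappop(queue)[2] for _ in range(len(queue))]
```

Dropping links before they reach the queue is what keeps the crawl small: a link that matches neither pattern is never requested at all.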
Feedsearch Gateway API
The API is built as the Feedsearch Gateway project, and is responsible for all the custom functionality that is not required in the generic crawler library, such as providing the actual API and dealing with saving results.
When crawling websites, it's not particularly polite to use a lot of resources, and so I needed the ability to cache the results. Because I already had Feedsearch running on AWS Lambda, I decided to use DynamoDB for the database. Whenever a search is completed, the following information is saved to the database:
- All feeds found in the crawl and their metadata.
- The host URL of the site and the time it was crawled.
- The URL of the path that was crawled, the time it was crawled, and the URLs of the feeds that were found from that path.
When a search is started, the API first queries the database for the feeds that were already found and the site information. If the search is for the host URL of the site (e.g. https://example.com), then the site is only crawled again if it hasn't been crawled recently, and all feeds that have ever been found to belong to the site are returned.
If the search contains a path (e.g. https://example.com/path/testing/) then the database is checked to see if that path has already been crawled. If the path has been crawled recently, then the list of feed URLs found at that path is checked against the site's feeds, and the matching feeds are returned. When a path is searched, only feeds found from the crawl of that path are returned. This is done to increase the chance that only feeds relevant to that particular query are returned, especially if the site contains a lot of feeds.
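The lookup rules described above can be sketched with plain dictionaries standing in for the database. Everything here is hypothetical: the record layout, the `plan_search` name, and the seven-day `RECRAWL_AFTER` threshold are assumptions for illustration, not the Gateway's actual schema or settings. The function returns the cached feeds for the query along with a flag saying whether a fresh crawl is needed.

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import urlparse

RECRAWL_AFTER = timedelta(days=7)  # assumed definition of "recently"


def is_recent(crawled_at, now: datetime) -> bool:
    """True if a crawl time exists and is newer than the recrawl threshold."""
    return crawled_at is not None and now - crawled_at < RECRAWL_AFTER


def plan_search(url: str, site: dict, now: datetime) -> dict:
    """Decide whether to crawl and which cached feeds apply to this query."""
    path = urlparse(url).path.rstrip("/")
    if not path:
        # Host URL: return every feed known for the site, and crawl
        # again only if the site has not been crawled recently.
        return {
            "crawl": not is_recent(site.get("last_crawled"), now),
            "feeds": sorted(site["feeds"]),
        }
    record = site["paths"].get(path)
    if record and is_recent(record["last_crawled"], now):
        # Path crawled recently: return only the feeds found from that path.
        matching = [f for f in record["feeds"] if f in site["feeds"]]
        return {"crawl": False, "feeds": matching}
    # Unknown or stale path: it needs a fresh crawl.
    return {"crawl": True, "feeds": []}


now = datetime(2020, 1, 10, tzinfo=timezone.utc)
site = {
    "last_crawled": datetime(2020, 1, 9, tzinfo=timezone.utc),
    "feeds": {
        "https://example.com/feed.xml": {},
        "https://example.com/podcast.rss": {},
    },
    "paths": {
        "/blog": {
            "last_crawled": datetime(2020, 1, 9, tzinfo=timezone.utc),
            "feeds": ["https://example.com/feed.xml"],
        },
        "/path/testing": {
            "last_crawled": datetime(2020, 1, 1, tzinfo=timezone.utc),
            "feeds": ["https://example.com/podcast.rss"],
        },
    },
}
```

In this sample data, a search for the host URL needs no crawl and gets both feeds, a search for /blog/ gets only the one feed found there, and a search for /path/testing/ triggers a new crawl because its record is older than the assumed threshold.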
Finally, the Feedsearch site itself is run behind Cloudflare Workers. Cloudflare doesn't cache HTML by default, but because the API homepage is designed to be as static as possible, it makes sense to cache it with the Workers cache API, so that we don't have to make a request to the Lambda function every time the page is requested. Instead, only requests to the API route are forwarded to the Lambda.