
Blog Rewind is harder than I thought

Earlier I told you about an idea I call Blog Rewind, which would let you read or export a blog’s archives in a convenient way.  I got some positive feedback on the idea, so I wanted to explore it further.  I spent a few days thinking, reading and writing, trying to nail down the details of how I would build such a system.  It turns out to be much harder than I originally thought.

The conceptual problem hit me before the technical problems did: there is no generic way to access a blog’s archives.  They can be accessed by different URLs, or in some cases can’t be accessed at all.  Some but not all are linked to from the homepage.  Some are organized by month, some by year, some as one long list.  (I will talk more about technical details later in the post.  Hang in there if you are a nerd.)

When I thought about the problem in an abstract way, I realized that I was tackling the same problem that Google faced 10 years ago, namely reading the entire internet, and making some sense out of it.  They did some pretty cool things, and if I want to mimic them, I’d need a lot more resources than I had available to me.

What interests me here is the inherent challenge of developing useful software, rather than specifics of my idea.  It’s a problem seen often in my business of software engineering.  I identified a simple problem with a simple solution.  People would have liked the solution, if I had implemented it as I envisioned it.  But that vision was not practical, once I peeked under the covers even a little bit.  Implementation is still not a foregone conclusion, even in our day of high-level languages and fast processors.

Speaking of implementation, yes, I promised I would talk details.  For the non-techies, I won’t be offended if you stop reading now.  Continue at your own risk.

Recall that I’m trying to solve a simple problem: given the URL of an arbitrary blog, can I find a list of either all its pages or all its archive pages?

Here’s my thought process, approximately:

First, is there an API that all or most blogging software supports?  I found out a little about the XML-RPC API that Windows Live Writer uses, but that wouldn’t work, since you need a username even for the read-only parts of the API, and I wouldn’t have one.

Next I checked a few of the sites that do similar things, like Google Blog Search and Technorati.  Both struck out as well.

The next step I considered was the sitemap file.  Most sites publish a sitemap, which lists some or all of the pages on the site.  I could find the sitemap through robots.txt, which would make things a little more deterministic.  That would’ve been perfect, but WordPress does not have a sitemap by default, and Blogger has only a partial one.  MSDN blogs seemed not to have one at all.
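For the curious, here is roughly what the sitemap lookup might look like in Python.  This is a minimal sketch, not something I built: the function names are made up for illustration, it assumes the robots.txt and sitemap bodies have already been downloaded as text, and it assumes the sitemap follows the standard sitemaps.org XML format.

```python
import xml.etree.ElementTree as ET

# Namespace used by the sitemaps.org sitemap format.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def sitemap_urls_from_robots(robots_txt):
    """Return the URLs named on 'Sitemap:' lines of a robots.txt body."""
    urls = []
    for line in robots_txt.splitlines():
        # Drop trailing comments, then split 'Field: value' on the first colon.
        line = line.split("#", 1)[0].strip()
        field, sep, value = line.partition(":")
        if sep and field.strip().lower() == "sitemap" and value.strip():
            urls.append(value.strip())
    return urls


def page_urls_from_sitemap(sitemap_xml):
    """Return every <loc> value found in a sitemap (or sitemap index) body."""
    root = ET.fromstring(sitemap_xml)
    return [loc.text.strip() for loc in root.iter(SITEMAP_NS + "loc") if loc.text]
```

If the first function returns an empty list, the blog simply does not advertise a sitemap in robots.txt, which is exactly the case that sank this approach for me.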

Then I started thinking about a custom algorithm.  Every blog will have an RSS feed, so maybe I could pull that and then, based on two different posts, extrapolate a pattern.  Then I could use the pattern to construct URLs for the archive pages, and parse each of those.  That opened a whole other can of worms, though.  Could I find the date in a feed, even if the feed didn’t use the standard YYYY/MM/DD format?  Could I extrapolate the pattern?  What if the archive pages didn’t exist by year?  By month?  By day?  Could I request a huge number of pages in a reasonable amount of time without looking like a DoS attack?  And on and on.  So I decided to table that idea.
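To make the pattern idea concrete, here is a rough Python sketch of the easy case only.  It assumes the post URLs pulled from the feed contain a /YYYY/MM/ segment and that monthly archive pages live at the same prefix; both are assumptions that many blogs will not satisfy, and the function names are hypothetical.

```python
import re
from urllib.parse import urlsplit, urlunsplit

# Matches an optional path prefix followed by /YYYY/MM (assumed URL layout).
DATE_IN_PATH = re.compile(
    r"^(?P<prefix>.*?)/(?P<year>(?:19|20)\d{2})/(?P<month>0[1-9]|1[0-2])(?=/|$)"
)


def find_year_month(url):
    """Return (base_url, year, month) if the URL path contains /YYYY/MM, else None."""
    parts = urlsplit(url)
    match = DATE_IN_PATH.match(parts.path)
    if match is None:
        return None
    base = urlunsplit((parts.scheme, parts.netloc, match.group("prefix"), "", ""))
    return base, int(match.group("year")), int(match.group("month"))


def monthly_archive_urls(post_urls, first, last):
    """Guess monthly archive URLs from `first` to `last`, both (year, month), inclusive.

    Returns an empty list when the post URLs do not share one /YYYY/MM pattern.
    """
    found = [find_year_month(url) for url in post_urls]
    bases = {item[0] for item in found if item is not None}
    if None in found or len(bases) != 1:
        return []
    base = bases.pop()
    urls = []
    year, month = first
    while (year, month) <= last:
        urls.append(f"{base}/{year:04d}/{month:02d}/")
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return urls
```

Even when this works, the guessed URLs still have to be fetched one by one to see whether they exist, which brings back the question of how many requests I can make without looking like an attack.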

I figured my last option would be search APIs.  I checked Ask, Live and Google.  Ask’s didn’t exist, and Live’s didn’t work for some reason that I now cannot remember.  Google’s blog search API looked the most promising, so I checked that out a little more.  There were several problems.  I wasn’t really searching for any specific term, just “the oldest post”.  If I used “archive” as the term, either all posts or none would show up, which is not so helpful.  So I figured I would just put in an empty search and check all the results.  Unfortunately, I could get 64 results at most, which wouldn’t be enough for my little blog, let alone the good old ones that you would want to read in the first place.

Now, some more fiddling may solve this problem, and I don’t know a lot about this area, so if I missed something, I’d love to hear it.  Still, I don’t think it will be as easy as I hoped.


One Trackback/Pingback

  1. The differing mindsets of PMs and developers | Sam Strasser on Friday, January 2, 2009 at 8:50 am

    [...] I wrote about the challenge I faced when trying to make a relatively simple software application.  On the one hand, the idea seemed simple and relatively trivial.  On the other, it turned out to [...]
