XML and Immortal Docments

I just read Jeff Haung’s A Manifesto for Preserving Content on the Web. He made some good suggestions (seven of them) to help keep web content available as technical progress works hard to erase everything digital that has gone before.

I don’t know if everything published to the web deserves to be saved but much of it does and it’s a shame that we don’t have some industry standard way to preserve old websites. Jeff notes the that Wayback Machine and Archive.org preserve some content but are subject to the same dilemma as the rest of web–eventually every tech dies of it’s native form of link rot.

For longer than I care to admit (11 years!), I’ve been posting my own thoughts to my own WordPress instance. But one day WordPress or me will depart this node of existence. I’m considering migration to a hosted solution and something like Jekyll. That may well postpone the problem but not solve it. I could archive my words on a CD of some sort. But will my decedents be able to parse WordPress or Jekyll or any contemporary file format?

While I like the idea of printing PDFs to stone tablets from a perversity stand point what is really needed is a good articulation of the problem and a crowd sourced, open source solution.

Jeff’s first suggestion is pretty good: “return to vanilla HTML/CSS.” But what is version of HTML/CSS is vanilla? The original version? The current version? Tomorrow’s version? That is the problem with living tech! It keeps evolving!

I would like to suggest XML 1.1. It’s not perfect but its stable (i.e. pretty dead, unlikely to change), most web documents can be translated into it, and most importantly we have it already.

I know that XML is complex and wordy. I would not recommend XML for your web app’s config file format or build system’s make file. But as an archiving format I think XML would be pretty good.

If all our dev tools, from IDEs to blog editors, dumped an archive version of our output as XML, future archaeologists could easily figure out how to resurrect our digital expressions.

As an added bonus, an archive standard based on XML would help services like Wayback Machine and archive.org do their jobs more easily.

Even better, it would be cool if we all chip in to create a global XML digital archive. An Esperanto for the the divergent digital world! We could keep diverging our tech with a clear conscious and this archive would be the place for web browsers and search engines to hunt for the ghosts of dead links.

Now there are all sorts of problems with this idea. Problems of veracity and fidelity. Problems of spam and abuse. We would have to make the archive uninteresting to opportunists and accept some limitations. A good way to solve these type of problems is to limit the archive to text-only written in some dead language, like Latin, where it would would be too much effort to abuse (or that abuse would rise to the level of fine art).

What about the visual and audio? Well, it could be described. Just like we (are supposed to do) for accessibility. The descriptions could be generated by machine learning (or people, I’m not prejudiced against humans). It just has to be done on the fly without out human initiation or intervention.

Perfect! Now, everything time I release an app, blog post, or video clip, an annotated text description written in Latin and structured in XML is automagically archived in the permanent collection of human output.