Tuesday, November 23, 2010

Coming in Darcs 2.8: read-only support for old-fashioned repositories

A month ago, we released Darcs 2.5. This version brought performance improvements and a few nice features like trackdown --bisect and UTF8 patch metadata. Since we follow a time-based release schedule, Darcs 2.8, the next major release, will be released in May 2011, and we are at the beginning of the work that will be part of it. So what are we working on now?

One of the hot topics of Darcs 2.8 will be repository formats. As of now, Darcs can work with two kinds of repositories: the old-fashioned (OF) ones and the Hashed ones. OF has been around since the very first Darcs, while the Hashed format has been introduced in Darcs 2.0 in April 2008.

Why Hashed

The reasons why the Hashed format was introduced were robustness and performance.

To understand robustness, one has to know that darcs repositories contains patches (of course), but also a "pristine cache" which is the latest recorded state of the files stored in the repository. It can be rebuilt from the patches of the repository, so it is not really mandatory in theory, however it makes some commands run in reasonable time. The command whatsnew, for instance, show the current unrecorded modifications in your working copy, and works by comparing your working copy with the pristine.

In OF repositories, the pristine directory is simply a plain filesystem tree. The consequences are that if an external program (say Subversion or Unison) adds files to that directory, Darcs will believe they really belong to the pristine cache! Hence, the command whatsnew would tell that you are missing some file in your working copy, and the command record would record a wrong patch! More generally, OF repositories can not be really checked for integrity: Darcs can not know you modified a pristine file by hand.

In Hashed repositories, the pristine is a directory containing files named by the SHA1 hash of their contents, and there is a special file that tells the hash of the root directory of the pristine. Hence, adding bogus files to the pristine directory is harmless: they will never be read. So bring on all your Subversions, Unisons and Dropboxes: Darcs can not longer be tricked.

There is another downside of OF: getting them via HTTP is very slow. Indeed, they contain no list of pristine files. You can not directly copy the remote pristine files in order to get a working copy. So you need to get all the patches, and then locally rebuild a pristine and a working copy, making the delay until getting a working copy linear in the history of the project. On the other hand, the latest working copy of a Hashed repo can be obtained with darcs get --lazy, which does not need to retrieve the history of the project.

The developers are not amused

Those are downsides of OF for users. But we, as developers, are also not very pleased with some parts of the Darcs codebase. The code that handles repositories is not really great and lacks modularity. Moreover, nobody is really motivated to maintain code for OF, since we know Hashed repositories are better and there is ongoing work in order to make Darcs more performant with them. For instance, the work done in Summer of Code 2009 has then been incorporated in Darcs 2.4 and has boosted performance of Darcs a lot. In Darcs 2.5, performance of commands like record and pull have been substantially improved. None of these changes concern OF repositories.

Hence we have decided to stop struggling with old code, and from Darcs 2.8, only read-only access will be provided for OF repositories. This means you will still be able to use the commands get, pull  and send to interact with remote OF repositories. On the other hand writing in an OF repository with record, pull or apply will no longer work, and commands that rely on the pristine files, like whatsnew, diff or dist, will also fail. We will make sure the user interface will remain clear and  helpful in those cases.

For us, this is a step further towards simplify the code base and make it generally more modular. This will help us focussing on performance improvement for the Hashed format and modern features for the Darcs client without having to think about OF maintenance.

Should it concern you?

Switching to Hashed repositories has a drawback: Darcs 1 binaries do not know how to interact with them. If you manage a project whose source code is hosted in a Darcs repository, then you should ensure that all contributors use a Darcs 2 binary (darcs --version).

If you or someone in your project use a Darcs 1 binary, you should check whether it is possible for you to upgrade to Darcs 2.0 or greater. Please consult http://wiki.darcs.net/Binaries for more information and http://wiki.darcs.net/DarcsTwo for converting from OF to Hashed.

For the record, Darcs 2.0 was released in April 2008 and is now adopted in most current versions of the major Linux distributions. The release of Darcs 2.8 is planned for May 2011, so no problematic situation due to OF deprecation will happen before then.

    Consolidating the move

    We know that a few shortcomings remain with hashed repositories. For instance darcs add is noticeably slower on big Hashed repositories: http://bugs.darcs.net/issue1938 . We will try to address these issues by the release of 2.8. We are very interested in knowing whether there are other issues, so please let us know on the bug tracker or by commenting this post.

    Okay, we talked about a "un-feature", but Darcs 2.8 will also contain a few new features that will probably please a lot of you. This will be for another blog post very soon.

    1 comment:

    Angel17 said...

    This is a very interesting topic. I will wait for the other issues. chaffeyroofingontario.com

    Followers