Thursday, November 20, 2008

camp irregular news #1

Reposting a recent darcs-users update from Ian (with some very slight adjustments on my part for the conversion to HTML).




Hi all,

I haven't written about camp in some time, and a lot has happened, so I figure I should send an e-mail. So, here's the first edition of the "Camp Irregular News", if you will :-)

Mailing list


Camp now has a mailing list. I'll probably continue to send things of more general interest to the darcs list, but camp-specific stuff will generally go to the camp list. For details, please see http://projects.haskell.org/camp/contact

Bug tracker


But the main reason that camp has acquired a mailing list is that camp also now has a bug tracker and I wanted somewhere for the ticket change messages to go. For now, this is really just a TODO list, with the major missing pieces listed.

Development


And some real work, too. At and around the sprint, I:
  • Implemented "chunky" hunks, which mean that we don't need to break a file up into lines and then join it back together again when applying hunk patches
  • Implemented primitive interactive patch selection. It's nothing fancy, but it makes it easier to work with than the all-or-nothing record that camp had before
  • Made general improvements, e.g. there is now a repository type, rather than just misusing FilePath
  • Worked out how to get pkg-config, libcurl and Cabal to play nicely on Windows/MSYS/mingw
  • Made a darcs2camp tool
  • Implemented the "get" command

darcs2camp

darcs2camp is currently fiddly to build, as it needs to be linked against some of darcs's sources. In the near future it will either use libdarcs, or I'll fork a copy of darcs and wibble it until it just builds darcs2camp.

Due to working on each primitive patch separately, darcs2camp isn't the fastest beast in the world; on the 19766 megapatch (359470 primitive patches) GHC repo it takes me 1 hour 47 mins to convert from darcs to camp format. Then again, the original git conversion took 3 days, so it could be worse! And it shows a patches-converted count to keep you entertained.

The disk usage for darcs's patches directory is:
  disk usage                                  115M
  actual number of bytes                      49M
  actual number of bytes when uncompressed    204M


Meanwhile, camp's patch file weighs in at 214M (which is both the actual number of bytes and the disk usage, as it's all in one file). There are a number of things going on here:

  • camp currently doesn't store any meta-data, so the real figure should be a little more than 214M.
  • currently, if we store the primitive patch "name-3" inside the patch "name" then we store the string "name-3" even though we don't have to.
  • We could easily compress individual patches. Presumably if we did this with gzip then we'd get down to about 50M.
  • With a little work we could compress clumps of patches. However, gzipping the whole file only gets us down to 46M, so there is little to be gained there. bzip2ing the whole file gets us down to 38M.
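
The trade-off between compressing each patch on its own and compressing the whole file can be tried out with a small Python sketch. It uses only the standard gzip and bz2 modules; the patch contents are synthetic stand-ins (an assumption, not camp's real on-disk format), so it will not reproduce the 50M, 46M or 38M figures above, only the general shape of the comparison.

```python
import bz2
import gzip


def compressed_sizes(patches):
    """Return byte counts for several ways of storing a list of patch blobs."""
    whole = b"".join(patches)
    return {
        "raw": len(whole),
        # Each patch compressed independently (random access stays cheap).
        "gzip_each": sum(len(gzip.compress(p)) for p in patches),
        # The whole patch file compressed as a single stream.
        "gzip_whole": len(gzip.compress(whole)),
        "bzip2_whole": len(bz2.compress(whole)),
    }


# Hypothetical data: many small, similar patches. Real patches are larger,
# so the per-patch overhead matters far less than it does here.
patches = [b"hunk ./src/Main.hs %d\n-old line\n+new line\n" % i
           for i in range(1000)]
sizes = compressed_sizes(patches)
for name, size in sizes.items():
    print(name, size)
```

With patches this tiny, the fixed gzip header and trailer dominate the per-patch figure, which is why measuring on real repository data, as above, is what matters.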

"get"ing repos

Having darcs2camp and the "get" command means we can now easily do timings etc. for large repos.

Some timings for get and the ghc repo:
  • With darcs 1.0.9rc1, get takes around 5.5 seconds. However, I believe it's copying the pristine directory rather than actually applying the patches, which isn't safe if you can't lock the repo. For comparison, "darcs check" takes 1 minute 45 seconds, and that does essentially the same work that "get" is supposed to
  • With darcs 2.1.0, get takes 1 minute 29 seconds (and looks like it's behaving safely)
  • With camp, get takes 1 minute 37 seconds


I haven't looked at optimising get with camp yet, but one thing that should definitely make a big difference is batching up multiple changes to a single file. It is common to get a megapatch which contains a sequence of n patches which each change a hunk in the same file. When applying such a megapatch, camp currently reads and writes the whole file n times, which obviously isn't optimal! IIRC that made a significant difference when we added it to darcs, and I expect it will for camp too.
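
The batching idea can be illustrated with a minimal Python sketch. Everything in it is hypothetical: the dict standing in for the filesystem, the hunk representation (path, byte offset, old and new contents) and the function names are illustrative assumptions, not camp's or darcs's actual data structures. It also assumes the hunks for different files are independent, so regrouping them by path does not change the result.

```python
from collections import defaultdict


def apply_hunk(content, hunk):
    """Replace hunk['old'] at byte offset hunk['offset'] with hunk['new']."""
    start = hunk["offset"]
    end = start + len(hunk["old"])
    if content[start:end] != hunk["old"]:
        raise ValueError("hunk does not match file contents")
    return content[:start] + hunk["new"] + content[end:]


def apply_unbatched(files, hunks):
    """Read and write the whole file once per hunk; returns the I/O count."""
    io_count = 0
    for h in hunks:
        content = files[h["path"]]                 # read whole file
        files[h["path"]] = apply_hunk(content, h)  # write whole file
        io_count += 2
    return io_count


def apply_batched(files, hunks):
    """Read and write each file once, applying its hunks in memory."""
    by_path = defaultdict(list)
    for h in hunks:
        by_path[h["path"]].append(h)  # keeps the original order per file
    io_count = 0
    for path, file_hunks in by_path.items():
        content = files[path]                      # one read
        for h in file_hunks:
            content = apply_hunk(content, h)
        files[path] = content                      # one write
        io_count += 2
    return io_count
```

For n hunks touching one file, the unbatched version does 2n whole-file operations and the batched version does 2, which is the saving described above.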

camp is also cheating slightly, as it doesn't do a syntactic-validity check of the patches it is given before applying them. This means that it'll fail less prettily than it ought to. However, I'm not sure if darcs also cheats, and I don't expect that it will make much difference to the time taken anyway.

Camp's space usage while "get"ing is currently higher than it should be because of http://hackage.haskell.org/trac/ghc/ticket/2762 so I can't get good figures for that at the moment.

What next?


The above is mostly development stuff, mainly due to being at the sprint. I plan to focus more on theory stuff next. As you may have seen on the darcs list, I've started thinking about conflict marking, and I also have some patch theory proofs in my head that I need to get written down in the paper.

Thanks
Ian
