Recently, I've started converting many of my subversion repositories to git, some of which contain fairly large files (2-3G). However, git can be slow to deal with repositories with large files, and it also isn't able to selectively discard unneeded files when disk space is pressing. Thankfully, git-annex resolves most of these problems with git, but the process required to use git-annex on a converted subversion repository is slightly complicated.
Basic conversion of svn to git
The basic conversion of svn to git is done using git-svn:
git svn clone file:///srv/svn/foo --no-metadata -A authors.txt -T trunk foo
where /srv/svn/foo is the subversion repository, authors.txt is a list
of login = Full Name <email@example.com>
pairs matching each of the
subversion commit authors, and foo is the git repository to create.
git-svn has a ton of useful options, but the basic invocation above is all I'm concerned with.
Migrating large files from git into git-annex
In order to migrate from git to a git+git-annex setup, we'll have to walk the entire commit history, and edit each commit to instead store large files in git-annex, replacing the large file with a symlink, and finally eliminate all of the references to the old large objects, and do garbage collection.
Because we may have the same file move around, we're going to use the git-annex SHA1 backend instead of the default WORM backend which is based on filename and size, and init git-annex.
cd foo; echo '* annex.backend=SHA1' > .git/info/attributes
git annex init
Then, we're going to filter out the large files using git
filter-branch
. To do that, we'll first, we'll create a little helper
script git_annex_add.sh
, which will remove the file from the git
repository, add to git annex, and fix up the symlinks:
#!/bin/bash
f="$1";
git rm --cached "${f}";
git annex add "${f}";
annexdest="$(/bin/readlink -v ${f})";
ln -sf "${annexdest#../../}" "${f}";
echo -n "Added: "
ls -l "${f}";
Then we will run filter-branch, and annex all files larger than 5 megabytes. [Tweak the find command if you want to do something different.]
git filter-branch --tag-name-filter cat --tree-filter \
'find . -ipath \*.git\* -prune -o -path \*.temp\* -prune -o -size +5M -type f -print0|xargs -0 -r -n1 ~/git_annex_add.sh;
git reset HEAD .git-rewrite; :' -- master
This operation will take a while. [It would be better to do this during the initial svn→git conversion, but since that requires more knowledge of git-svn, svn, git, and git-annex internals than I have, and I only have to do this once for each repository, it's not worth my time.]
Now we have successfully switched everything to using git-annex, and we need to clean out the old references to the files:
rm .git/svn -rf;
rm -rf .git/refs/original .git/refs/remote/trunk .git/refs/remote/git-svn;
git reflog expire --expire=now --all
git gc --prune=now
git gc --prune=now --aggressive
(I'm not sure if the last two commands need to be separate; I'm cargo culting a bit there.)
Storing all git-annex files in a remote repository
Because git-annex allows you to easily throw away files which are no
longer referred to by the tip of any branch using git annex unneeded
(and because I'd like all of the files on my central remote
repository), I'm going to shove all of the git annex files into the
remote bare repository. Normally, you would use git annex copy
--to=remote;
to do this, but because that only copies needed files,
not everything, we'll have to do it manually.
First, create the remote repository:
git init --bare /srv/git/foo.git
cd /srv/git/foo.git; git annex init foo.example.com
Add the remote to the local repository, push to the remote, and sync the objects and sync the annex:
git remote add origin ssh://foo.example.com/srv/git/foo.git
git push origin master
rsync -avP .git/annex/objects ssh://foo.example.com/srv/git/foo.git/annex/.;
git annex sync
Finally, on the remote, run git annex fsck
to clean up the links to
the imported objects:
cd /srv/git/foo.git; git annex fsck;
Unresolved issues
I don't know if the above works properly for branches. I suspect that it does not. I also have not exhaustively tested this methodology to verify that all of the history is present in every case. But hopefully this post (or some modification of it) will be helpful to you.
Credit
Many of the methodologies described here I originally found in
tyger's git-annex forum post,
the git gc
stuff came from random google searches about shrinking
git repositories, and the rsync suggestion came from joeyh (author of
git-annex) and the other helpful denizens of #vcs-home on
irc.oftc.net.