<h1>Migrating from Subversion to git with git-annex</h1>
<p>Don Armstrong, 2012-08-09T21:49:39Z,
<a href="https://www.donarmstrong.com/posts/migrating_from_svn_to_git_and_git_annex/">https://www.donarmstrong.com/posts/migrating_from_svn_to_git_and_git_annex/</a></p>
<p>Recently, I've started converting many of my subversion repositories
to git, some of which contain fairly large files (2-3G). However, git
can be slow to deal with repositories with large files, and it also
isn't able to selectively discard unneeded files when disk space is
pressing. Thankfully,
<a href="https://www.google.com/search?q=git-annex">git-annex</a> resolves most
of these problems with git, but the process required to use git-annex
on a converted subversion repository is slightly complicated.</p>
<h2 id="basicconversionofsvntogit">Basic conversion of svn to git</h2>
<p>The basic conversion of svn to git is done using git-svn:</p>
<pre><code> git svn clone file:///srv/svn/foo --no-metadata -A authors.txt -T trunk foo
</code></pre>
<p>where /srv/svn/foo is the subversion repository, authors.txt is a list
of <code>login = Full Name &lt;email@example.com&gt;</code> pairs matching each of the
subversion commit authors, and foo is the git repository to create.</p>
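<p>If you don't already have an authors file, a skeleton can be generated
from the subversion log. The following is only a sketch: the helper name
<code>authors_skeleton</code> is made up, it assumes the usual
<code>svn log -q</code> line layout (<code>r42 | login | date</code>), and
the example.com addresses are placeholders to be edited by hand.</p>

```shell
#!/bin/bash
# Sketch: build a skeleton authors file from `svn log -q` output on stdin.
# authors_skeleton is a hypothetical helper name; the example.com
# addresses are placeholders that need to be replaced by hand.
authors_skeleton() {
    # Revision lines look like: r42 | login | 2012-08-09 21:49:39 +0000 (...)
    awk -F' [|] ' '/^r[0-9]+ [|] / { print $2 }' | sort -u |
    while IFS= read -r login; do
        printf '%s = %s <%s@example.com>\n' "$login" "$login" "$login"
    done
}

# Typical use (assumes the svn command-line client is installed):
#   svn log -q file:///srv/svn/foo | authors_skeleton > authors.txt
```

<p>Each login appears once, sorted, so the file is easy to fill in before
running the clone.</p>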
<p>git-svn has a ton of useful options, but the basic invocation above is
all I'm concerned with.</p>
<h2 id="migratinglargefilesfromgitintogit-annex">Migrating large files from git into git-annex</h2>
<p>In order to migrate from git to a git+git-annex setup, we'll have to
walk the entire commit history, and edit each commit to instead store
large files in git-annex, replacing the large file with a symlink, and
finally eliminate all of the references to the old large objects, and
do garbage collection.</p>
<p>Because the same file may move around within the history, we're going
to use the git-annex SHA1 backend, which keys files by their content,
instead of the default WORM backend, which is based on filename and
size. Then we init git-annex.</p>
<pre><code> cd foo; echo '* annex.backend=SHA1' > .git/info/attributes
git annex init
</code></pre>
<p>Then, we're going to filter out the large files using <code>git
filter-branch</code>. To do that, we'll first create a little helper
script <code>git_annex_add.sh</code>, which will remove the file from the git
repository, add it to git-annex, and fix up the symlink:</p>
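<pre><code> #!/bin/bash
# Move one file out of git and into git-annex.
f="$1";
# Drop the file from the index so git stops tracking its contents.
git rm --cached "${f}";
# Hand the file to git-annex, which replaces it with a symlink.
git annex add "${f}";
# Read the link target and strip the leading ../../ from it.
annexdest="$(/bin/readlink -v "${f}")";
ln -sf "${annexdest#../../}" "${f}";
echo -n "Added: "
ls -l "${f}";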
</code></pre>
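<p>The <code>${annexdest#../../}</code> expansion is the least obvious part
of the script. My understanding is that filter-branch runs the tree filter
in a temporary checkout under <code>.git-rewrite/t</code>, so the links
git-annex creates there climb two directories more than they should in the
real working tree. A small sketch of what the expansion does, using made-up
link targets (the helper name <code>strip_rewrite_prefix</code> and the
<code>aa/bb/KEY</code> paths are purely illustrative):</p>

```shell
#!/bin/bash
# Sketch of the "${annexdest#../../}" step in git_annex_add.sh.
# The link targets below are made up for illustration; real ones
# contain git-annex's own hash directories and key names.
strip_rewrite_prefix() {
    # Remove exactly one leading "../../" if it is present.
    printf '%s\n' "${1#../../}"
}

# A file at the top of the tree:
strip_rewrite_prefix '../../.git/annex/objects/aa/bb/KEY/KEY'
# A file two directories down keeps the climb it still needs:
strip_rewrite_prefix '../../../../.git/annex/objects/aa/bb/KEY/KEY'
```

<p>Only one leading <code>../../</code> is removed, so files in
subdirectories still end up with a relative link that reaches
<code>.git/annex</code>.</p>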
<p>Then we will run filter-branch, and annex all files larger than 5
megabytes.
[Tweak the find command if you want to do something different.]</p>
<pre><code> git filter-branch --tag-name-filter cat --tree-filter \
'find . -ipath \*.git\* -prune -o -path \*.temp\* -prune -o -size +5M -type f -print0|xargs -0 -r -n1 ~/git_annex_add.sh;
git reset HEAD .git-rewrite; :' -- master
</code></pre>
<p>This operation will take a while.
[It would be better to do this during the initial svn→git conversion, but since that requires more knowledge of git-svn, svn, git, and git-annex internals than I have, and I only have to do this once for each repository, it's not worth my time.]</p>
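<p>Since the rewrite is slow, it may be worth previewing which files the
find expression selects before starting it. This is a sketch only: the
helper name <code>list_annex_candidates</code> is made up, and it looks at
the current checkout rather than at every commit in the history.</p>

```shell
#!/bin/bash
# Sketch: list the files the tree filter's find expression would select,
# without touching git. list_annex_candidates is a hypothetical helper name.
list_annex_candidates() {
    # $1: directory to scan (defaults to the current directory)
    ( cd "${1:-.}" &&
      find . -ipath '*.git*' -prune -o -path '*.temp*' -prune \
             -o -size +5M -type f -print )
}

# Typical use, from the top of the converted repository:
#   list_annex_candidates . | sort
```

<p>Changing the <code>-size</code> threshold here and in the filter-branch
command together keeps the preview and the rewrite in agreement.</p>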
<p>Now we have successfully switched everything to using git-annex, and
we need to clean out the old references to the files:</p>
<pre><code> rm -rf .git/svn;
rm -rf .git/refs/original .git/refs/remotes/trunk .git/refs/remotes/git-svn;
git reflog expire --expire=now --all
git gc --prune=now
git gc --prune=now --aggressive
</code></pre>
<p>(I'm not sure if the last two commands need to be separate; I'm cargo
culting a bit there.)</p>
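<p>Removing ref files by hand only catches refs that exist as loose files.
An alternative that should also cover refs recorded in
<code>.git/packed-refs</code> is to delete them through git's own plumbing.
This is a sketch I have not run against a converted repository; the helper
name <code>drop_stale_refs</code> is made up, and the ref names assume the
git-svn layout used above.</p>

```shell
#!/bin/bash
# Sketch: delete the leftover refs through git itself instead of rm -rf,
# so that refs stored in .git/packed-refs are covered too.
# drop_stale_refs is a hypothetical helper name.
drop_stale_refs() {
    git for-each-ref --format='delete %(refname)' \
        refs/original refs/remotes/trunk refs/remotes/git-svn |
    git update-ref --stdin
}

# Typical use, inside the converted repository, before the reflog/gc steps:
#   drop_stale_refs
#   git for-each-ref refs/original refs/remotes   # should print nothing
```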
<h2 id="storingallgit-annexfilesinaremoterepository">Storing all git-annex files in a remote repository</h2>
<p>Because git-annex allows you to easily throw away files which are no
longer referred to by the tip of any branch using <code>git annex unused</code>
and <code>git annex dropunused</code> (and because I'd like all of the files
on my central remote repository), I'm going to shove all of the git-annex
files into the remote bare repository. Normally, you would use
<code>git annex copy --to=remote</code> to do this, but because that only
copies the files the current tree refers to, not every object in the
annex, we'll have to do it manually.</p>
<p>First, create the remote repository:</p>
<pre><code> git init --bare /srv/git/foo.git
cd /srv/git/foo.git; git annex init foo.example.com
</code></pre>
<p>Add the remote to the local repository, push to it, copy the annex
objects across with rsync, and then sync the annex:</p>
<pre><code> git remote add origin ssh://foo.example.com/srv/git/foo.git
git push origin master
rsync -avP .git/annex/objects foo.example.com:/srv/git/foo.git/annex/.;
git annex sync
</code></pre>
<p>Finally, on the remote, run <code>git annex fsck</code> to clean up the links to
the imported objects:</p>
<pre><code> cd /srv/git/foo.git; git annex fsck;
</code></pre>
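<p>As a rough sanity check on the manual copy, one could compare the number
of object files on each side. This is only a sketch: the helper name
<code>count_annex_objects</code> is made up, and a matching count does not
prove the contents are intact (that is what <code>git annex fsck</code> is
for).</p>

```shell
#!/bin/bash
# Sketch: count the object files under a repository's annex directory.
# count_annex_objects is a hypothetical helper name.
count_annex_objects() {
    # $1: path to the annex directory (.git/annex locally, annex in a bare repo)
    find "$1/objects" -type f | wc -l | tr -d ' '
}

# Typical use (the ssh invocation assumes shell access to the remote host):
#   count_annex_objects .git/annex
#   ssh foo.example.com "find /srv/git/foo.git/annex/objects -type f | wc -l"
```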
<h2 id="unresolvedissues">Unresolved issues</h2>
<p>I don't know if the above works properly for branches. I suspect that
it does not. I also have not exhaustively tested this methodology to
verify that all of the history is present in every case. But hopefully
this post (or some modification of it) will be helpful to you.</p>
<h2 id="credit">Credit</h2>
<p>Many of the methodologies described here I originally found in
<a href="http://git-annex.branchable.com/forum/migrate_existing_git_repository_to_git-annex/">tyger's git-annex forum post</a>.
The <code>git gc</code> stuff came from random google searches about shrinking
git repositories, and the rsync suggestion came from joeyh (author of
git-annex) and the other helpful denizens of #vcs-home on
irc.oftc.net.</p>