GSoC @ Git Final Report
My GSoC project: More Sparse Index Integrations
My proposal: GSoC 2022 Git Contributor Proposal
Previous blogs:
GSoC Week 1
GSoC Week 2
GSoC Week 3 - 4
GSoC Week 5 - 6
GSoC Week 7 - 8
GSoC Week 9
Summary
During my GSoC project, I worked on integrating a series of commands with sparse index. There are three commands that I’ve been working on: git-mv, git-rm, and git-grep. I’ll explain what the situation was, what I’ve done, and what the result is for each command.
git-mv
- Series “out-to-in” (in branch ‘master’)
GitHub branch
Mailing list
mv: add check_dir_in_index() and solve general dir check issue
mv: use flags mode for update_mode
mv: check if <destination> exists in index to handle overwriting
mv: check if out-of-cone file exists in index with SKIP_WORKTREE bit
mv: decouple if/else-if checks using goto
mv: update sparsity after moving from out-of-cone to in-cone
t1092: mv directory from out-of-cone to in-cone
t7002: add tests for moving out-of-cone file/directory
- Series “in-to-out” (in branch ‘next’)
GitHub branch
Mailing list
mv: check overwrite for in-to-out move
advice.h: add advise_on_moving_dirty_path()
mv: cleanup empty WORKING_DIRECTORY
mv: from in-cone to out-of-cone
mv: remove BOTH from enum update_mode
mv: check if <destination> is a SKIP_WORKTREE_DIR
mv: free the with_slash in check_dir_in_index()
mv: rename check_dir_in_index() to empty_dir_has_sparse_contents()
t7002: add tests for moving from in-cone to out-of-cone
git-rm
- Series “sy/sparse-rm” (in branch ‘master’)
GitHub branch
Mailing list
rm: integrate with sparse-index
rm: expand the index only when necessary
pathspec.h: move pathspec_needs_expanded_index() from reset.c to here
t1092: add tests for `git-rm`
git-grep
- Series “sy/sparse-grep” (in branch ‘seen’)
GitHub branch
Mailing list
builtin/grep.c: walking tree instead of expanding index with --sparse
builtin/grep.c: integrate with sparse index
builtin/grep.c: add --sparse option
Note that this report is intentionally brief, as all the information is better preserved in the corresponding commit messages. See the commits and their messages for more context and analysis.
git-mv
Situation
When I was trying to integrate git-mv with sparse index, I noticed some interesting behaviors while playing with the command. I realized that when moving a file across the cone boundary, namely from inside the cone to outside of it, or vice versa, Git did not check whether the resulting file complies with the sparse-checkout definition. For example, in the command git mv <f1> <f2>, if <f1> is out-of-cone and <f2> is in-cone, Git should check out the resulting file. In short, the compatibility between git-mv and sparse-checkout was not ideal.
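To make the “cone boundary” concrete, here is a minimal Python sketch of how a path could be classified as in-cone or out-of-cone, and how a move could then be labeled. It is only a simplified model of cone-mode sparse-checkout, not Git’s implementation; in_cone() and classify_move() are names made up for this illustration.

```python
from pathlib import PurePosixPath
from typing import Iterable


def in_cone(path: str, cone_dirs: Iterable[str]) -> bool:
    """Simplified cone-mode check (illustrative model, not Git's code).

    A file is treated as in-cone when it lives:
      * at the repository root,
      * anywhere below one of the cone directories, or
      * directly inside an ancestor directory of a cone directory.
    """
    parent = PurePosixPath(path).parent
    if parent == PurePosixPath("."):
        return True  # root-level files are always part of the cone
    for d in cone_dirs:
        cone = PurePosixPath(d)
        # Recursively included: the file sits in or below the cone directory.
        if cone == parent or cone in parent.parents:
            return True
        # Immediate files of the cone directory's ancestors are included too.
        if parent in cone.parents:
            return True
    return False


def classify_move(src: str, dst: str, cone_dirs: Iterable[str]) -> str:
    """Label a move as in-to-in, in-to-out, out-to-in, or out-to-out."""
    cone_dirs = list(cone_dirs)

    def side(p: str) -> str:
        return "in" if in_cone(p, cone_dirs) else "out"

    return f"{side(src)}-to-{side(dst)}"


if __name__ == "__main__":
    # Hypothetical layout: only folder1 is in the sparse-checkout cone.
    print(classify_move("folder2/a.txt", "folder1/a.txt", ["folder1"]))  # out-to-in
```

With only folder1 in the cone, moving folder2/a.txt to folder1/a.txt crosses the boundary from outside to inside, which is the case where the resulting file should end up checked out.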
What I’ve done
git-mv from out-of-cone to in-cone
Before this series, running git mv <f1> <f2>, with <f1> being out-of-cone and <f2> being in-cone, made Git error out because it could not locate <f1> in the working tree.
With these changes, when <f1> (either a regular file or a directory) is out-of-cone and the --sparse switch is used, Git checks the index to see whether <f1> exists there, moves it to the destination in the index, and checks out the file/directory to the worktree.
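The out-to-in steps above can be sketched with a toy model. This is not Git’s code: the index is modeled as a plain dict from path to an entry with a skip_worktree flag, the worktree as a dict from path to content, and mv_out_to_in() and MoveError are hypothetical names. Only single files are modeled here, and the error raised without --sparse is an assumption of the sketch.

```python
class MoveError(Exception):
    """Raised when the modeled move is refused."""


def mv_out_to_in(index, worktree, src, dst, sparse=False):
    """Toy model of an out-of-cone -> in-cone move (illustrative only)."""
    entry = index.get(src)
    if entry is None:
        raise MoveError(f"bad source: {src}")
    # An out-of-cone source is absent from the worktree, so it can only be
    # found through the index. In this model that requires --sparse.
    if entry["skip_worktree"] and not sparse:
        raise MoveError(f"{src} is outside the cone; retry with --sparse")
    if dst in index:
        raise MoveError(f"destination exists: {dst}")

    # 1. Move the entry in the index.
    del index[src]
    # 2. The destination is in-cone, so the skip-worktree bit is cleared.
    index[dst] = {"blob": entry["blob"], "skip_worktree": False}
    # 3. Check the file out to the worktree.
    worktree[dst] = entry["blob"]


if __name__ == "__main__":
    index = {"folder2/a.txt": {"blob": "hello", "skip_worktree": True}}
    worktree = {}
    mv_out_to_in(index, worktree, "folder2/a.txt", "folder1/a.txt", sparse=True)
    print(index, worktree)
```

The point of the model is the ordering: the lookup happens in the index rather than the worktree, and the checkout is the last step.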
git-mv from in-cone to out-of-cone
Before this series, running git mv <f1> <f2>, with <f1> being in-cone and <f2> being out-of-cone, was not possible, mainly because <f2> is a destination that does not exist in the worktree.
Changes are made so that such a move is possible. When the move is a clean move, i.e. <f1> has no unstaged changes, Git now moves <f1> to <f2> in the index, then deletes <f1> from the worktree and turns on its CE_SKIP_WORKTREE bit (as it is now out-of-cone). A dirty move should move <f1> to <f2> in both the worktree and the index, but should not remove the resulting path from the working tree and should not turn on its CE_SKIP_WORKTREE bit.
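The difference between a clean and a dirty in-to-out move can be shown with a small sketch under the same kind of toy model: the index is a dict from path to an entry with a skip_worktree flag, the worktree is a dict from path to content, and “dirty” simply means the worktree content differs from the indexed blob. The function name mv_in_to_out() is hypothetical and this is not how Git implements it.

```python
def mv_in_to_out(index, worktree, src, dst):
    """Toy model of an in-cone -> out-of-cone move (illustrative only).

    Returns True when the move was dirty, False when it was clean.
    """
    entry = index[src]
    # "Dirty" in this model: the file on disk has unstaged changes.
    dirty = worktree[src] != entry["blob"]

    # Both kinds of move rename the entry in the index.
    del index[src]
    content = worktree.pop(src)

    if dirty:
        # Keep the modified file on disk at the new path and leave the
        # skip-worktree bit off so the local changes stay visible.
        worktree[dst] = content
        index[dst] = {"blob": entry["blob"], "skip_worktree": False}
    else:
        # Clean move: nothing to preserve on disk, so the path is gone from
        # the worktree and the entry is marked skip-worktree.
        index[dst] = {"blob": entry["blob"], "skip_worktree": True}
    return dirty


if __name__ == "__main__":
    index = {"folder1/a.txt": {"blob": "v1", "skip_worktree": False}}
    worktree = {"folder1/a.txt": "v1 plus local edits"}
    print(mv_in_to_out(index, worktree, "folder1/a.txt", "folder2/a.txt"))
    print(index, worktree)
```

Keeping the dirty file on disk means unstaged work is never silently dropped just because the destination is outside the cone.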
Results
git-mv can now handle out-to-in and in-to-out moves and is more compatible with sparse-checkout.
There are still some leftover bits for git-mv:
- out-to-out move. What happens when both <f1> and <f2> are out-of-cone locations? I honestly don’t know, but this is the next question I’ll dig into.
- sparse index integration. After the out-to-out move, we can try to integrate git-mv with sparse index. It’ll be interesting to see how the new cross-cone logic interacts with sparse index.
git-rm
Situation
git-rm was already relatively compatible with sparse-checkout when I started looking at it, so the integration with sparse index began without much hindrance.
What I’ve done
Before my patches, git-rm expanded a sparse index unconditionally, which is expensive. I reused an existing function, pathspec_needs_expanded_index(), to determine whether a pathspec needs a full index. Now, when the pathspec does not match anything outside of the sparse-checkout definition, Git does not expand the index unnecessarily.
Because pathspec_needs_expanded_index() seems to be a good candidate for reuse elsewhere, I extracted it into a public function for future use.
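The idea of expanding only when necessary can be sketched as follows. This is a deliberately simplified Python model and not the logic of Git’s pathspec_needs_expanded_index(), which covers more cases; the function names and the representation of sparse directory entries as strings ending in "/" are assumptions made for illustration.

```python
def literal_prefix(pathspec: str) -> str:
    """Return the part of the pathspec before its first glob character."""
    for i, ch in enumerate(pathspec):
        if ch in "*?[":
            return pathspec[:i]
    return pathspec


def needs_expanded_index(pathspecs, sparse_dirs) -> bool:
    """Simplified model: could any pathspec reach *inside* a sparse directory?

    sparse_dirs holds collapsed out-of-cone directories, e.g. ["folder2/"].
    """
    for spec in pathspecs:
        prefix = literal_prefix(spec)
        has_wildcard = len(prefix) < len(spec)
        for sdir in sparse_dirs:
            if has_wildcard:
                # A wildcard may match a subset of a sparse directory when the
                # literal prefix points into it or is an ancestor of it.
                if prefix.startswith(sdir) or sdir.startswith(prefix):
                    return True
            elif spec.startswith(sdir) and spec != sdir:
                # A literal path strictly inside a sparse directory is only
                # visible after expansion.
                return True
    return False


if __name__ == "__main__":
    sparse_dirs = ["folder2/"]  # hypothetical collapsed directory
    print(needs_expanded_index(["folder1/a.txt"], sparse_dirs))  # False
    print(needs_expanded_index(["folder2/a.txt"], sparse_dirs))  # True
```

The model errs on the side of expanding: whenever a wildcard might touch a sparse directory, it answers True, trading some speed for correctness.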
After testing, I turned off command_requires_full_index for git-rm, marking this command as compatible with sparse index.
Result
The p2000 tests demonstrate a ~92% execution time reduction for git rm using a sparse index.
git-grep
Situation
Unlike the previous commands, git-grep did not have a --sparse option at the time. Also, unlike git-mv or git-rm, git-grep does not write to the index or worktree; it simply retrieves information from the repo. Like git-rm, git-grep seemed to be practically compatible with sparse-checkout, so we could start the sparse index integration promptly.
What I’ve done
The first thing I did was to add a --sparse option to the command, so we can decide when to grep “sparse contents”.
After adding the new --sparse option, I added more tests to make sure the sparse index expands only when using --sparse and that the behavior is as expected.
Using --sparse to switch the sparse index on and off is an easy optimization. When using --sparse, the index expands as it did before this integration, which, as noted by my mentor, could be further optimized by implementing custom tree-walking logic.
I implemented tree-walking logic that walks trees according to a given pathspec, so Git is faster when the pathspec covers an area smaller than the HEAD tree.
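Why walking trees by pathspec helps can be illustrated with a small sketch. It is not the code from the patch: trees are modeled as nested dicts, the pathspec is reduced to a plain directory prefix such as "f2/f1/f1/" (an assumption, real pathspecs are richer), and grep_tree() is a hypothetical name. The function also counts how many trees it visits, to show the pruning.

```python
def grep_tree(tree, needle, pathspec_dir, base=""):
    """Search blobs below pathspec_dir, skipping unrelated subtrees.

    tree: nested dict, where a dict is a subtree and a str is blob content.
    Returns (matching_paths, number_of_trees_visited).
    """
    hits = []
    visited = 1  # count the tree we are currently in
    for name, node in sorted(tree.items()):
        path = base + name
        if isinstance(node, dict):
            subdir = path + "/"
            # Descend only if this subtree is on the way to the pathspec
            # directory or already lies inside it.
            if subdir.startswith(pathspec_dir) or pathspec_dir.startswith(subdir):
                sub_hits, sub_visited = grep_tree(node, needle, pathspec_dir, subdir)
                hits += sub_hits
                visited += sub_visited
        elif path.startswith(pathspec_dir) and needle in node:
            hits.append(path)
    return hits, visited


if __name__ == "__main__":
    # Hypothetical repository content, loosely shaped like a nested test repo.
    repo = {
        "a": "bogus",
        "f1": {"x": "bogus line"},
        "f2": {
            "f1": {
                "f1": {"a": "has bogus", "b": "nothing"},
                "f2": {"a": "bogus"},
            },
            "y": "bogus",
        },
    }
    print(grep_tree(repo, "bogus", "f2/f1/f1/"))
```

In this toy repository there are six trees in total, but the walk only enters the ones on the path to f2/f1/f1/, which is where the savings for a narrow pathspec come from.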
Result
Without --sparse, the integration demonstrates a ~99.4% execution time reduction for git grep using a sparse index.
The tree-walking logic also adds some speed (depending on the pathspec) to the operation. For example, running git grep --cached --sparse bogus -- "f2/f1/f1/*" in the p2000 test setup repo gives a ~71% execution time reduction using this tree-walking logic.
Closing remarks
I was going to write about what I’ve learned along the way, but then I realized that there is too much to list. I hadn’t written a single line of C or Bash before GSoC, and now I can gradually catch up with some of the work in one of the most influential communities.
If I have to put something down, I’ll say, always try to learn actively and work collaboratively.
During these four months, every day of work added something new to my knowledge base. Learning the Git codebase and the tech stack it uses was like drinking from a fire hose for me. Every change required researching, referencing, and asking myriad questions. So I realized that active learning is key to understanding the work and continuing to make progress.
In a community, communication and collaboration can be more important than grinding through hard problems and being a hero. Discussing with fellow developers and exchanging ideas are crucial to delivering satisfactory work. More often than not, good or even great ideas come from bringing everybody’s ideas together. When I was stuck, I learned to ask my mentors for help more actively, instead of hitting against hard rock and getting nowhere.
More importantly, I’ll maintain my passion for open source technologies and be active as a young contributor. I’ll also try to be helpful to the community and people who are also interested in the open source world.
Thank you, my mentors, Victoria and Derrick. Thank you, Git and the community. Thank you, GSoC.
Thanks, Shaoxuan Yuan