Even though GitHub tries to provide enough storage for Git repositories, it imposes limits on file and repository sizes to ensure that repositories are easy to work with and maintain, as well as to ensure that the platform keeps running smoothly.
Individual files added via the browser interface are restricted to 25 MB, while those pushed from the command line are restricted to 100 MB. GitHub warns about files larger than 50 MB and blocks pushes that contain files larger than 100 MB. Repositories, on the other hand, have no hard cap, but GitHub recommends keeping them under 1 GB and strongly recommends staying under 5 GB.
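As a quick local check before pushing, the per-file limit can be tested with a small shell helper. This is a minimal sketch: the function name find_large_files is made up for illustration, and the usage line below assumes the 100 MB limit means 100 × 1024 × 1024 bytes.

```shell
# Hypothetical helper: print regular files under a directory whose size
# exceeds a byte threshold. The .git directory itself is skipped.
find_large_files() {
    dir=$1
    limit_bytes=$2
    find "$dir" -type d -name .git -prune -o \
        -type f -size +"${limit_bytes}"c -print
}
```

For example, find_large_files . $((100 * 1024 * 1024)) lists the files in the current project that would exceed the command-line limit if committed.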
In this article, we’ll go over situations that can contribute to large repositories and consider possible workarounds—such as Git Large File Storage (LFS).
The Root of Large Repositories
Let’s cover a few common activities that can result in particularly large Git files or repositories.
Backing Up Database Dumps
Database dumps are usually large SQL files containing an export of a database’s contents, which can be used to either replicate or back up that database. Developers upload database dumps alongside their project code to Git and GitHub for two reasons:
To keep the state of data and code in sync
To enable other developers who clone the project to easily replicate the data for that point in time
This is not recommended: every new dump can add another large file to the history, so the repository grows quickly. GitHub advises using storage tools like Dropbox instead.
External Dependencies
Developers usually use package managers like Bundler, Node Package Manager (npm), or Maven to manage external project dependencies or packages.
But mistakes happen every day, so a developer could forget to gitignore such modules and accidentally commit them to Git history, which would bloat the total size of the repository.
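If a dependency directory has already been committed by mistake, one way to stop tracking it is sketched below. The function name untrack_dir is hypothetical, and the sketch assumes it is run from the root of the work tree.

```shell
# Hypothetical helper: stop tracking a directory that was committed by mistake.
# Assumes the current directory is the root of a Git work tree.
untrack_dir() {
    dir=$1
    # Add the directory to .gitignore unless the same entry is already there.
    grep -qxF "$dir/" .gitignore 2>/dev/null || printf '%s/\n' "$dir" >> .gitignore
    # Remove it from the index only; the files stay on disk.
    git rm -r --cached --quiet -- "$dir"
    git add .gitignore
}
```

After running untrack_dir node_modules and committing, the directory stays out of future commits. Commits that already contain it still carry its size.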
Other Large Files
Aside from database dumps and external dependencies, other types of files can also bloat a repository:
Large media assets: Avoid storing large media assets in Git. Consider using Git LFS (see below for more details) or Git Annex, which allows you to version your media assets in Git while storing them outside your repository.
File archives or compressed files: Different versions of such files don’t delta well against each other, so Git can’t store them efficiently. It would be better to store the individual files in your repository or store the archive elsewhere.
Generated files (such as compiler output or JAR files): It would be better to regenerate them when necessary or store them in a package registry or even a file server.
Log and binary files: Committing logs, compiled code, or prepackaged releases to your repository can bloat it quickly.
Solution 1: Remove Large Files from Repository History
If you find that a file is too large, one short-term solution is to remove it from your repository. git-sizer, a repository analyzer that computes size-related statistics about a repository, can help you identify such files. But simply deleting the file is not enough. You also have to remove it from the repository’s history.
A repository’s history is a record of the state of the files and folders in the repository at different times when a commit was made.
As long as a file has been committed to Git/GitHub, simply deleting it and making another commit won’t work. This is because when you push something to Git/GitHub, they keep track of every commit to allow you to roll back to any place in your history. For this reason, if you make a series of commits that adds and then deletes a large file, Git/GitHub will still store the large file, so you can roll back to it.
What you need to do is amend the history to make it seem to Git/GitHub that you never added the large file in the first place.
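Before rewriting anything, it can help to see which files in the history account for most of the size, including files that have since been deleted. The following is a minimal sketch using standard Git plumbing commands; the function name largest_blobs is made up for illustration.

```shell
# Hypothetical helper: list the largest blobs anywhere in history as
# "<size in bytes> <path>", biggest first. Run inside a Git repository.
largest_blobs() {
    count=${1:-10}
    git rev-list --objects --all |
        git cat-file --batch-check='%(objecttype) %(objectsize) %(rest)' |
        awk '$1 == "blob" { sub(/^blob /, ""); print }' |
        sort -rn |
        head -n "$count"
}
```

Calling largest_blobs 5 prints the five largest file versions reachable from any ref, which is usually enough to spot the file that needs to be removed.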
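If the file was just added in your last commit before the attempted push, you’re in luck. You can remove the file from the index and then amend that commit with the following commands. The --cached flag leaves the file on disk and only stops Git from tracking it:
git rm --cached csv_building_damage_assessment.csv
git commit --amend -C HEAD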
But if the file was added in an earlier commit, the process will be a bit longer. You can either use the BFG Repo-Cleaner or you can run git rebase or git filter-branch to remove the file.
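For the git filter-branch route, a rewrite that drops one path from every commit might look like the sketch below. The function name purge_from_history is hypothetical, and the sketch assumes a clean working tree and a path without single quotes. Git’s own documentation now steers users toward other tools for this job, so treat it as illustrative only and work on a backup.

```shell
# Hypothetical helper: remove one file from every commit on every ref.
# Assumes a clean working tree and a path that contains no single quotes.
purge_from_history() {
    path=$1
    FILTER_BRANCH_SQUELCH_WARNING=1 git filter-branch --force \
        --index-filter "git rm --cached --ignore-unmatch --quiet -- '$path'" \
        --prune-empty -- --all
}
```

After a call such as purge_from_history csv_building_damage_assessment.csv, the rewritten branches have to be force-pushed, and anyone with an existing clone needs to re-sync with the new history.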
Solution 2: Creating Releases to Package Software
As mentioned earlier, one of the ways that repos can get bloated is by distributing compiled code and prepackaged releases within your repository.
Some projects require distributing large files, such as binaries or installers, in addition to distributing source code. If this is the case, instead of committing them as part of the source code, you can create releases on GitHub. Releases allow you to package software release notes and links to binary files for other people to use. Be aware that each file included in a release must be under 2 GB.
Solution 3: Version Large Files With Git LFS
The previous solutions have focused on how to avoid committing a large file or how to remove it from your repository. What if you want to keep it? Say you’re trying to push psd.csv, and you get the file too large error. That’s where Git LFS comes to the rescue.
Git LFS lets you push files that are larger than GitHub’s file size limit. It does this by storing a reference to the file in the repository instead of the file itself. In other words, Git LFS creates a small pointer file that acts as a reference to the actual file, which is stored somewhere else. The pointer file is what gets versioned in the repository, and whenever you clone the repository, the Git LFS client uses it as a map to find and download the large file for you.
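To make the pointer idea concrete, the sketch below prints text in the shape of a Git LFS pointer for a given file: a spec version line, the SHA-256 of the content, and the size in bytes. The function name lfs_pointer is made up for illustration, the sketch assumes the sha256sum utility is available, and in practice Git LFS generates these files itself.

```shell
# Hypothetical helper: print pointer-style text for a file, following the
# three-line layout of a Git LFS pointer (version, oid, size).
# Assumes sha256sum is installed (GNU coreutils).
lfs_pointer() {
    file=$1
    oid=$(sha256sum "$file" | cut -d' ' -f1)
    size=$(wc -c < "$file" | tr -d ' ')
    printf 'version https://git-lfs.github.com/spec/v1\n'
    printf 'oid sha256:%s\n' "$oid"
    printf 'size %s\n' "$size"
}
```

The output stays at three short lines however large the file is, which is why the repository itself stays small.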
Git LFS is ideal for managing large files such as audio samples, videos, datasets, and graphics.
To get started with Git LFS, follow these steps:
Download the version of Git LFS that matches your device’s OS and install it.
Set up Git LFS for your account by running git lfs install.
Select the file types that you want Git LFS to manage using the command git lfs track "<file extension or filename pattern>" (for example, git lfs track "*.csv"). This will create a .gitattributes file.
Add the .gitattributes file to the staging area using the command git add .gitattributes.
Commit and push just as you normally would.
Please note that the above method works only for files that were not previously tracked by Git. If you already have a repository with large files tracked by Git, you need to migrate those files from Git tracking to Git LFS tracking. Because the migration rewrites the commits that contain those files, the affected branches have to be force-pushed afterward. Run the following command:
git lfs migrate import --include="<files to be tracked>"
With Git LFS now enabled, you’ll be able to fetch, modify, and push large files. However, if collaborators on your repository don’t have Git LFS installed and set up, they won’t have access to those files. Whenever they clone your repository, they’ll only get the pointer files.
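To get things working properly, they need to install Git LFS and then clone the repo, just like they would any other repo. Then, to download the latest Git LFS objects from GitHub, run:
git lfs fetch origin master
Fetching only downloads the objects. Running git lfs checkout afterward, or using git lfs pull, which does both steps, replaces the pointer files in the working tree with the real content.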
Conclusion
GitHub does not work well with large files but with Git LFS, that can be circumvented. However, before you make any of these sensitive changes, like removing files from Git/GitHub history, it would be wise to back up that GitHub repository first. One wrong command and files could be permanently lost in an instant.
When you back up your repositories with a tool like BackHub (now part of Rewind), you can easily restore backups directly to your GitHub or clone to your local machine if anything should go wrong.