There are problems that require specific solutions, and problems that require generalizable solutions. This article is aimed at the latter, using Python as an example. When building out large systems, it's important to keep in mind the DRY principle: Don't Repeat Yourself. Repetition in a codebase can turn a simple change into a tangled, error-prone mess, commonly known as spaghetti code. One way to reduce repetition within a single codebase is to modularize functionality: by decomposing repetitious code into reusable functions, an update to a single function replaces the arduous task of updating code in distant corners of the codebase. But what if you're building something bigger, and shared code exists beyond a single codebase?
Enter packages. Every major programming language has some mechanism through which code can be shared and reused by different people: Ruby has Gems, Node has npm, and Python has PyPI. Packages reduce the effort required to compose software by modularizing pre-built functionality that can be imported and used in any codebase. If you wanted to write a Python program that graphs a dataset, you certainly could write your own library to facilitate this, but that might not be the best use of your time. Graphing is a common problem, and common problems often already have solutions; the path of least resistance is to import an existing library (such as matplotlib) and use it for your graphing needs (unless, of course, your needs require a level of complexity and sophistication that existing packages can't provide, in which case a more custom solution is required).
Writing Custom Modules
There are many reasons to write custom modules: you might want to package functionality that would be useful across your organization's codebases, you might want to open source a solution that worked well for a particular problem you solved, or you might just be curious about how programmers share code. Whatever the case, mature languages will almost certainly have the facilities to share code. In fact, I'd argue that any moderately useful language should have some kind of package management solution; it's impossible to write code efficiently if, every time you want to do something as common as printing to stdout, you have to write the supporting functionality yourself. Of course, this is coupled with the question of how extensive a language's standard library should be. That discussion is beyond the scope of this article, but I point the curious reader to Guy Steele's fantastic talk on growing a language.
Exploring Package Management in Python
Note: I use the terms package and module interchangeably. Strictly speaking, a package is a collection of modules, whereas a module is a single file. For simplicity, I illustrate examples with a single module, but the same applies to packages in general.
With sufficient motivation for shared packages, we can now look at the tools within Python that make this magic possible.
pip is the utility that allows you to install packages in Python. There's no shared consensus on what the name actually stands for ("pip installs Python"? "pip installs packages"? or some other uninspired variation), but regardless, usage is incredibly simple. Continuing our example above with matplotlib, we can install the package by simply running:
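For example, from the command line:

```shell
pip install matplotlib
```
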
Note: Some python installations might not come with pip out-of-box. From the pip documentation:
pip is already installed if you are using Python 2 >=2.7.9 or Python 3 >=3.4 downloaded from python.org or if you are working in a Virtual Environment created by virtualenv or pyvenv. Just make sure to upgrade pip.
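Upgrading pip itself is a one-liner:

```shell
python -m pip install --upgrade pip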
Now we know how to install packages, but where do those packages come from?
PyPI: The Python Package Index
The Python Package Index (PyPI) is the most popular public index of Python packages. Anybody can publish to or install from PyPI. If your shared code is sensitive (for example, shared code that handles auth within an organization), there are excellent private options, such as JFrog Artifactory or AWS CodeArtifact.
pip defaults to using the PyPI index when installing packages, which you can confirm by observing the output of the following command:
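One way to see this is to run an install with increased verbosity; the verbose log shows the URLs pip resolves packages against, which point at pypi.org by default:

```shell
# The verbose log shows the URLs pip fetches from, which
# point at https://pypi.org/simple by default
pip install matplotlib -v
```
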
If you're using a private option like Artifactory, you need to create a configuration file that tells pip to use Artifactory when resolving packages.
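A minimal sketch of such a configuration (the Artifactory URL below is a hypothetical placeholder; the file typically lives at `~/.pip/pip.conf` or `~/.config/pip/pip.conf` on Linux/macOS):

```ini
# pip.conf -- index URL below is a hypothetical placeholder
[global]
index-url = https://mycompany.jfrog.io/artifactory/api/pypi/pypi-local/simple
```
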
Writing a Python Module
Writing a module in Python to be shared across files in a single codebase is as simple as defining a function in one file and importing it into another. Suppose I define the following function:
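A minimal stand-in for the function, assuming a file named `my_cool_function.py` (both the filename and the body are illustrative placeholders):

```python
# my_cool_function.py -- filename and body are illustrative placeholders
def my_cool_function(x):
    """A stand-in for genuinely useful shared logic."""
    return x * x
```
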
I can import this function in another file as easily as:
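A sketch of the consuming file, assuming the function lives in a sibling module named `my_cool_function.py` (a placeholder name); the demo writes the module out first so it is self-contained, but in practice the file would simply sit next to yours:

```python
import pathlib
import sys

# For a self-contained demo, write the module file out first;
# normally my_cool_function.py would already exist alongside this file.
pathlib.Path("my_cool_function.py").write_text(
    "def my_cool_function(x):\n    return x * x\n"
)
sys.path.insert(0, ".")

# This is the import a consumer file would write:
from my_cool_function import my_cool_function

print(my_cool_function(4))  # prints 16
```
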
Suppose `my_cool_function` is so cool and useful that you feel the need to share it with the world: you want to publish your module to PyPI. The difficult part is writing the module; actually publishing it is seamless once you have the proper infrastructure in place.
Publishing a Python Module
The base infrastructure for a module is the module code itself plus a few additional files. The basic module structure should look something like this:
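A minimal sketch of such a layout, based on the files described below (the top-level directory name is a placeholder):

```
my_cool_function/
├── src/
│   ├── __init__.py
│   └── my_cool_function.py
├── requirements.txt
├── requirements-dev.txt
├── README.md
├── MANIFEST.in
└── setup.py
```
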
The actual source code lives in `src/`. Along with the file containing our code, we also have an `__init__.py`; this file lets Python know the directory is a module directory. This is a contrived example: in reality, your module may have many more files in the `src` directory that support your core module. (Note: the `src/` directory can actually be named whatever you desire; all that matters is that you properly identify the root directory of your source files in your `setup.py`, which we'll get to in a moment.)
`requirements.txt` and `requirements-dev.txt` are standard within Python projects; they contain your project's dependencies (the split between a regular and a dev file is up to the programmer, but there are always benefits to extra decomposition). Think of these files as listing the libraries any end user would need to install in order to properly use your library (when the module is pip installed, these dependencies are downloaded automatically by pip).
`README.md` is standard across most codebases (not just Python); it should contain a description of your module and whatever other information you want to include.
Finally, the files that provide the backbone of your module are `MANIFEST.in` and `setup.py`. `MANIFEST.in` specifies which additional files to include as part of your module (beyond your source files), and `setup.py` is the actual module configuration: it contains all of the metadata and information required to properly package your module for whatever index you want to upload to. An example for us might look something like this:
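A sketch of what `setup.py` might look like for our module; the version, description, and the convention of reading dependencies out of the requirements files are assumptions for illustration:

```python
# setup.py -- illustrative sketch; version and description are placeholders
from setuptools import setup

with open("README.md") as f:
    long_description = f.read()

with open("requirements.txt") as f:
    requirements = f.read().splitlines()

with open("requirements-dev.txt") as f:
    dev_requirements = f.read().splitlines()

setup(
    name="my_cool_function",  # the name used in `pip install my_cool_function`
    version="0.1.0",
    description="A small module that shares my_cool_function",
    long_description=long_description,
    long_description_content_type="text/markdown",
    packages=["src"],  # root directory of the source files
    install_requires=requirements,  # regular dependencies
    extras_require={"dev": dev_requirements},  # dev-only dependencies
)
```

The accompanying `MANIFEST.in` would then contain lines like `include README.md`, `include requirements.txt`, and `include requirements-dev.txt` so those files ship with the package.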
The above example is configured to include the `README.md` file along with the regular and dev dependencies in the package definition. Most of this should be self-explanatory, but of note is the `name` argument to the `setup` function, which is the actual name of the package (i.e., you would install this module by running `pip install my_cool_function`). The `packages` argument specifies the location of the source files, in this case `src`. The rest of the options are metadata for the package; for an extended explanation of each, check out the docs.
With the proper infrastructure in place, the final step is to actually upload and publish the package. To do so with PyPI, you need to create an account at https://pypi.org/. Once you have an account, you'll need to generate an API token, which ensures your uploads are secure.
To actually publish your package, first generate the distribution archives, then upload them:
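One standard approach uses the `build` and `twine` packages (run from the module's root directory; twine will prompt for your PyPI credentials or API token):

```shell
# Build source and wheel archives into dist/
python -m pip install --upgrade build twine
python -m build

# Upload the archives to PyPI
python -m twine upload dist/*
```
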
You can also verify that your package will upload properly by using TestPyPI, a separate index meant to facilitate testing module uploads. The Python documentation has an excellent tutorial on this exact process: https://packaging.python.org/tutorials/packaging-projects/.
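Assuming you've built your archives into `dist/` with twine installed, pointing an upload at TestPyPI is a one-flag change:

```shell
python -m twine upload --repository testpypi dist/*
```
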
Now that your package is published, you can easily pip install it:

```shell
pip install my_cool_function
```
Maintaining a Module
When maintaining a Python module (or any code, for that matter), you should automate as much of the work as possible. While we can't really automate writing a module (at least not entirely), we can automate its integration and deployment. If you aren't practicing rigorous Test Driven Development (TDD), the long-term correctness of your code depends on automated testing. A discussion of effective testing is beyond the scope of this article, but there are lots of ways to test your code; my recommended solution is `pytest`.
In the spirit of developing shared code, I have a starter repo called `module_starter_cli` that comes preconfigured with all the necessary infrastructure to automate testing via `pytest`, versioning and publishing through `python-semantic-release`, and deployment through CircleCI. Forking this repo for downstream module projects lets you run automated tests on every branch for every pull request, and it will automatically version your code and handle the hassle of deployment, creating a seamless, low-friction development cycle. I can go into detail on this setup and the philosophy behind shared infrastructure code in a future article. Till then, I hope this article helps your solutions become a little more generalizable.