Go is a new, open source programming language "that makes it easy to build simple, reliable, and efficient software" with excellent concurrency primitives. The same strengths that have made Go a Java/C++ replacement at Google make it an excellent choice for modern scientific programming.

To demonstrate how effective Go is at general-purpose scientific programming, we will use find_circ as a case study: a Python package that I have adapted from Memczak et al. 2013 and that is available under the GPL. I have been using this package to investigate the role of circular RNAs in plants and most recently in mutagenic cell lines, but I frequently bumped into the limitations of Python in the scientific world, particularly for replication and speed. To start out, we will compare the entry point of the application: the command line interface.

Parsing Command Line Arguments

In Python, there are two packages in the standard library for parsing command line arguments: optparse and argparse. Although optparse has been deprecated, many Python utilities still use it for the CLI, including find_circ. These libraries are a huge step above manually parsing command line arguments. Go, on the other hand, provides the flag package in the standard library. Both languages also have numerous third-party packages that offer command line parsing, but these are used sparingly.

Adding CLI flags

The fundamental unit of passing arguments into a command line program is a flag or option. Nearly all CLI programs have a convention of having an explicit long-form of the option behind double-dashes (ex: --anchor), and sometimes offering a short, single character option behind a single dash (ex: -a). Short options can be combined, so -ab is equivalent to -a -b.

In our find_circ program, we need to have an option to get the anchor size desired, with a sane default of 20. In Python with optparse, we can do this by initializing a parser and adding an option to it with the add_option method.

from optparse import OptionParser

parser = OptionParser()
parser.add_option("-a", "--anchor", dest="asize", type=int,
                  default=20, help="anchor size (default=20)")

This is fairly simple and idiomatic, though as you can see there are some oddities that we'd like to avoid.

  1. Although the default is defined as a keyword argument, it is not added to the help text automatically: we have to repeat it by hand (or remember to write %default) in the help string.
  2. dest uses a string for the desired variable name rather than a symbol, so most syntax highlighting and IDE features will not apply here.

Let's see how Go compares in this case. Right off the bat, there's a huge difference: Go's flag package has no separate short and long forms. Every flag has a single name, usually written behind a single dash (ex: -anchor), although the package also accepts a double dash (--anchor). Single-character flags cannot be combined, so -ab is read as one flag named ab. This is inconsistent with the conventions used in modern GNU/BSD systems and may take some getting used to. The plus side is that each option has exactly one name, which makes command lines more uniform and easier to read. Now, to the code:

import "flag"

var anchor = flag.Int("anchor", 20, "anchor size")

That was easy! Notice that Go does not have keyword arguments, but the parameter names are shown in the documentation and by any good text editor. There is much less duplication here: the result is stored in the variable anchor (a *int that is filled in once flag.Parse is called), and the default is automatically appended to the usage string.
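To see the flag in the context of a whole program, here is a minimal sketch. The helper parseAnchor and the private flag.FlagSet are my own illustrative choices and are not part of find_circ; a FlagSet behaves like the package-level flag functions but can be handed any argument list, which makes it easier to experiment with.

```go
package main

import (
	"flag"
	"fmt"
	"os"
)

// parseAnchor is a hypothetical helper (not part of find_circ). It parses
// args with its own FlagSet and returns the requested anchor size.
func parseAnchor(args []string) (int, error) {
	fs := flag.NewFlagSet("find_circ", flag.ContinueOnError)
	anchor := fs.Int("anchor", 20, "anchor size")
	if err := fs.Parse(args); err != nil {
		return 0, err // the FlagSet has already written a message to stderr
	}
	return *anchor, nil // fs.Int returns a *int, so dereference it
}

func main() {
	size, err := parseAnchor(os.Args[1:])
	if err != nil {
		os.Exit(2)
	}
	fmt.Println("anchor size:", size)
}
```

Running this sketch with no arguments should report the default of 20, while -anchor 25 and --anchor=25 are treated identically. A value that is not an integer makes Parse return an error instead of a size.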

Customizing Usage

A huge issue that needs to be addressed in command-line programs is the user experience, or UX. Many CLI programs are difficult to use and understand, especially one-off scientific scripts that are undocumented.

Luckily, both optparse and flag provide a default help message that is printed with the -h flag or when an argument fails validation. However, this message often needs to be customized further. First, let's start with the default Go usage message.

Usage of find_circ:
  -anchor int
    	anchor size (default 20)
  ...

This is quite handy: each flag is given with the desired type, as well as the default. However, most bioinformatics programs are intended to be used in a larger pipeline, find_circ included. Thus, a customized usage string is a huge boon to help new users learn how to actually use the program. In the original Python version of find_circ, the usage string is customized through the usage keyword argument of OptionParser, as you can see below.

usage = """

  bowtie2 anchors.qfa.gz | %prog > candidates.bed 2> candidates.reads

"""

parser = OptionParser(usage=usage)

Let's try to work this into our Go program. In Go, the flag package is never explicitly initialized, but it has a flag.Usage variable that can be customized in the init function, which runs at program startup after all package-level variables have been initialized. Thus, we can closely emulate the behavior of the Python program through a few print statements as below:

// Requires the "flag", "fmt", and "os" packages.
func init() {
	flag.Usage = func() {
		fmt.Fprintln(os.Stderr, "Usage:")
		fmt.Fprintf(os.Stderr, "  bowtie2 anchors.qfa.gz | %s > candidates.bed 2> candidates.reads\n\n", os.Args[0])
		fmt.Fprintln(os.Stderr, "Options:")
		flag.PrintDefaults()
	}
}

Now, the output of find_circ -h is our customized usage message.

Usage:
  bowtie2 anchors.qfa.gz | ./find_circ > candidates.bed 2> candidates.reads

Options:
  -anchor int
    	anchor size (default 20)

It is hard to decide which approach is preferred in this case: Python does the right thing with the input string, and is far easier to implement. However, the Go program is much more explicit and can be customized even further to add additional sections of documentation in the usage message.

Alternatives

In case the Go flag package is insufficient for your needs, there are countless Go packages that parse command line arguments. I personally have never needed to use any CLI libraries so I cannot recommend one over the other, but such a package is especially useful if backward compatibility with an existing program's options is needed.
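Before reaching for a third-party package, it is worth noting that the standard library can already cover one common compatibility need: a short alias for a longer flag name. The sketch below registers two names that share one variable. The helper parseWithAlias and the variable asize (borrowed from the dest name in the Python version) are only illustrative.

```go
package main

import (
	"flag"
	"fmt"
	"os"
)

// parseWithAlias is a hypothetical helper showing how two flag names can
// share one variable, so that -a and -anchor set the same value.
func parseWithAlias(args []string) (int, error) {
	var asize int
	fs := flag.NewFlagSet("find_circ", flag.ContinueOnError)
	fs.IntVar(&asize, "anchor", 20, "anchor size")
	fs.IntVar(&asize, "a", 20, "anchor size (shorthand for -anchor)")
	if err := fs.Parse(args); err != nil {
		return 0, err
	}
	return asize, nil
}

func main() {
	asize, err := parseWithAlias(os.Args[1:])
	if err != nil {
		os.Exit(2)
	}
	fmt.Println("anchor size:", asize)
}
```

This only emulates the short form. Combined short options such as -ab are still not understood by the flag package, which is where a dedicated CLI library would come in.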

Compiling

One property of Go that is both an advantage and a disadvantage in comparison with interpreted languages like R, Python, MATLAB, etc. is that it is compiled. Luckily, the Go tool features a go run command that builds and runs .go source files in one step, without a separate compile step each time, but the final project should be compiled using go build or go install to take advantage of all that the compiler offers.

The result of go build or go install is typically a statically linked binary that can be copied to any machine with the same operating system and architecture and run immediately without installing Go or any other libraries: it is completely self-contained. Furthermore, building for other platforms (such as developing on a Windows machine for a Linux server cluster) is extremely simple with the Go toolchain. This cross-compilation step is not needed in Python since it is interpreted, but one then needs to install Python on every machine and maintain version compatibility across the board.

Conclusion

One misconception about compiled, statically typed languages such as Go is that they must be more verbose than Python, R, Perl, or MATLAB, which are frequently used in bioinformatics. However, I have found that for most CLI programs, I am able to write concise programs in Go with a similar level of clarity to Python and without sacrificing performance. This is the first of a series of articles on using Go as a data science programming language. For this first round, it is clear that Go is the winner: the syntax is cleaner and easier, and having the option of running in-place or compiling a static binary is a definite win over the other languages compared.