It would be unfair to teach an advanced bioinformatics class without getting into the weeds of command line interfaces (CLIs): heading, tailing, grepping, and wc-ing files, piping programs together, and running a bioinformatics program and working with its output. Hence, the final component of the immuno-bioinformatics class that we [Digital World Biology] are developing for Shoreline Community College will use cloud computing in the CyVerse environment to learn some Unix and run the bioinformatics program IgBLAST on a publicly available set of immunoprofiling data.
As noted in previous blogs, we’d like to have students gain some experience working as bioinformaticians. While there are many web-servers and software applications available for analyzing DNA sequence data without writing scripts or running the core programs, it is worthwhile for students to experience the power of Unix (Linux) and use a CLI to run a bioinformatics program and work with its inputs and outputs.
And, those who enjoy the experience can further develop their data science skills. As they enter the workforce, they may find those skills in demand and find opportunities for high-paying jobs. Moreover, being able to explain how data are processed strengthens students' understanding of how web-based data analysis systems and software applications work, which will be an asset in any modern biotechnology job.
Previously, I discussed my adventures associated with preparing the exercise. To work within the time frame of class periods, we need a reasonable example dataset that is small enough to be processed quickly and yet illustrates the core elements of immunoprofiling. The goal of that adventure was to create a Virtual Machine (VM) preloaded with sequence analysis tools, an immunoprofiling dataset, IgBLAST, and the appropriate reference data, so that we can focus on working with data instead of installing software. During the class, students will create and work in their own copies of instances launched from this preloaded VM.
For cloud computing, we will use the National Science Foundation (NSF) funded CyVerse cyberinfrastructure. CyVerse provides a cloud computing platform via its Atmosphere environment and is free for academics and students. Before the class exercise, students will obtain CyVerse and Atmosphere accounts. Using their accounts, they will log in to CyVerse, launch Atmosphere, launch their instance, and then log in to their instance via Atmosphere’s web shell and desktop interfaces in their web browsers.
In terms of basic Unix, students will use commands to explore their environments and learn how to review data in large files. These commands will be complemented with a few bioinformatics programs that work specifically with DNA sequences. Students will view raw FASTQ- and FASTA-formatted sequence data and run commands to obtain subsets of data for their analyses. We will also look at data quality on pre- and post-filtered data. As the focus of the class is immuno-bioinformatics, examining data quality will be a minor component of the exercise, leaving more time to run IgBLAST and work with results.
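A few of these commands can be sketched with a toy FASTA file; the file names below are hypothetical stand-ins for the class dataset, and the real files hold thousands of reads rather than three.

```shell
# A toy FASTA file standing in for the real dataset (names are hypothetical).
printf '>read1\nACGT\n>read2\nGGCC\n>read3\nTTAA\n' > toy.fasta

# Count sequences by counting header lines:
grep -c '^>' toy.fasta            # → 3

# Take the first 2 records (2 lines each, since sequences are unwrapped):
head -n 4 toy.fasta > toy.subset.fasta
grep -c '^>' toy.subset.fasta     # → 2
```

The same pattern — peek with head or less, count with grep or wc, carve out a subset with head — carries over directly to the full-size files on the VM.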
As designed thus far, the core bioinformatics complements the immunoprofiling component that begins the course. We will examine a publicly available dataset from a study of human vaccine response. Unlike the Adaptive Biotechnologies (Seattle, WA) method, where PCR primers to V and J gene regions are used to amplify DNA or RNA, these data were obtained from an RNA-Seq experiment where cDNA was PCR amplified using primers to the 5' adaptor and C gene regions.
As this is a class exercise, three questions quickly pop to mind:
- How will the reads differ in terms of length and content between the Adaptive Biotechnologies method and the method used for this dataset?
- Are there any limitations to using C region primers in PCR? Hint: does it work equally well for both RNA and DNA?
- How well will the two methods compare in terms of quantifying receptor sequences?
Clues to answering the above questions can be found in “Immunoprofiling: How it works.”
Following a brief Unix introduction, we will run IgBLAST on subsets of data (1000 or 5000 reads) randomly selected from the master file. From the IgBLAST directory, the following command (and variations for different input files) will be entered:
> igblastn -germline_db_V database/human_V -germline_db_D database/human_D \
    -germline_db_J database/human_J -organism human -query SRR4431764.clean.1000a.fasta \
    -auxiliary_data optional_file/human_gl.aux -show_translation \
    -outfmt 19 > SRR4431764.clean.1000a.blast.tsv
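The random subsets themselves can be made with standard tools. One sketch, assuming an unwrapped FASTA file (one sequence line per record) and GNU shuf; the file names are illustrative and the class subsets are 1000 or 5000 reads, not 2:

```shell
# Toy unwrapped FASTA standing in for the master file (names are made up).
printf '>r1\nAA\n>r2\nCC\n>r3\nGG\n>r4\nTT\n' > reads.fasta

# Join each header/sequence pair onto one line, shuffle, keep 2 records,
# then split the pairs back into FASTA lines.
paste - - < reads.fasta | shuf -n 2 | tr '\t' '\n' > reads.rand2.fasta

grep -c '^>' reads.rand2.fasta    # → 2
```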
The output format option (-outfmt 19) writes data to a tab-separated file. Hence, the portion of the command, "-outfmt 19 > SRR4431764.clean.1000a.blast.tsv," will write that data to the specified “.tsv” file. Next, students will use cat, less, and wc to explore the file’s contents. They will see that the data viewed this way are nearly incomprehensible: a 1001-line (or 5001-line) table (one header, followed by many lines of data) with over 60 columns - all squished and line-wrapped into the confines of whatever terminal is used to access the VM. So, what can be done now?
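Before leaving the shell, a few standard tools can make the table approachable. A sketch with a tiny three-column stand-in for the real 60-plus-column table (file name, column names, and values here are illustrative):

```shell
# A tiny stand-in for the IgBLAST tab-separated output; the real table has
# 60+ columns, but three are enough to show the idea (values are made up).
printf 'sequence_id\tproductive\tv_call\nread1\tT\tIGHV1-2\nread2\tF\tIGHV3-23\n' > toy.blast.tsv

# List the column headers one per line, numbered, to find columns of interest:
head -n 1 toy.blast.tsv | tr '\t' '\n' | cat -n

# Pull out just a couple of columns for a readable view:
cut -f 1,2 toy.blast.tsv

# Count data rows (total lines minus the header):
echo $(( $(wc -l < toy.blast.tsv) - 1 ))   # → 2
```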
Getting the data
The last step is to get the data in the tsv file from the CyVerse Cloud into the Google "Cloud." In this way we can review and work with the results in Google Sheets (Microsoft Excel works too). A challenge in any kind of computing that involves more than one computer is moving data between computers. We won't go into details here, but in the case of cloud computing the challenges can increase. Fortunately, cloud computing and sharing data between computers is getting easier - think Google Drive, DropBox, Box, and the many other file sharing tools that now exist.
With Atmosphere’s web shell and desktop interfaces one can access a running instance's file system by typing control-alt-shift. This key combination will open an overlay window that provides an additional interface to add text and access the VM’s file system (directories). With the directory window open, we can double click on the appropriate directories to open them and navigate to the /home/igBLAST directory, and then double click on the desired file to retrieve it to our local computer.
The next step is to add that file to Google Drive (which requires a Google Account) and open it in Google Sheets (Sheets). Because we directed the BLAST output to a file with a ".tsv" extension, it will import into Sheets. The first line will become the column headers and the remaining lines will fill the cells under each column. In essence the spreadsheet software has imported our tab structured data into a database of sorts because the column headings can be used to develop queries and mine the data.
Spreadsheet programs like Sheets and Excel provide "Pivot" tools to explore and mine data in different ways. Google Sheets takes an extra step and provides an "Explore" button that shows examples of what it [Google] thinks might be good questions or things you can do with the data. In most cases, especially in biology, these will be nonsensical, but they illustrate possibilities and look nice.
In Pivot analyses, column headings form the keys for queries that retrieve data, which can be counted and summarized in different ways. For example, in antigen receptor formation the DNA rearrangement process results in sequences that can produce non-functional receptors due to reading-frame errors. We can detect these in the analysis, so one column in our table is labeled "productive." The data under productive are T, F, or blank for productive, not productive, and not determined, respectively. Counting the number of cells with T, F, or blank under productive is one kind of simple query.
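The same tally can also be done back at the command line with awk, locating the productive column by its heading rather than by position (the toy table and its layout are illustrative):

```shell
# Toy stand-in table (made-up values); the productive column is found by
# name so the command does not depend on column position.
printf 'sequence_id\tproductive\nr1\tT\nr2\tT\nr3\tF\nr4\t\n' > toy.tsv

# Tally T, F, and blank values under the "productive" heading:
awk -F'\t' 'NR==1 { for (i=1; i<=NF; i++) if ($i=="productive") col=i; next }
            { n[$col]++ }
            END { printf "T=%d F=%d blank=%d\n", n["T"], n["F"], n[""] }' toy.tsv
# → T=2 F=1 blank=1
```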
Pivot analyses are powerful and quick, but also limited to small-scale projects and single tables. In the case of immunoprofiling, this method will not work to compare data between samples, nor will it scale to support the hundreds of thousands or millions of rows needed for each sample. Still, this exercise is useful because it illustrates the steps involved in analyzing immunoprofiling data, and pivot analysis is a technique worth knowing about.