Homework assignments


I encourage you to discuss homework assignments with each other, but you may not view other students’ assignments or share your assignment with others. When you start programming, you often think there is a single way to address a task, but that is usually not the case: there are many ways to complete these assignments, and when code has been shared or copied it is often very obvious to a more experienced eye.

Turning in your homework by email

Your homework must always be turned in with a standardized name. That name should be <nau_id>_<homework_id>.<extension>, where <nau_id> is your NAU identifier (for example, mine is jgc53), and <homework_id> and <extension> are provided on a per-assignment basis.
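
As a quick illustration of the naming rule, here is a minimal Python sketch that assembles a filename from its three parts. The helper name homework_filename is hypothetical and is not part of any course material.

```python
def homework_filename(nau_id, homework_id, extension):
    """Build <nau_id>_<homework_id>.<extension> from its three parts."""
    # Accept the extension with or without a leading dot.
    return "{0}_{1}.{2}".format(nau_id, homework_id, extension.lstrip("."))


print(homework_filename("jgc53", "gc_content", "pdf"))
```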

Unless otherwise noted, homework must be turned in by email to jc33@nau.edu before class on the day it is due.

Final Project Due: Tues May 7th, 5pm


Homework id: final; Extension: pdf; For this assignment, the file I turn in would be named jgc53_final.pdf. E-mail your files as an attachment to jc33@nau.edu.


Start Early! The QIIME DB uses a shared server for this analysis, and you never know how busy it will be.

For this assignment you will perform a meta-analysis of several microbial surveys of the indoor environment. To get more details on microbiology of the built environment, check out microBEnet. You will use the QIIME Database to perform this meta-analysis: you’ll export several initial results from the QIIME DB, including a combined OTU table, taxonomy summary results, and beta diversity results. You’ll then download these data to your own computer to perform some more in-depth analyses in the QIIME Virtual Machine.

Follow these steps to perform the QIIME meta-analysis:

  1. Create an account with the QIIME Database.
  2. On logging in, choose ‘Create a New Meta-Analysis’. Name your meta-analysis and hit ‘Next’. Click ‘Perform Meta-analysis’.
  3. Next, you’ll choose the studies you want to include. Choose Kelley_office_contamination (cite), Flores_restroom_surface_biogeography (cite), and CaporasoIlluminaPNAS2011_5prime (cite). Then do the following: uncheck “Show Common Fields Only”; Select Metadata Fields: “All”; Hit ‘>>’ to copy all metadata fields to your mapping file; Hit “Continue”.
  4. Select “Processing Method: Serial”; Check “Taxonomy Summary”, then expand “Optional parameters (sort_otu_table.py)”, and set “Category to sort by” to STUDY_TITLE; Check “Beta-Diversity”, expand “Optional Parameters (rarefaction.py)”, and set “Rarified at” to 500 seqs/sample, then expand “Optional Parameters (beta_diversity.py)”, and set “Metrics to use” to bray_curtis, weighted_unifrac, unweighted_unifrac, euclidean; choose 3d PCOA plots (no optional parameters). Click “Submit”.
  5. This may take a while to run. When it’s done, you can begin to explore the results via the database website. Download the “Zip Archive” to address the questions below.

You will turn in a paper of at most three pages describing your analysis, including a brief introduction (describe what specific types of samples are under investigation here as well as some background on the studies the samples were derived from), a description of your methods, and a description of your results. Focus your results on the following questions:

  1. Do the four distance metrics you applied result in similar PCoA plots? Compare these visually, and present a distance matrix of Procrustes M^2 values between the four principal coordinate matrices generated by the QIIME DB (*_pc.txt) as a table in your paper and discuss your conclusions from this analysis. This will involve running a bunch of commands, reviewing the output of each, and using that output to compile a 4x4 distance matrix (e.g., in Excel).
  2. What does the clustering pattern in your weighted UniFrac PCoA plot tell you about the likely sources of the indoor microbial communities from the different studies (as judged by the similarity of the indoor communities to the diverse set of samples in the CaporasoIlluminaPNAS2011_5prime study)? Include the most informative view of your PCoA, with a color legend, in your paper to support this analysis.

On an additional page (in addition to your three page paper) please provide feedback on the following questions on the usability of the QIIME database. (This is absolutely required! The QIIME DB is in beta testing status right now, and in exchange for getting to use it for this project to expose you to new tools, I agreed to require you to provide input that is useful in testing of the system.)

  1. Do you have any suggestions for QIIME DB user interface improvements?
  2. Did you notice any issues with stability of the system (e.g., did anything crash or hang)?
  3. Are there any additional features that would be useful?
  4. Do you get all of the data that you wanted when downloading the zip archive, or are there additional files that should be provided?
  5. Do you feel that you sufficiently understand the methods being applied (i.e., is relevant information being provided at the right times)?


See the Procrustes tutorial here. Since your principal coordinate matrices were already generated for you by the QIIME DB, you’ll pick up the tutorial with transform_coordinate_matrices.py. Do some research on Procrustes analysis to learn how to interpret these results. (Hint: search for the PROTEST method.)
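
If you would rather compile the 4x4 table in Python than in Excel, the following is a minimal sketch. It assumes that you have already recorded one M^2 value for each pair of metrics from your own runs, and that M^2 does not depend on which matrix is treated as the reference (which is what the "distance matrix" wording suggests). The function names build_m2_table and pairs_to_run are hypothetical.

```python
from itertools import combinations

METRICS = ["bray_curtis", "weighted_unifrac", "unweighted_unifrac", "euclidean"]


def pairs_to_run(metrics=METRICS):
    """List every unordered pair of metrics that needs a comparison."""
    return list(combinations(metrics, 2))


def build_m2_table(pairwise_m2, metrics=METRICS):
    """Return a symmetric table (a list of rows) of M^2 values.

    pairwise_m2 maps a (metric_a, metric_b) tuple to the M^2 value you
    recorded for that pair; either ordering of the pair is accepted.
    The diagonal is 0.0 because a matrix compared to itself fits perfectly.
    """
    table = []
    for row_metric in metrics:
        row = []
        for col_metric in metrics:
            if row_metric == col_metric:
                row.append(0.0)
            elif (row_metric, col_metric) in pairwise_m2:
                row.append(pairwise_m2[(row_metric, col_metric)])
            else:
                # Assumption: M^2 is symmetric, so reuse the reversed pair.
                row.append(pairwise_m2[(col_metric, row_metric)])
        table.append(row)
    return table


# Six comparisons are needed to fill a 4x4 symmetric table.
for pair in pairs_to_run():
    print(pair)
```

Fill a dictionary keyed by those six pairs with the values you observed, pass it to build_m2_table, and copy the resulting rows into your paper.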


UniFrac clustering analysis: modifying the mapping file (e.g., in Excel) to create a single metadata column that is informative across all samples will help here. You would need to run make_3d_plots.py with your new mapping file to generate a new PCoA plot. I created a new column in my mapping file called Study_detail that combined the Surface and ENV_FEATURE columns.
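
If you prefer not to edit the mapping file by hand, here is a minimal Python sketch of the same idea. It assumes the mapping file is tab-separated with the header on the first line, and it inserts the new column just before the final column on the assumption that your mapping file keeps Description last. The function name add_combined_column and the underscore separator are illustrative choices, not anything QIIME requires.

```python
def add_combined_column(lines, first, second, new_name, sep="_"):
    """Return mapping-file lines with a new column joining two existing ones.

    lines: the tab-separated lines of a mapping file, header line first.
    The new column is inserted just before the final column.
    """
    header = lines[0].rstrip("\n").split("\t")
    i, j = header.index(first), header.index(second)
    out = ["\t".join(header[:-1] + [new_name] + header[-1:])]
    for line in lines[1:]:
        if line.startswith("#") or not line.strip():
            # Keep comment and blank lines untouched.
            out.append(line.rstrip("\n"))
            continue
        fields = line.rstrip("\n").split("\t")
        combined = fields[i] + sep + fields[j]
        out.append("\t".join(fields[:-1] + [combined] + fields[-1:]))
    return out
```

To use it, read your mapping file with open(...).readlines(), call add_combined_column(lines, "Surface", "ENV_FEATURE", "Study_detail"), and write the returned lines (joined with newlines) to a new file that you then pass to make_3d_plots.py.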

Homework 7: Human genomics (20 points)

The purpose of this homework is to help you become familiar with two of the most common databases for human genetic information and the types of analysis that can be done on this information. All of the data that you will need for this assignment can be found on the UCSC Human Genome Browser and the 1000 Genomes Database. While all of the data you’ll need is available from these two sites, you should feel free to research a topic or question elsewhere.

Find the assignment here.


Homework id: hum_gen; Extension: ipynb, pdf; For this assignment, the files I turn in would be named jgc53_hum_gen.ipynb and jgc53_hum_gen.pdf. E-mail your files as an attachment to jc33@nau.edu.

Homework 6: QIIME analysis (25 points)


Homework id: qiime; Extension: biom, pdf, and ipynb; For this assignment, the files I turn in would be named jgc53_qiime_otu_table_even.biom, jgc53_qiime_paper.pdf and jgc53_qiime_analysis_notes.ipynb. E-mail your files as three separate attachments to jc33@nau.edu.


This assignment involves large data files and requires a working QIIME installation. You should work in the QIIME Virtual Box for this assignment. Remember that to run a bash (i.e., command line) command from the IPython notebook you should start that command with !.


This assignment is designed to force you to use existing resources (internet, primary literature) to learn to use an existing bioinformatics tool to address a biological question. Because you’re expected to learn some of this on your own, this homework will involve additional effort relative to the others this semester. It will be a lot easier if you begin by working through the QIIME Overview Tutorial. See the Illumina Overview Tutorial IPython Notebook, which illustrates how to run these analyses in an IPython Notebook.

Begin by reading Fierer et al. You will use QIIME to reproduce the analyses presented in this paper.

Data analysis: You will perform a complete QIIME analysis of the data set presented in Fierer et al, and turn in the following items:
  • A 3 page (maximum!) paper describing your analysis. Write this as if you’re submitting to a journal, so it should contain an Introduction section describing the hypotheses being addressed and the strategy for addressing these (refer to Fierer et al), a Methods section containing a brief description of your bioinformatics methods (e.g., what version of QIIME, what type of OTU picking was used) and how the data were generated (e.g., sequencing platform), and a Results section describing the results of your analysis. Your 2-3 pages should include a beta diversity PCoA plot (generated by beta_diversity_through_plots.py; focus on Unweighted UniFrac, which is what we discussed in class) in a view that supports your conclusions, and an alpha rarefaction plot (generated by alpha_rarefaction.py). You should also include a table that lists the five OTUs that are most significantly different across the Subject category in your mapping file (generated by otu_category_significance.py). Figures and tables should take up no more than one total page of your paper. This paper must be turned in as a PDF - .doc or other word processing formats will not be accepted.
  • Evenly sampled OTU table (generated by beta_diversity_through_plots.py). This should be provided as a gzipped .biom file.
  • IPython Notebook containing the full list of commands that you ran to generate the above data, noting any problems that you ran into along the way.

The following commands will get you started. Run these after logging in to your QIIME Virtual Box and starting a new IPython Notebook.

# download the Fierer data
!curl -O https://s3.amazonaws.com/s3-caporaso-share/fierer_forensic_keyboard_assignment.tgz

# unpack the tgz file and change to the resulting directory
!tar -xvzf fierer_forensic_keyboard_assignment.tgz
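# use the %cd magic here: !cd runs in a subshell, so the notebook's working directory would not change
%cd fierer_forensic_keyboard_assignment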

# generate .fna and .qual files from the sff file
!process_sff.py -i ./

The steps in the QIIME Overview Tutorial are the next place to go from here. Good luck!

Homework 5: Metrics of diversity (15 points)

Download the assignment from here. Complete the assignment, and turn it in to Mr. Chase in class or before class. All pages must contain your name and be stapled together.

Application presentations


Homework id: app; Extension: pdf; The assignment should be named <group-number>_app.pdf and <group-number>_app_slides.pdf, so for example Group 1’s assignments would be named group1_app.pdf and group1_app_slides.pdf.


Each group will be pre-assigned an article seven days before their presentation date. The students will present their article in class on their assigned presentation date. Each member of the group will present part of the material. Answers to the following questions will be turned in (by email, with all group member names included). These answers should form an approximately two-page report.

  1. What is the biological problem that the authors are trying to address?
  2. What is the motivation for addressing this problem?
  3. What previous work has been done in this area? Are there pre-existing tools that address this problem?
  4. What computational technologies did the authors make use of to create this tool (e.g., programming language, databases, etc)?
  5. What preexisting biological resources (e.g., sequence databases) did the authors make use of (if any)?
  6. What is the input to this tool?
  7. What is the output of this tool?
  8. How did the authors test this tool? Was performance benchmarking included in their paper?
  9. How did the authors evaluate whether this tool was giving biologically meaningful results?

Presentations will address these same questions, and will additionally include a live demo of the software where the presenters show/discuss the input data, run the application, and show/discuss the output. Your presentation should be around 20 minutes, including the live demo.


All students in a group will receive the same grade on this assignment, unless there is clear evidence that some student(s) didn’t contribute.


Group 1 (3/11/13): jrh355 etb36 rwf25 hhh34 (paper and supplementary material - both are required reading!)

Group 2 (3/11/13): gz38 kn95 sk367 ad572 (paper)

Group 3 (3/13/13): bs527 eca37 kh832 ajc388 (paper and website)

Group 4 (3/13/13): esm23 msk53 pja43 (paper)

Homework 4: Tree of life (15 points)


Homework id: tol; Extension: py or ipynb (you can either build this as an IPython notebook or a stand-alone python script), tre and pdf; For this assignment, the files I turn in would be named jgc53_tol.py (or jgc53_tol.ipynb), jgc53_tol.tre and jgc53_tol.pdf.

In this assignment you will make use of the PyCogent software package to automate the process of constructing a phylogenetic tree from a set of genes. This will include querying NCBI to obtain sequences, performing a multiple sequence alignment, building a phylogenetic tree, writing a newick string containing that tree to file, and writing a visualization of that tree to a PDF file.

Your script must define a function called obtain_sequences_and_build_tree that takes: 1. a list of queries (as strings) to be run against NCBI; 2. a list of query labels (also as strings) to label the sequences resulting from each query in the final tree; 3. the filepath where the output newick string should be written; 4. the filepath where the output pdf should be written; 5. an optional parameter n which defines how many query results should be randomly chosen for each of the queries. The default value for n should be 5.

Your obtain_sequences_and_build_tree function must return a phylogenetic tree derived from n aligned representatives of each of the queries passed via parameter 1. Your function definition should look exactly like this, where you replace # do a bunch of work with your code:

def obtain_sequences_and_build_tree(queries,
    # do a bunch of work
    return tree

As part of your analysis, you should filter any sequences that have one or more N characters in them. Each sequence label in the output tree should begin with the query label corresponding to that sequence. tree should be a PyCogent PhyloNode object (the output of cogent.app.fasttree.build_tree_from_alignment).
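
The filtering and labeling steps can be sketched in plain Python, independent of PyCogent. This is only an illustration under stated assumptions: sequences are represented here as (sequence id, sequence string) pairs, which may not match the objects your NCBI query returns, and the helper names filter_and_label and choose_representatives are hypothetical.

```python
import random


def filter_and_label(records, query_label):
    """Drop sequences containing N and prefix the rest with query_label.

    records: iterable of (seq_id, sequence) pairs (an assumed representation).
    """
    kept = []
    for seq_id, seq in records:
        if "N" in seq.upper():
            continue  # skip any sequence with one or more N characters
        kept.append((query_label + seq_id, seq))
    return kept


def choose_representatives(records, n=5):
    """Randomly choose up to n records without replacement."""
    records = list(records)
    return random.sample(records, min(n, len(records)))
```

Whether you filter before or after the random draw changes how many sequences survive, so it is worth deciding on an order and noting it in your script.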

In your script, you should call the function you define as follows:

     ['"small subunit rRNA"[ti] AND archaea[orgn]',
      '"small subunit rRNA"[ti] AND bacteria[orgn]',
      '"small subunit rRNA"[ti] AND eukarya[orgn]'],
     ['A: ','B: ','E: '],

where <nau-id> is replaced with your NAU identifier. This should perform all of the analysis steps and write the newick file and PDF to the directory where you are running the script from. You’ll turn in the script, the newick file, and the PDF.


This page should help quite a lot.


The QIIME VirtualBox has PyCogent, muscle, and FastTree preinstalled. Working there will save you a lot of time on software installation.


Remember that you can call dir() on an object to find out what methods are available to that object. One of the methods associated with your tree object will help you generate a newick formatted tree.
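
For example, the following small sketch lists the public names available on an object by filtering out the ones that start with an underscore. It is demonstrated on a plain string rather than a tree object, and the helper name public_names is hypothetical.

```python
def public_names(obj):
    """Return the attribute and method names of obj that do not start with '_'."""
    return [name for name in dir(obj) if not name.startswith("_")]


print(public_names("ACGT"))
```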

Homework 3: Alignments (25 points)


This is a big assignment. Start early!


Homework id: align; Extension: ipynb; For this assignment, the file I turn in would be named jgc53_align.ipynb.


For this assignment you should work in the QIIME Virtual Box, or in another local IPython installation. You may not use the class IPython Notebook server for this, since it is not a multi-user environment (i.e., other students will see your work). After installing the QIIME Virtual Box (instructions here), you can start IPython by opening a terminal and typing ipython notebook. Leave the terminal window open, and open the URL that is printed to the terminal.

Begin with the Needleman-Wunsch implementation in the Lecture 10 IPython Notebook and the materials in the Lecture 8-10 slides.

For this assignment you will turn in an IPython notebook. You will generate this notebook by starting with the Lecture 10 IPython Notebook and modifying it to add new functionality and annotation.

Part 1

Add a new function with this exact form:


This function should return, in this order, the aligned sequence 1 as a string, the aligned sequence 2 as a string, and the score of the global alignment.

To confirm that this is working for you, you should test with the following command, as this is one of the tests that we will apply to your homework:


which should result in the following output:

("HEAGAWGHE-E", "--P-AW-HEAE", 1.0)

Part 2

In the same notebook, define a new function of the form:


Which returns a list of n scores for aligning each of n random sequences of the same length as query_sequence against subject_sequence.

Next, define a function that takes a query sequence, a subject sequence, and a value n with this form:


This function should call generate_random_score_distribution to generate a list of scores for random alignments. It should then compute the score for aligning query_sequence against subject_sequence. The return value of this function should be the number of random alignment scores that are better than or equal to the actual alignment score divided by n.
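
The final counting step can be illustrated on its own, separately from the alignment code. The sketch below takes a list of scores that has already been computed and is not the required function; the name fraction_at_least is hypothetical.

```python
def fraction_at_least(random_scores, actual_score):
    """Fraction of random_scores that are greater than or equal to actual_score."""
    n = len(random_scores)
    count = sum(1 for score in random_scores if score >= actual_score)
    # float() keeps the division from truncating under Python 2.
    return count / float(n)
```

A value near 0 means that random sequences rarely score as well as the real query, while a value near 1 means the real alignment is no better than chance.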

After defining this function, use it to compare the following sequences to one another using a value of n=1000 when calling fraction_better_or_equivalent_alignments as follows:

query1 = "RHT"
query2 = "RHTSWIL"

Each of these query sequences is designed to be similar to the subject. Also compare some randomly generated query sequences to the subject sequence. Do this several times. In a markdown cell just below this analysis, describe any general patterns that you notice. What do you think this means? Run this example on the alignment we worked through in class (query sequence: HEAGAWGHEE; subject sequence: PAWHEAE) and describe the results. How does this alignment compare to your randomly generated alignments?


In the Lecture 8 IPython Notebook there is code illustrating how to generate a random sequence of bases at a given sequence length (see the last cell where root_sequence is defined). Here we’re working with protein sequences, so the alphabet is different but the process is the same.
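
As a rough sketch of that process with a protein alphabet: the code below draws each residue uniformly from the 20 standard amino acids. Uniform frequencies are an assumption made here for simplicity, and the names AMINO_ACIDS and random_protein are hypothetical (the Lecture 8 code may be organized differently).

```python
import random

# One-letter codes for the 20 standard amino acids.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def random_protein(length, rng=random):
    """Return a random protein sequence of the given length."""
    return "".join(rng.choice(AMINO_ACIDS) for _ in range(length))


print(random_protein(7))
```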


In my Lecture 8-10 slides I provide details on the differences between SW and NW initialization, scoring, and traceback.
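
To make the initialization and scoring differences concrete, here is a minimal sketch. It assumes a linear gap penalty passed in as a positive number, and the function names first_row and cell_score are hypothetical, so they will not match the Lecture 10 notebook.

```python
def first_row(length, gap_penalty, local):
    """First row (or column) of the dynamic programming matrix.

    NW accumulates gap penalties along the edge; SW starts every edge cell at 0.
    """
    if local:
        return [0] * (length + 1)
    return [-gap_penalty * i for i in range(length + 1)]


def cell_score(diag, up, left, substitution, gap_penalty, local):
    """Score for one interior cell from its three neighbors.

    diag, up, left: scores of the neighboring cells.
    substitution: substitution-matrix score for the two residues at this cell.
    """
    best = max(diag + substitution, up - gap_penalty, left - gap_penalty)
    # SW never lets a cell drop below zero; NW keeps negative scores.
    return max(0, best) if local else best
```

The traceback differs as well: NW starts in the bottom-right cell and runs to the top-left, while SW starts at the highest-scoring cell and stops when it reaches a cell with score 0.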

Part 3

Define a general function that can perform global (Needleman-Wunsch; NW) or local (Smith-Waterman; SW) alignments.

Define a new function, generate_sw_and_traceback_matrices with the following form:


The return value should be the dynamic programming matrix and the traceback matrix for a SW alignment.


This will be much easier if you start with the generate_nw_and_traceback_matrices and modify it for Smith-Waterman.

Define a new function sw_traceback with the form:


This function should return the aligned sequences in the order they were passed in and the alignment score.


This will be much easier if you start with the nw_traceback and modify it for Smith-Waterman.

Next, define a new function sw_align with the form:



This will be much easier if you start with your nw_align function and modify it for Smith-Waterman.

Define a new function align with the following form:


Where local is a boolean (i.e., True or False) value. This function should return aligned_sequence1, aligned_sequence2, and the best alignment score. If local==False, an NW alignment should be performed. If local==True an SW alignment should be performed.

Run both local and global alignments as follows to test that this is working as expected:

align('HEAGAWGHEE','PAWHEAE',blosum50, False)

which should result in the following output:

("HEAGAWGHE-E", "--P-AW-HEAE", 1.0)


align('HEAGAWGHEE','PAWHEAE',blosum50, True)

which should result in the following output:

("AWGHE", "AW-HE", 28.0)

Guest lecture reports (due 11 February 2013) (15 points; 7.5 points each)

For each of the two guest lectures, turn in answers to the questions in this document. You can download this document and use it as a template for your assignment. You will turn these in as two separate PDFs by email to jc33@nau.edu. Taking detailed notes during these lectures will make this assignment a lot simpler!


Homework ids: johnson_lecture and butterfield_lecture; Extension: pdf; For this assignment, the files I turn in would be named jgc53_johnson_lecture.pdf and jgc53_butterfield_lecture.pdf.

BLAST exercises (due 4 February 2013) (20 points)

Using NCBI nucleotide BLAST, complete the assignment worksheet. You should turn in a PDF of that file with all answers filled in by email to jc33@nau.edu.


Homework id: blast; Extension: pdf; For this assignment, the file I turn in would be named jgc53_blast.pdf.


This assignment is derived from BLASTing Through the Kingdom of Life. You may find this tutorial to be very helpful.

Query sequences:


GC content (due 23 January 2013) (10 points)

Download a genome and compute its GC content. Copy or download the assignment, fill in your answers, and turn the assignment in by email as a PDF.

Note that there are various ways that you can just look up the GC content, including via the IMG website. I’m asking you to compute it, and you’re being graded on your descriptions. Getting the right answer is a bonus (i.e., if you spend a couple of hours trying, and get it wrong, you’ll be graded on your well-documented effort, not your final answer).

Hints: Start with the IMG Genome Browser, and work with a bacterial, archaeal or viral genome.

Be creative - there are many ways to achieve this.
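
One of those ways, sketched in Python: count G and C bases across the sequence lines of a FASTA file and divide by the number of unambiguous bases. Leaving ambiguous characters such as N out of the denominator is an assumption made here, so state whichever convention you choose in your write-up. The function name gc_content is hypothetical.

```python
def gc_content(fasta_lines):
    """Return the fraction of A/C/G/T bases that are G or C.

    fasta_lines: an iterable of lines from a FASTA file; header lines
    (starting with '>') and blank lines are skipped.
    """
    gc = total = 0
    for line in fasta_lines:
        line = line.strip()
        if not line or line.startswith(">"):
            continue
        seq = line.upper()
        gc += seq.count("G") + seq.count("C")
        total += sum(seq.count(base) for base in "ACGT")
    return gc / float(total) if total else 0.0
```

Because an open file can be iterated line by line, you can pass the result of open() on your downloaded genome file directly to this function.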


Homework id: gc_content; Extension: pdf; For this first assignment, the file I turn in would be named jgc53_gc_content.pdf.