Exercise: Searching the GenBank database

Exercise written by: Rasmus Wernersson and Henrik Nielsen



This exercise has two main goals:

1) Introduction to the types of DNA data contained in the GenBank database (data format, visualization, cross-database links, how biological "features" such as genes are annotated and described as coordinates in the DNA sequence).

2) Practice searching the online version of GenBank hosted at the NCBI. Since the number of sequences in GenBank is HUGE, it's critically important to be able to search and filter the information. Filtering out the unwanted sequences in particular can be a challenge, as we shall see.

Where to find GenBank

The GenBank database is hosted at NCBI (National Center for Biotechnology Information, USA) (Link: http://www.ncbi.nlm.nih.gov/). Besides the main GenBank database, NCBI also hosts a number of other biological databases (for example whole-genome databases for human, mouse, chimp etc.). In this particular exercise we will concentrate on the classical "GenBank" database (http://www.ncbi.nlm.nih.gov/genbank/).

Using the "Entrez" database browser

ALL the NCBI databases can be queried through a common search interface named Entrez. On nearly all NCBI webpages a search box can be found in the upper part of the page, allowing easy access for searching the individual databases (or searching across all databases). Click on the following link to open up a new browser window with Entrez, where the focus is pre-set to search in the GenBank database:


(Alternatively go to the main NCBI webpage and choose "Nucleotide" as the database).

Part 1: Concerning the DATA in GenBank

This part of the exercise is about the types of data hosted in GenBank.

Searching for a specific ID

The typical reasons for searching for a specific ID in GenBank are looking up information from the literature (e.g. a gene found in a study), following up on information from other databases, investigating lists of interesting genes etc. In this part of the exercise we will be working with a set of alpha-globin genes.

  • Search for AB001981 - by default the result is shown in the GenBank format.
a) How many genes are contained in this entry?
b) From which organism does the DNA originate?
c) What kind of information is contained within the HEADER and within the FEATURE block?
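
The same lookup can also be scripted. The sketch below assumes NCBI's public E-utilities "efetch" service and the database name "nuccore" for the Nucleotide database; the helper name efetch_url is made up for this illustration. It only builds the address. Opening that address in a browser should return the entry as plain text.

```python
from urllib.parse import urlencode

# Address of NCBI's E-utilities "efetch" service (assumed to be unchanged).
EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_url(accession, rettype="gb"):
    """Build a URL that requests one Nucleotide entry as plain text.

    rettype="gb" asks for the GenBank flat-file format,
    rettype="fasta" asks for FASTA.
    (efetch_url is a hypothetical helper, not part of any NCBI library.)
    """
    params = {
        "db": "nuccore",        # the Nucleotide database
        "id": accession,        # e.g. "AB001981"
        "rettype": rettype,
        "retmode": "text",
    }
    return EFETCH + "?" + urlencode(params)


print(efetch_url("AB001981"))
```

If you use this in a script, keep the number of requests low; NCBI asks automated users not to flood the service.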

PubMed links

Notice that the publication from which the DNA sequence originates is cited (and linked via a PubMed ID) within the header. Sometimes multiple publications related to the same gene are listed. This is of great importance since it makes it possible to trace the source(s) of the DNA sequence and investigate if the experiments carried out are to be trusted.

This can be of real importance if something seems "wrong" with the sequence (for example if this particular gene exhibits a really strange intron/exon structure compared to other closely related genes, or if it simply doesn't match ANY other known genes of the same family). By investigating the original publication it's possible to double-check the experimental procedure. It may be that the article correctly states the gene to be of type XXX, but when the data was submitted it was accidentally annotated as YYY (it is the original researchers' responsibility to double-check this). There can also be more serious problems with the experiments, ranging from bad/wrong PCR primers to contamination with DNA from a different species during a cloning step.

NEVER FORGET: biological data CAN be wrong.
  • Investigate the PubMed link(s):
    • Follow the PubMed link from the sequence entry.
    • Observe that it is always possible to read the ABSTRACT of the publication in PubMed, even if access to the publication requires subscription. For most (new) publications there will also be a direct link to the publication itself.
    • Return to the sequence entry once again (or perform the search again if you closed the window).

GenBank vs. FASTA format

  • View the sequence entry in FASTA format (Simply click on "FASTA" in the top part of the page, below the page title)
    Now the entire GenBank entry is shown in FASTA format.
a) What happened to the alpha-globin genes? Can they still be found?
b) Which part of the GenBank entry has been converted?
Observe that the name of the sequence is based on the name of the GenBank entry.
  • Go back to GenBank format (Click on "GenBank")
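
Once you have answered the questions above, the relationship between the two formats can be sketched in a few lines of Python. This is a simplified illustration, not NCBI's own converter: it uses the LOCUS name as the FASTA title (the online FASTA title is built differently), it handles a single entry only, and the tiny record TOY is made up.

```python
def genbank_to_fasta(text, width=70):
    """Return the sequence of ONE GenBank flat-file entry in FASTA format."""
    name = "unknown"
    pieces = []
    in_origin = False
    for line in text.splitlines():
        if line.startswith("LOCUS"):
            name = line.split()[1]          # second word on the LOCUS line
        elif line.startswith("ORIGIN"):
            in_origin = True                # sequence lines start after ORIGIN
        elif line.startswith("//"):
            in_origin = False               # "//" ends the entry
        elif in_origin:
            # Each sequence line is "<position> <block> <block> ...";
            # drop the position number and keep the letters.
            pieces.extend(line.split()[1:])
    seq = "".join(pieces).upper()
    rows = [seq[i:i + width] for i in range(0, len(seq), width)]
    return ">" + name + "\n" + "\n".join(rows) + "\n"


# Made-up miniature entry, for illustration only.
TOY = """LOCUS       TOY0001   24 bp    DNA
FEATURES             Location/Qualifiers
     CDS             1..24
ORIGIN
        1 atggtgctgt ctgccgctga ctaa
//
"""

print(genbank_to_fasta(TOY), end="")
```

Notice that everything in the FEATURES block of the toy entry is ignored by the conversion.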

TASK: Save the GenBank "raw data" on your own computer:

  • Click on "Send:" in the upper right part of the page
  • Choose "Complete Record", "File" and "Genbank(full)" and click on "Create file"
  • Locate the downloaded file on your own computer
  • By default it has a pretty generic name ("sequence.gb") - rename the file to "AB001981.gb"
    Notice: The reason for renaming the file is simply good file management practice - now, just by skimming the filenames, we can guess that it's a GenBank file ("*.gb") and that it contains the "AB001981" entry.
  • Open it in nedit.
    Notice: What we have now is the "raw" data behind the information shown online, with no fancy HTML formatting and cross-links.
  • Verify that the contents of the file are as expected (it should look exactly like the information shown online).

QUESTION 1.3: Does the downloaded file have UNIX or Windows line-endings?
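
If your editor does not show the line-ending type, it can be checked with a small script. The sketch below counts the two kinds of line endings in the raw bytes of a file; the function name line_ending_style is made up, and the commented-out usage assumes the renamed file is in the current directory.

```python
def line_ending_style(data):
    """Classify the line endings found in a bytes object."""
    crlf = data.count(b"\r\n")          # Windows: carriage return + line feed
    lf = data.count(b"\n") - crlf       # UNIX: line feeds without a CR in front
    if crlf and not lf:
        return "Windows (CRLF)"
    if lf and not crlf:
        return "UNIX (LF)"
    if crlf and lf:
        return "mixed"
    return "no line endings found"


# Usage (the file must be opened in binary mode, "rb", so that Python
# does not translate the line endings while reading):
# with open("AB001981.gb", "rb") as handle:
#     print(line_ending_style(handle.read()))
```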

Exploring the genes defined in a GenBank entry

Go back to the GenBank entry in your browser. Click the first "CDS" element (Alpha-D)

  • CDS = CoDing Sequence: The PROTEIN CODING part of a gene. Basically: the sequence you get when the CODING exons are concatenated (UTR regions are ignored). A complete CDS starts with a START codon and ends with a STOP codon.
  • Hopefully it's quite intuitive why some of the sequence is highlighted - otherwise discuss it within the group (or with the instructor)

Repeat the same procedure for the other CDS (Alpha-A).

  • When looking at the FEATURE table, the first line of text in the definition of each CDS is as follows:
QUESTION 1.4: Based on your observations:
a) What do these numbers mean?
b) How many coding exons does each gene contain?
  • View the first CDS (Alpha-D) in FASTA format (click FASTA in the bottom right corner, after clicking CDS)
QUESTION 1.5: What do the numbers in the sequence title represent?
  • Switch to Graphic view (Click on Graphics at the top of the page)
    An interactive graphical representation of the GenBank entry will now be shown. The upper part of the visualization shows the entire length of the entry (5,891 bp) with bars representing the individual exons within the two genes.
    • The zoomed view below can be changed by dragging the transparent box with the blue borders in the overview representation at the top of the page.
    • The zoom level can be changed.
    • By "mousing over" the bars additional information about that particular feature will be shown.

The graphical overview is mostly useful for inspecting GenBank entries with multiple genes (some entries have hundreds of embedded genes). Play around with the interface for a few minutes to see what functionality is offered.
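
After you have answered Questions 1.4 and 1.5, the idea of describing a CDS as coordinates in the DNA sequence can be sketched in code. The sequence and the coordinates below are made up (they are NOT taken from AB001981), and the sketch only handles simple forward-strand locations written as "a..b" or "join(a..b,c..d,...)".

```python
import re


def parse_location(location):
    """Turn 'a..b' or 'join(a..b,c..d)' into a list of (start, end) pairs.

    Coordinates are 1-based and inclusive, as in the FEATURE table.
    """
    pairs = re.findall(r"(\d+)\.\.(\d+)", location)
    return [(int(start), int(end)) for start, end in pairs]


def extract(sequence, segments):
    """Concatenate the listed segments of a sequence (1-based, inclusive)."""
    # Python slices are 0-based and exclude the end, hence "start - 1".
    return "".join(sequence[start - 1:end] for start, end in segments)


# Made-up example: two coding exons (upper case) separated by an intron.
toy_sequence = "cc" + "ATGGCG" + "gtaagtttcag" + "GTGCTGTAA" + "cct"
segments = parse_location("join(3..8,20..28)")

print(len(segments), "segments")
print(extract(toy_sequence, segments))  # ATGGCGGTGCTGTAA
```

The extracted toy CDS begins with ATG and ends with TAA, and its length is a multiple of three.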

Part 2: Searching GenBank

The key issue to keep in mind when searching GenBank is to avoid drowning in huge amounts of irrelevant data. It is therefore of great importance to filter out unwanted information, WITHOUT losing the relevant entries. Today we will work with searching the TEXTUAL annotation of GenBank entries (keywords, free text etc). We will later get back to sequence based searches (BLAST).

In the first part of the exercise we'll investigate various ways to search using insulin as the example.

Naïve search

Search for GenBank entries containing the term "insulin"

  • Just do a simple search for INSULIN - don't put anything else in the search box.

Observe the following:

  • A large number of entries are found.
  • Go through a few pages of results and notice that we are offered data from a diverse set of sources: Experimental work, Patent applications, predicted genes, partial genes etc.
a) How many search results were returned? (only the "Nucleotide" hits, not the "EST" and "GSS" hits)
b) Are they all from Human? If no, give a counterexample. (Would you have expected them to be all human?)
c) Are they all insulin? If no, give a counterexample.

By default the search term is matched against ALL POSSIBLE fields in the GenBank entries - including almost all text in the HEADER and FEATURE table. It's even possible to pick up entries where the match is to one of the authors' names and not a gene name! (Perhaps not an issue for insulin). Luckily it is possible to restrict the search to specific pre-indexed fields in the HEADER and FEATURE table ("Search fields"), which makes it possible to make the search much more focused.

How the search is interpreted

When you do a naïve search (just writing a few terms Google-style), GenBank tries to interpret what you most likely meant, and it has a behind-the-scenes scheme for sorting the results to push the most interesting ones to the top. It is actually possible to see exactly how your search query is interpreted by locating the SEARCH DETAILS box.

a) What has your search for "insulin" been expanded into?
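
For those who prefer scripting, a search can also be sent to NCBI's E-utilities "esearch" service. The sketch below assumes that service and the database name "nuccore"; the helper name esearch_url is made up. It only builds the address. When opened in a browser, the reply should normally report the number of hits and how the term was translated, which can be compared with the SEARCH DETAILS box.

```python
from urllib.parse import urlencode

# Address of NCBI's E-utilities "esearch" service (assumed to be unchanged).
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def esearch_url(term, retmax=20):
    """Build a URL that runs a text search in the Nucleotide database.

    (esearch_url is a hypothetical helper, not part of any NCBI library.)
    """
    params = {
        "db": "nuccore",      # the Nucleotide database
        "term": term,         # exactly what you would type in the search box
        "retmax": retmax,     # maximum number of IDs to return
        "retmode": "json",
    }
    # urlencode takes care of spaces and other special characters.
    return ESEARCH + "?" + urlencode(params)


print(esearch_url("insulin"))
```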

Spend a few moments to investigate the HEADER section of the GenBank entry you have all received as a hand-out (X01831) to get an idea of how the data is related to specific sections (e.g. KEYWORDS and ORGANISM which we will use in a moment).

Try to find a search result that appears NOT to be the real insulin gene, and see why it was picked up by the search. If you have trouble finding one in your own result, search for DL142095.1 which came up around page 200 when the exercise was written.

The main issue here is that we find entries where "insulin" is mentioned anywhere in the entry, and sometimes it's unrelated genes like "Insulin-receptor", "Insulin inhibitor" etc.

Searching for human insulin

Search for HUMAN INSULIN and see what happens.

a) How many search results were returned?
b) Can you find the human insulin entry?
c) How was your search interpreted by the system?

Advanced search

Looking at the SEARCH DETAILS from the naïve searches we have just performed gives us a good idea of how we can build our own more powerful searches. This can be done in two ways:

  1. Simply writing the advanced search string yourself (e.g. "insulin[title]" - to search in the title field)
  2. Using the "Search builder" to put together the query bit by bit.
But why bother, when the naïve search for "human insulin" went so well?
  • If you just need a single (and well-known) gene from one of the well-known model organisms, it will indeed work very well to do a simple search. (Much like when you do a Google search and get your desired hit on the first page).
  • However, there are some situations where it's beneficial to specify the search in more detail - e.g. for building data sets of the same gene across multiple species, or just trying to locate a slightly more obscure gene. (Same as when the link you were looking for at Google was on page 10+ and you have to provide more accurate search terms).
It's possible to restrict the search to specific fields in the GenBank entries.

Now we are going to narrow down the search to specific parts of the annotation.

  • Click on Advanced in the top of the page.
    This brings up a form with a "Search Builder" that can be used to select and combine terms restricted to specific fields.
  • Select "Organism" and enter human.
  • Select "Title" and enter insulin.
  • Click "Search"
a) How many hits do we have now?
b) Are they all from Human? If no, give a counterexample.
c) Do they all appear to be insulin genes? If no, give a counterexample.
  • Now use the "Search Builder" to search for insulin in other fields instead of "Title" (still with "Organism" set to human)
a) How many hits are found when "Keyword" is set to insulin?
b) How many hits are found when "Protein Name" is set to insulin?
c) Find the correct Human Insulin gene entry (the correct hit). Write down its accession number, Locus name and Definition (title).

Note that the "Search Builder" is simply a tool for filling out the search box. If you know the names of the available search fields, it is often more convenient to type your search with the field names manually. A schematic overview of the search fields can be found on the NCBI homepage: Search Fields and Qualifiers.
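
What the "Search Builder" does can be mimicked with two small helper functions. This is only a sketch of the string it produces; the function names fielded and all_of are made up, and the field names are the ones used in the steps above.

```python
def fielded(term, field):
    """Format one search term restricted to one field, e.g. insulin[Title]."""
    if " " in term:
        term = '"' + term + '"'      # multi-word terms are quoted as a phrase
    return f"{term}[{field}]"


def all_of(*parts):
    """Combine fielded terms so that every one of them must match."""
    return " AND ".join(parts)


query = all_of(fielded("human", "Organism"), fielded("insulin", "Title"))
print(query)  # human[Organism] AND insulin[Title]
```

The resulting string can be pasted directly into the search box.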

Combining search terms using boolean operators: NOT, AND and OR

Our next task will be to find full length insulin genes from as many different organisms as possible using the Title field. Note that it might have been easier to use the Protein name or Keyword fields, but with Title we can immediately see the results of what we are doing, so we are using it for pedagogical reasons. We will now type the searches directly into the Search Box without using the Search Builder.

  • Let's start out with a new clean search for Insulin:



The number of hits is very high, and there are many partial genes and mRNA entries.

  • Let's now specify that the entries should be complete:
insulin[title] AND complete[title]

About the use of AND: The AND keyword is implicitly used whenever you enter more than one search term: "human globin" will be interpreted as "human AND globin" and only results where BOTH terms are found will be reported. We could therefore have omitted the "AND" in the previous query.

Observe that we still have many hits that are not actually insulin, so we want to add search terms to AVOID in order to bring down the false positive rate. By a brief inspection of some of the search hits, it turns out that some of them are, e.g., insulin receptors.

  • Let's get rid of these with the NOT keyword:
insulin[title] complete[title] NOT receptor[title]

Conceptually what we are doing here is to conduct a number of searches that are either COMBINED or SUBTRACTED from each other. The "receptor[title]" search term finds all entries where this term is found. This list is then excluded from the combined "insulin[title] AND complete[title]" list by using the NOT operator.

The use of boolean operators can be visualized graphically using Venn diagrams (see the figure to the right). A good strategy for narrowing down a GenBank search is to build a list of "kill words"/"filter words" (terms to avoid). More terms can be added to the list as search results are inspected, and it's found out why strange entries appear on the result list.
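
The same idea can be tried out with Python sets, where each set stands in for the hit list of one search. The entry IDs below are made up; only the set operations matter.

```python
# Made-up entry IDs standing in for the hit lists of three separate searches.
insulin_hits = {"E1", "E2", "E3", "E4", "E5"}     # insulin[title]
complete_hits = {"E2", "E3", "E4", "E9"}          # complete[title]
receptor_hits = {"E4", "E7"}                      # receptor[title]

both = insulin_hits & complete_hits       # AND: entries found by both searches
kept = both - receptor_hits               # NOT: remove the receptor entries
either = insulin_hits | receptor_hits     # OR: entries found by at least one

print(sorted(both))    # ['E2', 'E3', 'E4']
print(sorted(kept))    # ['E2', 'E3']
print(sorted(either))  # ['E1', 'E2', 'E3', 'E4', 'E5', 'E7']
```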

A word of caution: Be careful not to throw the baby out with the bath water - don't add kill-words that are so broad that they will actually exclude the gene(s) we are looking for. And don't add kill-words without specifying a search field - e.g. the search

insulin[title] complete[title] NOT receptor

would exclude some real insulin hits that just happened to mention "receptor" in some reference!

  • The final part of the exercise is to continue finding terms to exclude on your own. The point is to bring down the number of search results to a level where it's easy to pick the correct ones. Remember: the task is to find full length insulin genes from as many different organisms as possible using the Title field.
a) Which search term did you end up using?
b) How many search results do you get now?

Notice: There are several possible answers to this question, as it will be a balance between filtering out False Positives (things that are NOT insulin) without filtering out (too many) True Positives (things that are actually insulin).
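
When the list of kill-words grows, it can be convenient to keep it in a small script and generate the search string from it. The sketch below is one possible way of doing this; the function name build_query is made up, and the kill-words shown are only examples, not the "correct" answer.

```python
def build_query(required, kill_words, field="title"):
    """Build a search string: all required terms, minus every kill-word.

    Every term, including each kill-word, is restricted to the same field,
    so that a kill-word mentioned elsewhere in an entry does not exclude it.
    """
    query = " AND ".join(f"{word}[{field}]" for word in required)
    for word in kill_words:
        query += f" NOT {word}[{field}]"
    return query


print(build_query(["insulin", "complete"], ["receptor"]))
# insulin[title] AND complete[title] NOT receptor[title]

# Example kill-word list; extend it as you inspect your own search results.
print(build_query(["insulin", "complete"], ["receptor", "inhibitor"]))
```

Paste the generated string into the search box, inspect the hits, and add or remove kill-words as needed.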

"Free exercise"

Now it's time to perform a number of GenBank searches on your own. It's important to think about the search strategy - discuss this within the group.

QUESTION 3: Do at least three of the below and report your findings. Remember to write down the search string you ended up using for each question.

  1. Find the Rat and Mouse Insulin gene
  2. Find the alcohol-dehydrogenase gene from as many organisms as possible.
  3. Find the alpha-globin gene from Capra hircus - (Remember: Alpha-globin is part of hemoglobin).
  4. Find the alpha-globin gene from all ruminants - (hint: inspect the ORGANISM fields in a GenBank entry from an animal you know to be a ruminant, in order to pick up a good search term). If you want to go deeper into the taxonomy, the Tree of Life project has an entry on placental mammals here: http://tolweb.org/tree?group=Eutheria&contgroup=Mammalia.
  5. Find the actin gene from as many organisms as possible.
    Avoid mRNA and entries that are part of whole chromosomes, cosmids etc
  6. Find the human insulin receptor gene. Avoid partial genes / single exons in the results.