
PlantIGB - What data are available?

By default, PlantIGB accesses the PlantQuickload data web site hosted at http://www.bioviz.org/plant_quickload.

The PlantQuickload Web site is just a set of directories (folders) on our server that IGB can access and load via the Internet. If you are a computational biologist interested in doing data-mining experiments with Arabidopsis data, the files in these directories may be very useful. Be sure to read the README.html file(s) for notes on formats and other topics. If you have questions, send us an email: aloraine@uab.edu.

The data files currently available in PlantQuickload include the following:


Arabidopsis thaliana


TAIR version 7 annotations

These data sets include the following:
IGB menu name               Description
TAIR7_protein_coding_gene   genes encoding proteins
TAIR7_mirna                 genes encoding microRNAs (example: AT4G05105.1)
TAIR7_rrna                  genes encoding ribosomal RNAs
TAIR7_pre-trna              genes encoding tRNAs
TAIR7_snorna                genes encoding small nucleolar RNAs (example: AT4G13245.1)
TAIR7_pseudogene            pseudogenes (example: AT5G20800.1)
TAIR7_snrna                 genes encoding small nuclear RNAs (example: AT5G09585.1)
TAIR7_other_rna             genes encoding other types of RNAs not in the previously listed
                            categories, such as potential natural antisense genes
                            (example: AT5G40348.1)

All of these data are from the file named TAIR7_GFF, available from The Arabidopsis Information Resource (TAIR) ftp site and downloaded in April 2007.
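For readers who download the GFF file themselves, the following is a minimal sketch of tallying annotations by feature type. It assumes only the standard GFF layout (tab-delimited, with the feature type in the third column). The function name is hypothetical, and the type strings found in the TAIR file may not match the IGB menu names listed above.

```python
from collections import Counter


def count_feature_types(lines):
    """Count GFF records by the value in the third (feature type) column.

    `lines` is any iterable of text lines, for example an open file.
    Blank lines and comment lines starting with '#' are skipped.
    """
    counts = Counter()
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 3:
            counts[fields[2]] += 1
    return counts
```

To use it, open a local copy of the GFF file and pass the file handle to count_feature_types; the result maps each feature type to the number of records of that type.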

To load data into IGB, click the checkboxes under the Data Access tab. Each data set will appear in a separate track. To find out more about a particular annotation, right-click the annotation (or control-click on a Mac) and select the arabidopsis.org option, which should tell your Web browser to open the corresponding locus page at TAIR. The new page should tell you which category ("Gene Model Type") the gene belongs to; this should match the tier label in IGB.


TAIR version 7 EST and cDNA alignments

These datasets include the following:

IGB label      Description
EST_TAIR7mm    ESTs that align reasonably well to more than one location in the genome
EST_TAIR7sm    ESTs that align to just one location in the genome
cDNA_TAIR7sm   full-length cDNA sequences that align to a single location in the genome
cDNA_TAIR7mm   cDNAs that align reasonably well to more than one location in the genome

These data represent genomic alignments for Arabidopsis ESTs and cDNA sequences provided by TAIR; they correspond to the "Transcripts" track in the TAIR SeqViewer tool.

We've divided them into four different tracks for display in IGB. Please note that the EST_TAIR7sm data set is quite large and may take more time to load than the other data sets.

Please note also that there are a variety of methods available for aligning expressed sequences to genomic sequence, and they do not all operate in the same way or produce the same answers for every sequence.

To find out more about the computational pipeline that generated these alignments, visit the Genome Annotation page at TAIR, which describes how the alignment pipeline operates.

These data are from the TAIR ftp site (see: ftp://ftp.arabidopsis.org/home/tair/Genes/TAIR7_pre-release/TAIR7_Transcripts_by_map_position). They correspond to the "transcripts" tier in the TAIR online SeqViewer (genome browser) tool.

If you would like to work with the source data file from TAIR, you will need to know what the different fields represent:

    Fields:

    0 Locus - AGI locus code
    1 Locus_orientation_is_5 - gene orientation relative to genomic sequence 
    2 Genbank_acc - Genbank accession
    3 external_id - Genbank gi number
    4 Type(1=cDNA_2=EST)
    5 Chromosome number (6 is chloroplast, 7 is mitochondrion)
    6 Transcript_orientation_is_5 - transcript orientation relative to genomic sequence
      (note that a 3-prime EST will likely appear on the opposite strand from the associated gene)
    7 Map_start_coordinate - one-based
    8 Map_end_coordinate - one-based
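The field list above can be turned into a small reader. The sketch below is illustrative only: it assumes the file is tab-delimited with the nine fields in the order listed, and the class and function names are hypothetical. Orientation values are kept as raw strings because their encoding is not described here.

```python
import csv
from typing import Iterator, NamedTuple


class TranscriptRow(NamedTuple):
    locus: str                        # field 0, AGI locus code
    locus_orientation_is_5: str       # field 1, kept as the raw string
    genbank_acc: str                  # field 2
    external_id: str                  # field 3, Genbank gi number
    seq_type: str                     # field 4, "cDNA" or "EST"
    chromosome: str                   # field 5, "C"/"M" for organelles
    transcript_orientation_is_5: str  # field 6, kept as the raw string
    map_start: int                    # field 7, one-based
    map_end: int                      # field 8, one-based


SEQ_TYPES = {"1": "cDNA", "2": "EST"}
ORGANELLES = {"6": "C", "7": "M"}  # 6 is chloroplast, 7 is mitochondrion


def parse_transcripts(handle) -> Iterator[TranscriptRow]:
    """Yield one TranscriptRow per data line (tab-delimited is an assumption)."""
    for fields in csv.reader(handle, delimiter="\t"):
        # Skip header, blank, or malformed lines.
        if len(fields) < 9 or not (fields[7].isdigit() and fields[8].isdigit()):
            continue
        yield TranscriptRow(
            locus=fields[0],
            locus_orientation_is_5=fields[1],
            genbank_acc=fields[2],
            external_id=fields[3],
            seq_type=SEQ_TYPES.get(fields[4], fields[4]),
            chromosome=ORGANELLES.get(fields[5], fields[5]),
            transcript_orientation_is_5=fields[6],
            map_start=int(fields[7]),
            map_end=int(fields[8]),
        )
```

If the downloaded file turns out to use a different delimiter, only the delimiter argument needs to change.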

For visualization in IGB, we subdivided the EST and cDNA alignment annotations into two new annotation subsets based on the number of map positions TAIR has reported for a single expressed sequence:

IGB menu name: EST_TAIR7mm - TAIR7 ESTs that map to more than one location in the genome. (The suffix "mm" stands for multi-mapper.)

IGB menu name: EST_TAIR7sm - TAIR7 ESTs that map to exactly one location in the genome. (The suffix "sm" stands for single-mapper.)
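The single-mapper versus multi-mapper split described above amounts to counting distinct map positions per expressed sequence. The following is a rough sketch of that idea, not the code we used to build the tracks; the function name and the input tuple layout are assumptions made for illustration.

```python
from collections import defaultdict


def split_by_map_count(alignments):
    """Split accessions into single-mappers and multi-mappers.

    `alignments` is an iterable of (accession, chromosome, start, end)
    tuples, one per reported map position. Returns two sets of
    accessions: (single_mappers, multi_mappers).
    """
    positions = defaultdict(set)
    for accession, chromosome, start, end in alignments:
        # A set ignores exact duplicate rows for the same position.
        positions[accession].add((chromosome, start, end))
    single = {acc for acc, pos in positions.items() if len(pos) == 1}
    multi = {acc for acc, pos in positions.items() if len(pos) > 1}
    return single, multi
```

Counting distinct (chromosome, start, end) triples rather than raw rows means that a repeated line in the source file does not turn a single-mapper into a multi-mapper.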


ESTs from the Salk SIGnAL site

These data are from a file named EST-2006-12-19.txt (dated 12-Mar-2007, 14M), which is from http://natural.salk.edu/database/transcriptome/, hosted at the Salk Institute.


ATH1 probeset-to-genome alignments from Affymetrix

This data set contains ATH1 probeset-to-genome alignments from Affymetrix. Note that numerous probe sets align to the genome in multiple locations. We have not yet done any quality-testing or screening to sort out why this is the case, but hope to do so in future.

In IGB, probes will appear as light-colored bars superimposed on the genomic alignment of the original "design" sequence, which is the sequence provided to Affymetrix that represents an intended target transcript for interrogation on the array. To find out more about how the ATH1 array was designed, see:

Redman et al. (2004) Development and evaluation of an Arabidopsis whole genome Affymetrix probe array.

Note that probes typically occupy positions near the three-prime end of the design sequences, with some exceptions.

The data shown in IGB are from a data file provided by Affymetrix. The file from Affymetrix uses a data representation format that captures gaps or insertions in the design sequence relative to the genomic sequence. When you examine the probe set alignments in IGB, you may see immediately adjacent or overlapping blocks in some probe sets. This means that a portion of the design sequence is missing in the genomic sequence and corresponds to a gap in the genomic sequence relative to the design sequence. Note also that some probes overlap with each other; this is quite common.
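If you want to flag such cases programmatically, one approach is to compare each alignment block with the next one along the genome. The sketch below is only an illustration: it assumes blocks are given as (start, end) pairs in half-open genomic coordinates, which may not match the coordinate convention of the Affymetrix file, and the function name is hypothetical.

```python
def classify_block_junctions(blocks):
    """Label each pair of neighboring blocks in one alignment.

    `blocks` is a list of (start, end) genomic intervals, half-open
    (an assumption). Returns a list of (left, right, kind) tuples,
    where kind is "overlapping", "adjacent", or "separated".
    """
    ordered = sorted(blocks)
    junctions = []
    for left, right in zip(ordered, ordered[1:]):
        if right[0] < left[1]:
            kind = "overlapping"
        elif right[0] == left[1]:
            kind = "adjacent"
        else:
            kind = "separated"
        junctions.append((left, right, kind))
    return junctions
```

Junctions labeled "adjacent" or "overlapping" are the ones that correspond to the cases described above; "separated" junctions are ordinary gaps between blocks, such as introns.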

You can obtain a copy of the ATH1 probe set alignments from the "Support" section of the Affymetrix Web site. It is likely to be identical to the version posted here.


TAIR version 6 annotations

These data were generated from the sequence viewer data files on the TAIR ftp site. We subdivided the annotations into two datasets: annotations that included an open reading frame (TAIRv6prot) and annotations that did not (TAIRv6non-coding).


TAIR version 5 annotations

These were generated using the sequence viewer data files on the TAIR ftp site. We subdivided the annotations into two datasets: annotations that included an open reading frame (TAIRv5prot) and annotations that did not (TAIRv5noncoding).


Genomic sequence

The sequence data are from:

ftp://ftp.arabidopsis.org//home/tair/home/tair/Sequences/whole_chromosomes

and are identical to the Genbank versions listed below, except for the mitochondrial sequence file, which differed in length by one base.

Sequence data files

Chromosome      TAIR sequence file                Genbank equivalent   Size (bp)   IGB .bnib file
1               ATH1_chr1.1con.01222004           NC_003070.5          30432563    chr1.bnib
2               ATH1_chr2.1con.01222004           NC_003071.3          19705359    chr2.bnib
3               ATH1_chr3.1con.01222004           NC_003074.4          23470805    chr3.bnib
4               ATH1_chr4.1con.01222004           NC_003075.3          18585042    chr4.bnib
5               ATH1_chr5.1con.04172003           NC_003076.4          26992728    chr5.bnib
chloroplast     ATH1_chloroplast.1con.01072002    NC_000932.1          154478      chrC.bnib
mitochondrion   ATH1_mitochondria.1con.01072002   Y08501.2             366923      chrM.bnib
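If you download the sequence files yourself, you can check them against the sizes in the table above. This is a minimal sketch that assumes the TAIR files are in FASTA format with one sequence per file; the function name and the dictionary keys are hypothetical, while the sizes are copied from the table.

```python
# Sizes in base pairs, copied from the table above.
EXPECTED_SIZES = {
    "chr1": 30432563,
    "chr2": 19705359,
    "chr3": 23470805,
    "chr4": 18585042,
    "chr5": 26992728,
    "chrC": 154478,
    "chrM": 366923,
}


def fasta_length(handle):
    """Return the number of sequence characters in a FASTA stream.

    Header lines (starting with '>') are ignored, and whitespace at
    the ends of sequence lines is not counted.
    """
    return sum(len(line.strip()) for line in handle if not line.startswith(">"))
```

For example, opening a local copy of ATH1_chr1.1con.01222004 and passing the file handle to fasta_length should give a value that can be compared with EXPECTED_SIZES["chr1"].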

The IGB "bnib" files are compressed versions of the sequence data files. They are compressed to reduce the time they take to load when you click the "Load all sequence" button under the "Data Access" tab in IGB.