Retrovirus Integration Database (RID): a public database for retroviral insertion sites into host genomes
© The Author(s) 2016
Received: 20 May 2016
Accepted: 17 June 2016
Published: 4 July 2016
The NCI Retrovirus Integration Database is a MySQL-based relational database created for storing and retrieving comprehensive information about retroviral integration sites, primarily, but not exclusively, those of HIV-1. The database is accessible to the public for submission or extraction of data originating from experiments aimed at collecting information related to retroviral integration sites, including the site of integration into the host genome, the virus family and subtype, the origin of the sample, gene exons/introns associated with integration, and proviral orientation. Information about the references from which the data were collected is also stored in the database. Tools built into the website can be used to map the integration sites to the UCSC Genome Browser, to plot the integration site patterns on a chromosome, and to display provirus LTRs in their inserted genome sequence. The website is robust and user friendly, and it allows users to query the database and analyze the data dynamically. Availability: https://rid.ncifcrf.gov or http://home.ncifcrf.gov/hivdrp/resources.htm.
Keywords: Retrovirus, HIV, Integration site, Database, Integration site assay, ISA, Expanded clones
For a retrovirus to replicate, the virus must integrate a DNA copy of its genome, producing a provirus in the genome of the infected host cell. Research into host integration sites of retroviral genomes has been on-going for many years [2, 8, 13, 14]. Insertion into regions near host genes can affect the expression of the host gene. If the host gene has an important role in controlling cell growth and division, integration can cause clonal cell expansion, and may be involved in the development of malignancy [1, 12, 15, 18].
We collected retrovirus integration site information from published papers or, when the information was not readily available in the published papers, by directly contacting the authors (see acknowledgements). For consistency, we extracted only the host, chromosome, integration site, virus type or subtype, proviral orientation, and LTR from those datasets, and then performed gene mapping (including intron/exon mapping) using a local gene annotation database derived from NCBI genomes (http://www.ncbi.nlm.nih.gov/genome/). If an integration site is not in a gene, the nearest genes in both directions were mapped and stored in RID. All gene annotations were based on human genome build GRCh37/hg19. For raw data that used older genome builds, the integration sites were converted to hg19 using LiftOver, a genome conversion tool provided by UCSC Genome Bioinformatics (http://genome.ucsc.edu/cgi-bin/hgLiftOver). Provirus orientations have been converted to a uniform standard: if a provirus is integrated in the same orientation as the target chromosome (using the UCSC numbering convention), it is defined as “+”; otherwise, it is defined as “−”.
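The gene-mapping rule described above (report the gene that contains a site, or else the nearest gene on each side) can be illustrated with a minimal Python sketch. This is not the RID implementation: the `Gene` class, the `annotate_site` function, and the gene names and coordinates are all hypothetical, and the sketch assumes a single chromosome with inclusive gene coordinates.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Gene:
    """Hypothetical gene annotation on one chromosome."""
    name: str
    start: int  # first base of the gene (inclusive)
    end: int    # last base of the gene (inclusive)


def annotate_site(position: int, genes: List[Gene]) -> dict:
    """Return the genes containing the site, or the nearest gene on each side."""
    containing = [g.name for g in genes if g.start <= position <= g.end]
    if containing:
        return {"in_genes": containing, "nearest_left": None, "nearest_right": None}

    # The site is intergenic: find the closest gene ending before it
    # and the closest gene starting after it.
    left = [g for g in genes if g.end < position]
    right = [g for g in genes if g.start > position]
    nearest_left = max(left, key=lambda g: g.end, default=None)
    nearest_right = min(right, key=lambda g: g.start, default=None)
    return {
        "in_genes": [],
        "nearest_left": nearest_left.name if nearest_left is not None else None,
        "nearest_right": nearest_right.name if nearest_right is not None else None,
    }


# Invented annotations, for illustration only.
GENES = [Gene("GENE_A", 100, 200), Gene("GENE_B", 500, 600)]

print(annotate_site(150, GENES))  # site inside GENE_A
print(annotate_site(300, GENES))  # intergenic site between GENE_A and GENE_B
```

A linear scan keeps the sketch short; a real annotation set with many genes would more likely use sorted coordinates or an interval index.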
RID provides a common place to store and retrieve information describing retroviral integration sites. It is intended for public use and requires no login information. The database stores information on the sites of retroviral integrations into host genomes, the host type, virus type and subtype, a description of the sample origin, such as tissue type, and the reference from which the data originated. The integration site information is presented in a table that includes the host chromosome number, the specific coordinates of integration, the nearest gene, whether the integration site was identified from the retroviral 5′LTR or 3′LTR, and, if the integration site is in a gene, whether it is in an exon or an intron. Currently, RID includes valid data from retroviral insertion sites of HIV-1, HTLV-1, and MLV from multiple publications [4, 5, 7, 9–12, 14, 16, 18], and the database is intended to include integration site information from other retroviruses as more data become available. All of the data in RID have been mapped to a recent completely annotated genome build for the specific host, for example, human genome hg19 for HIV-1 and HTLV-1.
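As a rough illustration of how the fields listed above could be laid out in a relational table, the following sketch uses Python's built-in `sqlite3` module so that it runs without a server. RID itself is MySQL-based, and its actual schema is not described here: the table name, column names, constraints, and the example row are all assumptions made for illustration.

```python
import sqlite3

# Hypothetical single-table layout; names and constraints are illustrative only.
SCHEMA = """
CREATE TABLE integration_site (
    id           INTEGER PRIMARY KEY,
    host         TEXT NOT NULL,
    virus        TEXT NOT NULL,
    subtype      TEXT,
    chromosome   TEXT NOT NULL,
    position     INTEGER NOT NULL,
    orientation  TEXT CHECK (orientation IN ('+', '-')),
    ltr          TEXT CHECK (ltr IN ('5LTR', '3LTR')),
    nearest_gene TEXT,
    gene_region  TEXT CHECK (gene_region IN ('exon', 'intron')),
    tissue       TEXT,
    pubmed_id    TEXT
)
"""

conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)

# Invented example row (not real RID data).
conn.execute(
    "INSERT INTO integration_site "
    "(host, virus, subtype, chromosome, position, orientation, ltr, "
    " nearest_gene, gene_region, tissue, pubmed_id) "
    "VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)",
    ("human", "HIV-1", "B", "chr1", 123456, "+", "3LTR",
     "GENE_A", "intron", "PBMC", "PMID_1"),
)

rows = conn.execute(
    "SELECT chromosome, position, orientation, gene_region "
    "FROM integration_site WHERE virus = ?",
    ("HIV-1",),
).fetchall()
print(rows)
```

The `gene_region` column is left NULL for intergenic sites, and the orientation is stored with an ASCII hyphen rather than the typographic minus sign used in the text.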
Accessing information on the database
The database can be accessed using current versions of web browsers, including Internet Explorer, Chrome, Firefox, and Safari. It is compatible with PC, Mac, iPad, and cellphones. The main menu for the RID web interface is divided into five sections (Fig. 1): Choose virus and subtype, Choose host and chromosomes, Query options, Integration site information selection, and Advanced queries. The main menu allows users to access data by searching for integration sites for a specific virus or a specific viral subtype in the “Choose virus and subtype” section. Users can then access the data by selecting a specific host type and one or all of the chromosomes from the “Choose host and chromosomes” section. Finally, users can select the “Submit Query” button to display the query result.
Users can limit their query by choosing an option in the “Query options” section. For example, a nucleotide position range on a specific chromosome can be chosen to search for integration sites within a specific region of the host genome. To narrow the query further, users can search for integration sites by gene, by the PubMed ID of one or two specific publications, or by sample name or tissue type. The “ADVANCED QUERIES” section can be used to find integrations that have been reported in the same genes across multiple studies.
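The two kinds of query described above, a position range on one chromosome and genes reported across multiple studies, can be sketched in a few lines of Python. This does not reflect how the RID website is implemented: the record layout, the function names `sites_in_region` and `genes_in_multiple_studies`, and the example records are hypothetical.

```python
from collections import defaultdict

# Invented records, for illustration only (not real RID data).
SITES = [
    {"chromosome": "chr1", "position": 1500, "gene": "GENE_A", "pubmed_id": "PMID_1"},
    {"chromosome": "chr1", "position": 9000, "gene": "GENE_B", "pubmed_id": "PMID_1"},
    {"chromosome": "chr1", "position": 1800, "gene": "GENE_A", "pubmed_id": "PMID_2"},
    {"chromosome": "chr2", "position": 1600, "gene": "GENE_C", "pubmed_id": "PMID_2"},
]


def sites_in_region(sites, chromosome, start, end):
    """Keep sites on the given chromosome within [start, end], inclusive."""
    return [
        s for s in sites
        if s["chromosome"] == chromosome and start <= s["position"] <= end
    ]


def genes_in_multiple_studies(sites, min_studies=2):
    """Return genes with integrations reported in at least min_studies publications."""
    studies_by_gene = defaultdict(set)
    for s in sites:
        studies_by_gene[s["gene"]].add(s["pubmed_id"])
    return sorted(g for g, ids in studies_by_gene.items() if len(ids) >= min_studies)


print(sites_in_region(SITES, "chr1", 1000, 2000))
print(genes_in_multiple_studies(SITES))
```

Counting distinct publication identifiers per gene, rather than counting sites, is what separates a gene seen in several studies from a gene hit many times within one study.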
Uploading data to the database
Users are encouraged to submit their published data to RID. Detailed submission instructions and templates can be accessed in the Data Submission tab (Fig. 1). Generally speaking, only data from published peer-reviewed studies will be accepted and made available on the website. We reserve the right not to post data if inspection of the submitted data shows that there are obvious problems with the dataset. In that case, we would contact the authors for clarification.
We have built a large-scale, robust relational database, the Retrovirus Integration Database (RID), to store publicly available retrovirus integration site data. Users can query all available integration sites or analyze integration sites in specific chromosomes, genes, tissues, etc. Several tools built into the website help map integration sites to the UCSC Genome Browser, plot integration sites on particular chromosomes, and determine the flanking host sequences. This database can be used to facilitate meta-analyses of retrovirus integration sites and their chromosomal distribution.
WS initiated the project, designed the database and web interface, and wrote scripts to construct the database and analysis tools. JS designed the database, and designed and wrote scripts to construct the web interface to interact with the database. MK, XW, FM, JWM, BL, JMC, and SHH contributed to the design of the database. All authors read and approved the final manuscript.
JMC was a Research Professor of the American Cancer Society.
The authors thank Anne Arthur for adding RID to the NCI HIV DRP web page (http://home.ncifcrf.gov/hivdrp/resources.html), and Jon Spindler, David Wells, Shawn Hill, Valarie Boltz, Ann Wiegand, and Uma Mudunuri for their valuable discussions and advice. The authors thank Lucy B. Cook and C. R. Bangham for providing HTLV-1 integration sites and Matthew C. LaFave and M. Burgess for providing MLV integration sites. We thank Connie Kinna and Valerie Turnquist for administrative support.
The authors declare that they have no competing interests.
We acknowledge the funding sources for this study from NCI CCR, the Office of AIDS Research, NIH, and NCI Contract No. HHSN261200800001E.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
- Biasco L, Baricordi C, Aiuti A. Retroviral integrations in gene therapy trials. Mol Ther. 2012;20:709–16.
- Coffin JM, Hughes SH, Varmus HE. Retroviruses. Cold Spring Harbor: Cold Spring Harbor Laboratory Press; 1997.
- Cole CG, McCann OT, Collins JE, Oliver K, Willey D, Gribble SM, Yang F, McLaren K, Rogers J, Ning Z, Beare DM, Dunham I. Finishing the finished human chromosome 22 sequence. Genome Biol. 2008;9:R78.
- Cook LB, Melamed A, Niederer H, Valganon M, Laydon D, Foroni L, Taylor GP, Matsuoka M, Bangham CR. The role of HTLV-1 clonality, proviral structure, and genomic integration site in adult T-cell leukemia/lymphoma. Blood. 2014;123:3925–31.
- De Ravin SS, Su L, Theobald N, Choi U, Macpherson JL, Poidinger M, Symonds G, Pond SM, Ferris AL, Hughes SH, Malech HL, Wu X. Enhancers are major targets for murine leukemia virus vector integration. J Virol. 2014;88:4504–13.
- Dunham I, Shimizu N, Roe BA, Chissoe S, Hunt AR, Collins JE, Bruskiewich R, Beare DM, Clamp M, Smink LJ, Ainscough R, Almeida JP, Babbage A, Bagguley C, Bailey J, Barlow K, Bates KN, Beasley O, Bird CP, Blakey S, Bridgeman AM, Buck D, Burgess J, Burrill WD, O’Brien KP, et al. The DNA sequence of human chromosome 22. Nature. 1999;402:489–95.
- Han Y, Lassen K, Monie D, Sedaghat AR, Shimoji S, Liu X, Pierson TC, Margolick JB, Siliciano RF, Siliciano JD. Resting CD4+ T cells from human immunodeficiency virus type 1 (HIV-1)-infected individuals carry integrated HIV-1 genomes within actively transcribed host genes. J Virol. 2004;78:6122–33.
- Hughes SH, Shank PR, Spector DH, Kung HJ, Bishop JM, Varmus HE, Vogt PK, Breitman ML. Proviruses of avian sarcoma virus are terminally redundant, co-extensive with unintegrated linear DNA and integrated at many sites. Cell. 1978;15:1397–410.
- Ikeda T, Shibata J, Yoshimura K, Koito A, Matsushita S. Recurrent HIV-1 integration at the BACH2 locus in resting CD4+ T cell populations during effective highly active antiretroviral therapy. J Infect Dis. 2007;195:716–25.
- LaFave MC, Varshney GK, Gildea DE, Wolfsberg TG, Baxevanis AD, Burgess SM. MLV integration site selection is driven by strong enhancers and active promoters. Nucleic Acids Res. 2014;42:4257–69.
- Mack KD, Jin X, Yu S, Wei R, Kapp L, Green C, Herndier B, Abbey NW, Elbaggari A, Liu Y, McGrath MS. HIV insertions within and proximal to host cell genes are a common finding in tissues containing high levels of HIV DNA and macrophage-associated p24 antigen expression. J Acquir Immune Defic Syndr. 2003;33:308–20.
- Maldarelli F, Wu X, Su L, Simonetti FR, Shao W, Hill S, Spindler J, Ferris AL, Mellors JW, Kearney MF, Coffin JM, Hughes SH. HIV latency. Specific HIV integration sites are linked to clonal expansion and persistence of infected cells. Science. 2014;345:179–83.
- Serrao E, Engelman AN. Sites of retroviral DNA integration: from basic research to clinical applications. Crit Rev Biochem Mol Biol. 2016;51:26–42.
- Sherrill-Mix S, Lewinski MK, Famiglietti M, Bosque A, Malani N, Ocwieja KE, Berry CC, Looney D, Shan L, Agosto LM, Pace MJ, Siliciano RF, O’Doherty U, Guatelli J, Planelles V, Bushman FD. HIV latency and integration site placement in five cell-based models. Retrovirology. 2013;10:90.
- Shin MS, Fredrickson TN, Hartley JW, Suzuki T, Akagi K, Morse HC 3rd. High-throughput retroviral tagging for identification of genes involved in initiation and progression of mouse splenic marginal zone lymphomas. Cancer Res. 2004;64:4419–27.
- Singh PK, Plumb MR, Ferris AL, Iben JR, Wu X, Fadel HJ, Luke BT, Esnault C, Poeschla EM, Hughes SH, Kvaratskhelia M, Levin HL. LEDGF/p75 interacts with mRNA splicing factors and targets HIV-1 integration to highly spliced genes. Genes Dev. 2015;29:2287–97.
- Sunshine S, Kirchner R, Amr SS, Mansur L, Shakhbatyan R, Kim M, Bosque A, Siliciano RF, Planelles V, Hofmann O, Ho Sui S, Li JZ. HIV integration site analysis of cellular models of HIV latency with a probe-enriched next-generation sequencing assay. J Virol. 2016;90:4511–9.
- Wagner TA, McLaughlin S, Garg K, Cheung CY, Larsen BB, Styrchak S, Huang HC, Edlefsen PT, Mullins JI, Frenkel LM. HIV latency. Proliferation of cells with HIV integrated into cancer genes contributes to persistent infection. Science. 2014;345:570–3.
- Wang GP, Ciuffi A, Leipzig J, Berry CC, Bushman FD. HIV integration site selection: analysis by massively parallel pyrosequencing reveals association with epigenetic modifications. Genome Res. 2007;17:1186–94.
- Wang H, Jurado KA, Wu X, Shun MC, Li X, Ferris AL, Smith SJ, Patel PA, Fuchs JR, Cherepanov P, Kvaratskhelia M, Hughes SH, Engelman A. HRP2 determines the efficiency and specificity of HIV-1 integration in LEDGF/p75 knockout cells but does not contribute to the antiviral activity of a potent LEDGF/p75-binding site integrase inhibitor. Nucleic Acids Res. 2012;40:11518–30.