FuncVAR: Annotation and functional enrichment of variant sets

In this vignette we will learn to take a set of variants, arbitrarily defined as any set of GenomicRanges, overlap with multiomics datasets and then calculate enrichment against an appropriately chosen background set.

Setup

First we need to import the packages we’ll be using. FunciVar can be accessed directly from our github repositories via biocLite. GenomicRanges and VariantAnnotation packages are available through bioconductor:

Install the required packages, if they are not already availible.

source("https://bioconductor.org/biocLite.R")
biocLite("Simon-Coetzee/StatePaintR", suppressUpdates = TRUE, dependencies = TRUE)
biocLite("Simon-Coetzee/funciVar", suppressUpdates = TRUE, dependencies = TRUE)
biocLite("Simon-Coetzee/motifbreakR", suppressUpdates = TRUE, dependencies = TRUE)
biocLite("SNPlocs.Hsapiens.dbSNP144.GRCh37")

Then load the packages we’ll be using in this workshop.

library(StatePaintR)
library(funciVar)
library(GenomicRanges)
library(VariantAnnotation)
library(UpSetR)

Getting annotations

Annotations can come from any source, the key requirement be that they be coerced into GRanges format for use by the funciVar package. For our purposes, We can search on StateHub for all cell types that are annotated in hg19 with segmentations combining at least, H3K4me1, H3K4me3, H3K27ac, and CTCF.

StateHub Search for hg19, H3K4me1, H3K4me3, H3K27ac, and CTCF

StateHub Search for hg19, H3K4me1, H3K4me3, H3K27ac, and CTCF

Each segmentation file contains a set of chromatin state calls made by our StatePaintR package.

How does StatePaintR work?

Lets look at StatePaintR: slides and rmarkdown

Returning to Analysis

FunciVar imports these files efficiently by pointing a function called “GetSegmentations” directly. GetSegmentations() expects files to be in BED format with column 4 (Name) corresponding to the state call. Then it’s just a matter of renaming the state abbreviations to easily recognized, human readable form.

## Base url of StateHub files in our search
statehub.encode.aws <- "http://s3-us-west-2.amazonaws.com/statehub-trackhub/tracks/5813b67f46e0fb06b493ceb0/hg19/ENCODE/"
## A list of StateHub urls
segmentation.files <- c(
  paste0(statehub.encode.aws,
         "bipolar_spindle_neuron.8mark.segmentation.bed"),
  paste0(statehub.encode.aws, 
         "dohh2.8mark.segmentation.bed"),
  paste0(statehub.encode.aws, 
         "gm12878.11mark.segmentation.bed"),
  paste0(statehub.encode.aws,
         "hepatocyte.9mark.segmentation.bed"),
  paste0(statehub.encode.aws,
         "induced_pluripotent_stem_cell.7mark.segmentation.bed"),
  paste0(statehub.encode.aws, 
         "mcf-7.12mark.segmentation.bed"),
  paste0(statehub.encode.aws,
         "neutrophil.8mark.segmentation.bed"))
## Import States
encode.segmentations <- GetSegmentations(files = segmentation.files)
esegs <- unlist(encode.segmentations)
esegs
GRanges object with 4400992 ranges and 2 metadata columns:
                         seqnames               ranges strand |                 sample       state
                            <Rle>            <IRanges>  <Rle> |            <character> <character>
  bipolar spindle neuron     chr1     [ 54343,  54728]      * | bipolar spindle neuron         SCR
  bipolar spindle neuron     chr1     [ 83693,  84250]      * | bipolar spindle neuron         SCR
  bipolar spindle neuron     chr1     [ 85703,  86097]      * | bipolar spindle neuron         SCR
  bipolar spindle neuron     chr1     [ 89735,  90145]      * | bipolar spindle neuron         SCR
  bipolar spindle neuron     chr1     [713064, 713592]      * | bipolar spindle neuron         PWR
                     ...      ...                  ...    ... .                    ...         ...
              neutrophil     chrY [59023293, 59023989]      * |             neutrophil         HET
              neutrophil     chrY [59026316, 59026627]      * |             neutrophil         HET
              neutrophil     chrY [59029021, 59029372]      * |             neutrophil         HET
              neutrophil     chrM [      37,      405]      * |             neutrophil          AR
              neutrophil     chrM [   15931,    16269]      * |             neutrophil          AR
  -------
  seqinfo: 93 sequences (1 circular) from hg19 genome
## Group States in Human readable collections.
mcols(esegs[esegs$state %in% c("EAR", "AR"), ])$state <- "Active Enhancer"
mcols(esegs[esegs$state %in% c("EPR", "EWR"), ])$state <- "Weak Enhancer"
mcols(esegs[esegs$state %in% c("PAR"), ])$state <- "Active Promoter"
mcols(esegs[esegs$state %in% c("PPR", "PPWR", "PWR"), ])$state <- "Weak Promoter"
mcols(esegs[esegs$state %in% c("CTCF"), ])$state <- "CTCF"
mcols(esegs[esegs$state %in% c("HET", "SCR", "TRS"), ])$state <- "Other"

Importing Variants

There are two potential use cases that we envision for funciVar. In the first case, our enrichment calculations are based on a foreground and background set that are predefined and nothing needs to be done outside of importing to R and coercing into GRanges. Such foreground sets could include any kind of variant; common population polymorphisms or differentially methylated CpGs for example.

In the second case, especially common for primary GWAS studies, a single top “hit” has been identified and a set of variants known to be in linkage disequilibrium (LD) should be calculated. We will cover this latter case. The LD calculations can be used to further subdivide regional variants into foreground and background based on some LD threshold (typically at R² = 0.8). In other instances, more creative methods of obtaining a background set might be warranted.

We will use 1000 genomes data for our LD calculations, but in principle we could apply this to any population stored in VCF. First we need to identify the file containing the 1000 and assign all the sample metadata into a variable called “my.samples”.

my.samples <- read.delim("http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/integrated_call_samples_v3.20130502.ALL.panel", stringsAsFactors = FALSE)
my.samples <- my.samples[, c("sample", "pop", "super_pop", "gender")]
my.samples

Next we will identify the file with chromosome 10 variants and define the index region as a filter for the foreground variant set. For simplicity’s sake, we are testing a single region, but we could easily analyze multiple ranges. For this example, we are using a GWAS variant linked to breast cancer on chromosome 10.

chr10.remote.vcf <- "ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.chr10.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz"
index10q26 <- "rs2981579"
pos10q26 <- GRanges("10:123337335-123337335") + 500000
pos10q26
GRanges object with 1 range and 0 metadata columns:
      seqnames                 ranges strand
         <Rle>              <IRanges>  <Rle>
  [1]       10 [122837335, 123837335]      *
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths

Using the 1MB window that we defined for the index variant, we capture the relevant variants from the VCF file for our LD calculations.

vcf10q26 <- GetVariantsInWindow(file = chr10.remote.vcf, 
                                position = pos10q26, 
                                type = "vcf")

EBI throttles downloads from here pretty hard. So We have a local version of the vcf included also, in case you get the above error.

chr10.local.vcf <- system.file("extdata",
                               "1000genomes.chr10.122836335-123838335.vcf.gz",
                               package="bioc2017")
vcf10q26 <- GetVariantsInWindow(file = chr10.local.vcf, 
                                position = pos10q26, 
                                type = "vcf")

To ensure that we make our LD calculation from the correct population structure, we set the population using metadata from “my.samples”.

vcf10q26 <- SetPopulation(vcf = vcf10q26, sample_sheet = my.samples)
vcf10q26
class: CollapsedVCF 
dim: 31499 2504 
rowRanges(vcf):
  GRanges with 5 metadata columns: paramRangeID, REF, ALT, QUAL, FILTER
info(vcf):
  DataFrame with 27 columns: CIEND, CIPOS, CS, END, IMPRECISE, MC, MEINFO, MEND, MLEN, MSTART, SVLEN, SVTYPE, TSD, AC, AF, NS, AN, ...
info(header(vcf)):
                 Number Type    Description                                                                                           
   CIEND         2      Integer Confidence interval around END for imprecise variants                                                 
   CIPOS         2      Integer Confidence interval around POS for imprecise variants                                                 
   CS            1      String  Source call set.                                                                                      
   END           1      Integer End coordinate of this variant                                                                        
   IMPRECISE     0      Flag    Imprecise structural variation                                                                        
   MC            .      String  Merged calls.                                                                                         
   MEINFO        4      String  Mobile element info of the form NAME,START,END<POLARITY; If there is only 5' OR 3' support for this...
   MEND          1      Integer Mitochondrial end coordinate of inserted sequence                                                     
   MLEN          1      Integer Estimated length of mitochondrial insert                                                              
   MSTART        1      Integer Mitochondrial start coordinate of inserted sequence                                                   
   SVLEN         .      Integer SV length. It is only calculated for structural variation MEIs. For other types of SVs; one may cal...
   SVTYPE        1      String  Type of structural variant                                                                            
   TSD           1      String  Precise Target Site Duplication for bases, if unknown, value will be NULL                             
   AC            A      Integer Total number of alternate alleles in called genotypes                                                 
   AF            A      Float   Estimated allele frequency in the range (0,1)                                                         
   NS            1      Integer Number of samples with data                                                                           
   AN            1      Integer Total number of alleles in called genotypes                                                           
   EAS_AF        A      Float   Allele frequency in the EAS populations calculated from AC and AN, in the range (0,1)                 
   EUR_AF        A      Float   Allele frequency in the EUR populations calculated from AC and AN, in the range (0,1)                 
   AFR_AF        A      Float   Allele frequency in the AFR populations calculated from AC and AN, in the range (0,1)                 
   AMR_AF        A      Float   Allele frequency in the AMR populations calculated from AC and AN, in the range (0,1)                 
   SAS_AF        A      Float   Allele frequency in the SAS populations calculated from AC and AN, in the range (0,1)                 
   DP            1      Integer Total read depth; only low coverage data were counted towards the DP, exome data were not used        
   AA            1      String  Ancestral Allele. Format: AA|REF|ALT|IndelType. AA: Ancestral allele, REF:Reference Allele, ALT:Alt...
   VT            .      String  indicates what type of variant the line represents                                                    
   EX_TARGET     0      Flag    indicates whether a variant is within the exon pull down target boundaries                            
   MULTI_ALLELIC 0      Flag    indicates whether a site is multi-allelic                                                             
geno(vcf):
  SimpleList of length 1: GT
geno(header(vcf)):
      Number Type   Description
   GT 1      String Genotype   

Now we calculate LD structure relative to our index variant in the european population background. This operation only operates on SNVs and will not return indels or other variants.

You will also notice that the rowRanges of the vcf object has been updated with additional data about each variants.

vcf10q26snps <- CalcLD(vcf = vcf10q26, index = index10q26, population = "EUR")
CalcLD only operates on SNVs some of your variants were dropped at this step, set return = 'all', to override this behaviour
Attaching package: 'Matrix'

The following object is masked from 'package:VariantAnnotation':

    expand

The following object is masked from 'package:S4Vectors':

    expand
vcf10q26snps
class: CollapsedVCF 
dim: 30105 503 
rowRanges(vcf):
  GRanges with 10 metadata columns: ref, alt, refAlleleFreq, altAlleleFreq, HWEpvalue, indexSNP, population, distanceToIndex, D.prime, R.squared
info(vcf):
  DataFrame with 27 columns: CIEND, CIPOS, CS, END, IMPRECISE, MC, MEINFO, MEND, MLEN, MSTART, SVLEN, SVTYPE, TSD, AC, AF, NS, AN, ...
info(header(vcf)):
                 Number Type    Description                                                                                           
   CIEND         2      Integer Confidence interval around END for imprecise variants                                                 
   CIPOS         2      Integer Confidence interval around POS for imprecise variants                                                 
   CS            1      String  Source call set.                                                                                      
   END           1      Integer End coordinate of this variant                                                                        
   IMPRECISE     0      Flag    Imprecise structural variation                                                                        
   MC            .      String  Merged calls.                                                                                         
   MEINFO        4      String  Mobile element info of the form NAME,START,END<POLARITY; If there is only 5' OR 3' support for this...
   MEND          1      Integer Mitochondrial end coordinate of inserted sequence                                                     
   MLEN          1      Integer Estimated length of mitochondrial insert                                                              
   MSTART        1      Integer Mitochondrial start coordinate of inserted sequence                                                   
   SVLEN         .      Integer SV length. It is only calculated for structural variation MEIs. For other types of SVs; one may cal...
   SVTYPE        1      String  Type of structural variant                                                                            
   TSD           1      String  Precise Target Site Duplication for bases, if unknown, value will be NULL                             
   AC            A      Integer Total number of alternate alleles in called genotypes                                                 
   AF            A      Float   Estimated allele frequency in the range (0,1)                                                         
   NS            1      Integer Number of samples with data                                                                           
   AN            1      Integer Total number of alleles in called genotypes                                                           
   EAS_AF        A      Float   Allele frequency in the EAS populations calculated from AC and AN, in the range (0,1)                 
   EUR_AF        A      Float   Allele frequency in the EUR populations calculated from AC and AN, in the range (0,1)                 
   AFR_AF        A      Float   Allele frequency in the AFR populations calculated from AC and AN, in the range (0,1)                 
   AMR_AF        A      Float   Allele frequency in the AMR populations calculated from AC and AN, in the range (0,1)                 
   SAS_AF        A      Float   Allele frequency in the SAS populations calculated from AC and AN, in the range (0,1)                 
   DP            1      Integer Total read depth; only low coverage data were counted towards the DP, exome data were not used        
   AA            1      String  Ancestral Allele. Format: AA|REF|ALT|IndelType. AA: Ancestral allele, REF:Reference Allele, ALT:Alt...
   VT            .      String  indicates what type of variant the line represents                                                    
   EX_TARGET     0      Flag    indicates whether a variant is within the exon pull down target boundaries                            
   MULTI_ALLELIC 0      Flag    indicates whether a site is multi-allelic                                                             
geno(vcf):
  SimpleList of length 1: GT
geno(header(vcf)):
      Number Type   Description
   GT 1      String Genotype   
rowRanges(vcf10q26snps)
GRanges object with 30105 ranges and 10 metadata columns:
              seqnames                 ranges strand |            ref             alt refAlleleFreq altAlleleFreq HWEpvalue
                 <Rle>              <IRanges>  <Rle> | <DNAStringSet> <CharacterList>     <numeric>     <numeric> <numeric>
   rs78692379    chr10 [122837367, 122837367]      * |              C               T      0.999006  0.0009940358 0.9821958
  rs535445408    chr10 [122837368, 122837368]      * |              G               A      1.000000  0.0000000000      <NA>
  rs556048793    chr10 [122837392, 122837392]      * |              T               C      1.000000  0.0000000000      <NA>
  rs139516184    chr10 [122837424, 122837424]      * |              C               T      1.000000  0.0000000000      <NA>
  rs553171117    chr10 [122837426, 122837426]      * |              C               G      0.999006  0.0009940358 0.9821958
          ...      ...                    ...    ... .            ...             ...           ...           ...       ...
  rs530559069    chr10 [123837244, 123837244]      * |              G               T     1.0000000   0.000000000      <NA>
  rs550282246    chr10 [123837263, 123837263]      * |              C               T     1.0000000   0.000000000      <NA>
  rs192908348    chr10 [123837273, 123837273]      * |              C               T     1.0000000   0.000000000      <NA>
  rs539947923    chr10 [123837299, 123837299]      * |              G               A     0.9980119   0.001988072 0.9643651
  rs184426385    chr10 [123837319, 123837319]      * |              G               C     1.0000000   0.000000000      <NA>
                 indexSNP  population distanceToIndex   D.prime    R.squared
              <character> <character>       <integer> <numeric>    <numeric>
   rs78692379   rs2981579         EUR          499968         1 0.0008183719
  rs535445408   rs2981579         EUR          499967      <NA>         <NA>
  rs556048793   rs2981579         EUR          499943      <NA>         <NA>
  rs139516184   rs2981579         EUR          499911      <NA>         <NA>
  rs553171117   rs2981579         EUR          499909         1 0.0008183719
          ...         ...         ...             ...       ...          ...
  rs530559069   rs2981579         EUR          499909      <NA>         <NA>
  rs550282246   rs2981579         EUR          499928      <NA>         <NA>
  rs192908348   rs2981579         EUR          499938      <NA>         <NA>
  rs539947923   rs2981579         EUR          499964         1  0.001638374
  rs184426385   rs2981579         EUR          499984      <NA>         <NA>
  -------
  seqinfo: 1 sequence from hg19 genome

Next we use the “SplitVcfLd” as a convenience function to define our foreground and background sets based on LD. This is a sensible choice since other variants in the same region are likely to have similar properties relative to genomic features like gene Density, gc bias, epigenetic marks etc. As an alternative, we might consider using a resource like SNPsnap to match these properties genome wide.

vcf10q26snps <- SplitVcfLd(vcf = vcf10q26snps, ld = c(metric = "R.squared",
                                                      cutoff = 0.8,
                                                      maf = 0.01))

Functional enrichment by chromatin state

In the next part of our workflow, we want to determine whether the index SNP and its LD proxies, which we defined to be in the “foreground”, are enriched in specific cellular features. As alluded to earlier, the FunciVar package could be used to carry out any analagous operation to find enrichment of one set of features in another. Here we plan to use chromatin state calls from 119 human tissues in Roadmap, which we’ve imported at the beginning of our session. We specify that we are using this type of annotation with the “feature.type” argument (alternative “biofeatures” argument can be used for single element features, e.g. a bed file with H3K27ac peaks). Using the “CalculateEnrichment” function we generate an object containing summary enrichment statistics.

Enrichment calculations

FunciVar by default calculates a likelihood based on the beta-binomial distribution, returning a 95% credible interval (optionally set by the “CI” argument) for the range of differences between the two populations of variants (i.e. foreground and background). Specifically it calculates a distribution of true enrichment (as probability of overlap) for both sets of variants in the genomic features based on the observed number of overlaps:

\[\theta_{fg} ~ Beta(S_{fg} + \alpha, N_{fg} + \beta)\] \[\theta_{bg} ~ Beta(S_{bg} + \alpha, N_{bg} + \beta)\]

for S successes in N trials. FunciVar uses an uninformative Jeffreys prior c(a=0.5, b=0.5) to compare the two distributions directly by subtracting permuted samples to obtain the distribution of differences. The prior can be overidden in special cases (see documentation).

Jeffreys prior and flat prior densities

Jeffreys prior and flat prior densities

To perform these operations with funciVar, we call the “CalculateEnrichment” function.

enrich10q26 <- CalculateEnrichment(variants = vcf10q26snps,
                                   features = esegs,
                                   feature.type = "segmentations", 
                                   CI = 0.8, 
                                   return.overlaps = TRUE)

   |                                                  | 0 % ~calculating  
   |+++++++++                                         | 17% ~03s          
   |+++++++++++++++++                                 | 33% ~02s          
   |+++++++++++++++++++++++++                         | 50% ~01s          
   |++++++++++++++++++++++++++++++++++                | 67% ~01s          
   |++++++++++++++++++++++++++++++++++++++++++        | 83% ~00s          
   |++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed = 02s

At first perhaps we are interested which cell types have the most overlaps with with snps, say among active enhancers.

snps.10q26.enhancers <- enrich10q26$overlaps$`Active Enhancer`
fg.snps.df <- as.data.frame(mcols(snps.10q26.enhancers$foregound.overlaps))
bg.snps.df <- as.data.frame(mcols(snps.10q26.enhancers$background.overlaps))
upset(fg.snps.df, nsets = 7)
upset(bg.snps.df, nsets = 7)

Foreground Background

While it looks like the snps are largely falling within hepatocyte active enhancers, if we examine the enrichment statistics we can see otherwise. Now we’ll extract the enrichment stats and plot the enrichment profile of the variants.

enrich10q26.enrich <- enrich10q26$enrichment
PlotEnrichment(enrich10q26.enrich, 
               value = "difference", 
               block1 = "state", 
               color.by = "sample", 
               ncol = 6)

Annotations of interest

While there is a tendency for SNPs in LD with each other to be enriched in some cellular interest, it is probable that only a subset of them are of functional interest. In our case, we would like to examine the hypothesis that one of our putative functional SNPs disrupts a transcription factor binding site.

Since our index SNP was found in association with breast cancer, and we see enrichment of it’s LD proxies in MCF7, we’ll choose linked variants that overlap Active Enhancer in the MCF7 cell line from ENCODE. Segmentation features live in the overlaps slot of the enrichment object from the previous steps.

snps.10q26.enhancers <- snps.10q26.enhancers$foregound.overlaps[snps.10q26.enhancers$foregound.overlaps$MCF.7 > 0, ]
snps.10q26.enhancers
GRanges object with 20 ranges and 7 metadata columns:
             seqnames                 ranges strand | bipolar.spindle.neuron   GM12878 hepatocyte induced.pluripotent.stem.cell
                <Rle>              <IRanges>  <Rle> |              <integer> <integer>  <integer>                     <integer>
  rs72830627    chr10 [123006362, 123006362]      * |                      0         0          0                             0
   rs1896416    chr10 [123084490, 123084490]      * |                      0         0          1                             0
   rs1078806    chr10 [123338975, 123338975]      * |                      1         0          1                             0
  rs11599804    chr10 [123340664, 123340664]      * |                      1         0          1                             0
   rs1219642    chr10 [123348389, 123348389]      * |                      0         0          0                             0
         ...      ...                    ...    ... .                    ...       ...        ...                           ...
  rs34354213    chr10 [123794212, 123794212]      * |                      0         0          0                             0
  rs11200359    chr10 [123794217, 123794217]      * |                      0         0          0                             0
   rs2459091    chr10 [123819539, 123819539]      * |                      0         0          0                             0
   rs2459090    chr10 [123819715, 123819715]      * |                      0         0          1                             0
   rs2461233    chr10 [123819808, 123819808]      * |                      0         0          1                             0
                 MCF.7 neutrophil     DOHH2
             <integer>  <integer> <integer>
  rs72830627         1          0         0
   rs1896416         1          0         0
   rs1078806         1          0         0
  rs11599804         1          0         0
   rs1219642         1          0         0
         ...       ...        ...       ...
  rs34354213         1          0         0
  rs11200359         1          0         0
   rs2459091         1          0         0
   rs2459090         1          0         0
   rs2461233         1          0         0
  -------
  seqinfo: 1 sequence from hg19 genome

MotifBreakR analysis of TF binding disruption

Now that we have a SNP of interest from breast tissue enhancer, let’s work with the motifBreakR package to determine whether a putative transcription factor binding site (TFBS) is disrupted. MotifBreakR allows us to do this with nothing more than the rsID or a location and allele information. It enables us to use any genome housed in bioconductor BSgenome packages and any set of motifs curated in the MotifDB package (plus some custom databases in MotifDB format that are included with the MotifBreakR installation)1.

We also have some slides explaining how motifbreakR works in addition to the paper and vignette in the footnote indicated above.

Setup

We need to import some version of the human genome and a compatible set of variants from Bioconductor.

library(motifbreakR)
library(BSgenome.Hsapiens.UCSC.hg19)
library(SNPlocs.Hsapiens.dbSNP144.GRCh37)

We need to grab the SNP variant objects out of SNPlocs package that correspond to our enhancer proxies.

snps.mb.10q26.enhancers <- snps.from.rsid(rsid = names(snps.10q26.enhancers),
                                          dbSNP = SNPlocs.Hsapiens.dbSNP144.GRCh37,
                                          search.genome = BSgenome.Hsapiens.UCSC.hg19)
the condition has length > 1 and only the first element will be used
snps.mb.10q26.enhancers
GRanges object with 21 ranges and 4 metadata columns:
              seqnames                 ranges strand |      SNP_id alleles_as_ambig            REF            ALT
                 <Rle>              <IRanges>  <Rle> | <character>   <DNAStringSet> <DNAStringSet> <DNAStringSet>
   rs10736308    chr10 [123601849, 123601849]      * |  rs10736308                S              G              C
    rs1078806    chr10 [123338975, 123338975]      * |   rs1078806                R              A              G
   rs11200358    chr10 [123794209, 123794209]      * |  rs11200358                K              G              T
   rs11200359    chr10 [123794217, 123794217]      * |  rs11200359                Y              T              C
  rs113012332    chr10 [123747830, 123747830]      * | rs113012332                Y              C              T
          ...      ...                    ...    ... .         ...              ...            ...            ...
    rs2981584    chr10 [123350216, 123350216]      * |   rs2981584                M              A              C
   rs34354213    chr10 [123794212, 123794212]      * |  rs34354213                Y              C              T
    rs3862115    chr10 [123793061, 123793061]      * |   rs3862115                Y              T              C
   rs72830627    chr10 [123006362, 123006362]      * |  rs72830627                R              A              G
   rs76133150    chr10 [123747500, 123747500]      * |  rs76133150                R              A              G
  -------
  seqinfo: 93 sequences (1 circular) from hg19 genome

Analysis

Now a simple call to the motifbreakR function completes the analysis. Here, we specify that we want to filter our results by p-value (rather than scaled motif scores), we’ll set that p-value threshold at 10^-4, using information content (set argument method = "ic") and a uniform prior on the GC content.

results.10q26.enhancers <- motifbreakR(snpList = snps.mb.10q26.enhancers,
                                       filterp = TRUE,
                                       pwmList = hocomoco,
                                       threshold = 1e-4,
                                       method = "ic",
                                       bkg = c(A=0.25, C=0.25, G=0.25, T=0.25),
                                       BPPARAM = BiocParallel::bpparam())
'*' ranges were treated as '+''*' ranges were treated as '+''*' ranges were treated as '+''*' ranges were treated as '+''*' ranges were treated as '+''*' ranges were treated as '+'
results.10q26.enhancers
GRanges object with 39 ranges and 18 metadata columns:
             seqnames                 ranges strand |            REF            ALT    snpPos  motifPos  geneSymbol  dataSource
                <Rle>              <IRanges>  <Rle> | <DNAStringSet> <DNAStringSet> <integer> <integer> <character> <character>
   rs1219642    chr10 [123348377, 123348389]      - |              T              C 123348389         1       CEBPG    HOCOMOCO
   rs2459091    chr10 [123819533, 123819546]      + |              C              T 123819539         7        ELF3    HOCOMOCO
   rs2459091    chr10 [123819537, 123819548]      + |              C              T 123819539         3        ELK3    HOCOMOCO
   rs2981583    chr10 [123350516, 123350528]      - |              A              G 123350523         6       EPAS1    HOCOMOCO
   rs2981584    chr10 [123350207, 123350218]      - |              A              C 123350216         3       FUBP1    HOCOMOCO
         ...      ...                    ...    ... .            ...            ...       ...       ...         ...         ...
   rs2981584    chr10 [123350202, 123350218]      + |              A              C 123350216        15        TLX1    HOCOMOCO
  rs11200358    chr10 [123794197, 123794218]      + |              G              T 123794209        13      ZBTB7B    HOCOMOCO
  rs11200358    chr10 [123794199, 123794217]      + |              G              T 123794209        11         ZFX    HOCOMOCO
  rs34354213    chr10 [123794199, 123794217]      + |              C              T 123794212        14         ZFX    HOCOMOCO
  rs11200358    chr10 [123794198, 123794212]      - |              G              T 123794209         4      ZNF423    HOCOMOCO
             providerName  providerId                                seqMatch    pctRef    pctAlt  scoreRef  scoreAlt Refpvalue
              <character> <character>                             <character> <numeric> <numeric> <numeric> <numeric> <logical>
   rs1219642     CEBPG_si CEBPG_HUMAN   acaattgcaccaT                         0.9326388 0.9131911  8.110324  7.941205      <NA>
   rs2459091      ELF3_f1  ELF3_HUMAN         gtgataCaggaact                  0.9240643 0.8269081  9.003045  8.056464      <NA>
   rs2459091      ELK3_f1  ELK3_HUMAN             taCaggaactgc                0.9415539 0.8235314  6.955565  6.083694      <NA>
   rs2981583     EPAS1_si EPAS1_HUMAN        cacacgcAcactc                    0.9410674 0.8898237  7.272689  6.876672      <NA>
   rs2981584     FUBP1_f1 FUBP1_HUMAN      acaaagccaAga                       0.7012659 0.9244287  5.395118  7.111998      <NA>
         ...          ...         ...                                     ...       ...       ...       ...       ...       ...
   rs2981584      TLX1_f2  TLX1_HUMAN attgcacaaagccaAga                       0.9011307 0.8206667  4.555611  4.148830      <NA>
  rs11200358     ZBT7B_si ZBT7B_HUMAN   agcactttgggaGgccgaggtg                0.8119717 0.7656583  9.442073  8.903514      <NA>
  rs11200358       ZFX_f1   ZFX_HUMAN     cactttgggaGgccgaggt                 0.9352668 0.7682406 10.638093  8.738271      <NA>
  rs34354213       ZFX_f1   ZFX_HUMAN  cactttgggaggcCgaggt                    0.9352668 0.8312528 10.638093  9.454997      <NA>
  rs11200358     ZN423_f1 ZN423_HUMAN    gcactttgggaGgcc                      0.6739751 0.7503685 14.692071 16.357381      <NA>
             Altpvalue  alleleRef  alleleAlt      effect
             <logical>  <numeric>  <numeric> <character>
   rs1219642      <NA> 0.60000000  0.1485714        weak
   rs2459091      <NA> 0.86507937  0.0000000      strong
   rs2459091      <NA> 0.90566038  0.0000000      strong
   rs2981583      <NA> 0.73914182  0.1676945        weak
   rs2981584      <NA> 0.01904762  0.9809524      strong
         ...       ...        ...        ...         ...
   rs2981584      <NA>  0.6653994 0.01631485        weak
  rs11200358      <NA>  0.7443182 0.00000000      strong
  rs11200358      <NA>  0.9929020 0.00000000      strong
  rs34354213      <NA>  0.8850690 0.02484316      strong
  rs11200358      <NA>  0.0000000 1.00000000      strong
  -------
  seqinfo: 93 sequences (1 circular) from hg19 genome

MotifbreakR does a quick estimate of the p-value threshold initially. To obtain accurate p-values, we run “calculatePvalue” in a seperate step (beware this can be slow!).

pvalue.results.10q26.enhancers <- calculatePvalue(results = results.10q26.enhancers)

Visualization

MotifbreakR includes a plotting function to visualize the SNP in its genomic context, aligned with the motif logos it is purported to alter.

plotMB(results.10q26.enhancers, "rs34354213", effect = "strong")

Here we can see that the variant disrupts a Pax-5 motif which is interesting given the results in this paper Breast Cancer Malignant Processes are Regulated by Pax-5 Through the Disruption of FAK Signaling Pathways


  1. see the manuscript and vignette for details.

---
title: "Variant Annotation Workshop with FunciVAR, StateHub and MotifBreakR"
author: "Simon G. Coetzee & Dennis J. Hazelett"
date: "`r Sys.Date()`"
output: 
  html_notebook:
    highlight: tango
    theme: journal
    toc: yes
    toc_float:
      collapsed: false

vignette: >
  %\VignetteIndexEntry{FunciVar, MotifBreakR and StatePaintR}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

# FuncVAR: Annotation and functional enrichment of variant sets

In this vignette we will learn to take a set of variants, arbitrarily defined as any set of GenomicRanges, overlap with multiomics datasets and then calculate enrichment against an appropriately chosen background set.

## Setup

First we need to import the packages we'll be using. FunciVar can be accessed directly from our github repositories via biocLite. GenomicRanges and VariantAnnotation packages are available through bioconductor:

Install the required packages, if they are not already availible.
```{r, message = FALSE, eval = FALSE}
source("https://bioconductor.org/biocLite.R")
biocLite("Simon-Coetzee/StatePaintR", suppressUpdates = TRUE, dependencies = TRUE)
biocLite("Simon-Coetzee/funciVar", suppressUpdates = TRUE, dependencies = TRUE)
biocLite("Simon-Coetzee/motifbreakR", suppressUpdates = TRUE, dependencies = TRUE)
biocLite("SNPlocs.Hsapiens.dbSNP144.GRCh37")
```
Then load the packages we'll be using in this workshop.
```{r, message = FALSE}
library(StatePaintR)
library(funciVar)
library(GenomicRanges)
library(VariantAnnotation)
library(UpSetR)
```

## Getting annotations

Annotations can come from any source, the key requirement be that they be coerced into GRanges format for use by the funciVar package. For our purposes, We can search on StateHub for all cell types that are annotated in **hg19** with segmentations combining at least, **H3K4me1, H3K4me3, H3K27ac, and CTCF**.

![*StateHub Search for hg19, H3K4me1, H3K4me3, H3K27ac, and CTCF*](statehubsearch.png)

Each segmentation file contains a set of chromatin state calls made by our StatePaintR package.

### How does StatePaintR work?
Lets look at *StatePaintR*: [slides](http://goo.gl/wvPVbh) and [rmarkdown](http://www.statehub.org/statehub_media/statepaintr.nb.html)

### Returning to Analysis
FunciVar imports these files efficiently by pointing a function called "GetSegmentations" directly. `GetSegmentations()` expects files to be in [BED](https://genome.ucsc.edu/FAQ/FAQformat#format1) format with column 4 (Name) corresponding to the state call. Then it's just a matter of renaming the state abbreviations to easily recognized, human readable form.

```{r}
## Base url of StateHub files in our search

statehub.encode.aws <- "http://s3-us-west-2.amazonaws.com/statehub-trackhub/tracks/5813b67f46e0fb06b493ceb0/hg19/ENCODE/"

## A list of StateHub urls

segmentation.files <- c(
  paste0(statehub.encode.aws,
         "bipolar_spindle_neuron.8mark.segmentation.bed"),
  paste0(statehub.encode.aws, 
         "dohh2.8mark.segmentation.bed"),
  paste0(statehub.encode.aws, 
         "gm12878.11mark.segmentation.bed"),
  paste0(statehub.encode.aws,
         "hepatocyte.9mark.segmentation.bed"),
  paste0(statehub.encode.aws,
         "induced_pluripotent_stem_cell.7mark.segmentation.bed"),
  paste0(statehub.encode.aws, 
         "mcf-7.12mark.segmentation.bed"),
  paste0(statehub.encode.aws,
         "neutrophil.8mark.segmentation.bed"))

## Import States

encode.segmentations <- GetSegmentations(files = segmentation.files)
esegs <- unlist(encode.segmentations)
esegs

## Group States in Human readable collections.

mcols(esegs[esegs$state %in% c("EAR", "AR"), ])$state <- "Active Enhancer"
mcols(esegs[esegs$state %in% c("EPR", "EWR"), ])$state <- "Weak Enhancer"
mcols(esegs[esegs$state %in% c("PAR"), ])$state <- "Active Promoter"
mcols(esegs[esegs$state %in% c("PPR", "PPWR", "PWR"), ])$state <- "Weak Promoter"
mcols(esegs[esegs$state %in% c("CTCF"), ])$state <- "CTCF"
mcols(esegs[esegs$state %in% c("HET", "SCR", "TRS"), ])$state <- "Other"
```

## Importing Variants

There are two potential use cases that we envision for funciVar. In the first case, our enrichment calculations are based on a foreground and background set that are predefined and nothing needs to be done outside of importing to R and coercing into GRanges. Such foreground sets could include any kind of variant; common population polymorphisms or differentially methylated CpGs for example. 

In the second case, especially common for primary GWAS studies, a single top "hit" has been identified and a set of variants known to be in linkage disequilibrium (LD) should be calculated. We will cover this latter case. The LD calculations can be used to further subdivide regional variants into foreground and background based on some LD threshold (typically at R&sup2; = 0.8). In other instances, more creative methods of obtaining a background set might be warranted.

We will use 1000 genomes data for our LD calculations, but in principle we could apply this to any population stored in VCF. First we need to identify the file containing the 1000 and assign all the sample metadata into a variable called "my.samples".

```{r}
my.samples <- read.delim("http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/integrated_call_samples_v3.20130502.ALL.panel", stringsAsFactors = FALSE)
my.samples <- my.samples[, c("sample", "pop", "super_pop", "gender")]
my.samples
```

Next we will identify the file with chromosome 10 variants and define the index region as a filter for the foreground variant set. For simplicity's sake, we are testing a single region, but we could easily analyze multiple ranges. For this example, we are using a GWAS variant linked to breast cancer on chromosome 10.

```{r}
chr10.remote.vcf <- "ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.chr10.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz"
index10q26 <- "rs2981579"
pos10q26 <- GRanges("10:123337335-123337335") + 500000
pos10q26
```

Using the 1MB window that we defined for the index variant, we capture the relevant variants from the VCF file for our LD calculations.

```{r, eval=FALSE}
vcf10q26 <- GetVariantsInWindow(file = chr10.remote.vcf, 
                                position = pos10q26, 
                                type = "vcf")
```

EBI throttles downloads from here pretty hard. So We have a local version of the vcf included also, in case you get the above error.

```{r}
chr10.local.vcf <- system.file("extdata",
                               "1000genomes.chr10.122836335-123838335.vcf.gz",
                               package="bioc2017")

vcf10q26 <- GetVariantsInWindow(file = chr10.local.vcf, 
                                position = pos10q26, 
                                type = "vcf")
```

To ensure that we make our LD calculation from the correct population structure, we set the population using metadata from "my.samples".

```{r}
vcf10q26 <- SetPopulation(vcf = vcf10q26, sample_sheet = my.samples)
vcf10q26
```

Now we calculate LD structure relative to our index variant in the european population background. This operation only operates on SNVs and will not return indels or other variants.

You will also notice that the rowRanges of the vcf object has been updated with additional data about each variants.

```{r}
vcf10q26snps <- CalcLD(vcf = vcf10q26, index = index10q26, population = "EUR")
vcf10q26snps
rowRanges(vcf10q26snps)
```

Next we use the "SplitVcfLd" as a convenience function to define our foreground and background sets based on LD. This is a sensible choice since other variants in the same region are likely to have similar properties relative to genomic features like gene Density, gc bias, epigenetic marks etc. As an alternative, we might consider using a resource like SNPsnap to match these properties genome wide.

```{r}
vcf10q26snps <- SplitVcfLd(vcf = vcf10q26snps, ld = c(metric = "R.squared",
                                                      cutoff = 0.8,
                                                      maf = 0.01))
```

## Functional enrichment by chromatin state

In the next part of our workflow, we want to determine whether the index SNP and its LD proxies, which we defined to be in the "foreground", are enriched in specific cellular features. As alluded to earlier, the FunciVar package could be used to carry out any analagous operation to find enrichment of one set of features in another. Here we plan to use chromatin state calls from 119 human tissues in Roadmap, which we've imported at the beginning of our session. We specify that we are using this type of annotation with the "feature.type" argument (alternative "biofeatures" argument can be used for single element features, _e.g._ a bed file with H3K27ac peaks). Using the "CalculateEnrichment" function we generate an object containing summary enrichment statistics.

### Enrichment calculations

FunciVar by default calculates a likelihood based on the beta-binomial distribution, returning a 95% credible interval (optionally set by the "CI" argument) for the range of differences between the two populations of variants (_i.e._ foreground and background). Specifically it calculates a distribution of true enrichment (as probability of overlap) for both sets of variants in the genomic features based on the observed number of overlaps:

$$\theta_{fg} ~ Beta(S_{fg} + \alpha, N_{fg} + \beta)$$
$$\theta_{bg} ~ Beta(S_{bg} + \alpha, N_{bg} + \beta)$$

for S successes in N trials. FunciVar uses an uninformative [Jeffreys prior](https://en.wikipedia.org/wiki/Jeffreys_prior) c(a=0.5, b=0.5) to compare the two distributions directly by subtracting permuted samples to obtain the distribution of differences. The prior can be overidden in special cases (see documentation).

![*Jeffreys prior and flat prior densities*](jeffreysprior.png)

To perform these operations with funciVar, we call the "CalculateEnrichment" function.

```{r}
enrich10q26 <- CalculateEnrichment(variants = vcf10q26snps,
                                   features = esegs,
                                   feature.type = "segmentations", 
                                   CI = 0.8, 
                                   return.overlaps = TRUE)
```

At first perhaps we are interested which cell types have the most overlaps with with snps, say among active enhancers.

```{r}
snps.10q26.enhancers <- enrich10q26$overlaps$`Active Enhancer`
fg.snps.df <- as.data.frame(mcols(snps.10q26.enhancers$foregound.overlaps))
bg.snps.df <- as.data.frame(mcols(snps.10q26.enhancers$background.overlaps))
```
```{r, eval=FALSE}
upset(fg.snps.df, nsets = 7)
upset(bg.snps.df, nsets = 7)
```
**Foreground**
![](foreground.upsetr.png)
**Background**
![](background.upsetr.png)

While it looks like the snps are largely falling within hepatocyte active enhancers, if we examine the enrichment statistics we can see otherwise. Now we'll extract the enrichment stats and plot the enrichment profile of the variants.

```{r, fig.width=6}
enrich10q26.enrich <- enrich10q26$enrichment
PlotEnrichment(enrich10q26.enrich, 
               value = "difference", 
               block1 = "state", 
               color.by = "sample", 
               ncol = 6)
```

### Annotations of interest

While there is a tendency for SNPs in LD with each other to be enriched in some cellular interest, it is probable that only a subset of them are of functional interest. In our case, we would like to examine the hypothesis that one of our putative functional SNPs disrupts a transcription factor binding site. 

Since our index SNP was found in association with breast cancer, and we see enrichment of it's LD proxies in MCF7, we'll choose linked variants that overlap Active Enhancer in the MCF7 cell line from ENCODE. Segmentation features live in the overlaps slot of the enrichment object from the previous steps.

```{r}
snps.10q26.enhancers <- snps.10q26.enhancers$foregound.overlaps[snps.10q26.enhancers$foregound.overlaps$MCF.7 > 0, ]
snps.10q26.enhancers
```

# MotifBreakR analysis of TF binding disruption

Now that we have a SNP of interest from breast tissue enhancer, let's work with the motifBreakR package to determine whether a putative transcription factor binding site (TFBS) is disrupted. MotifBreakR allows us to do this with nothing more than the rsID or a location and allele information. It enables us to use any genome housed in bioconductor BSgenome packages and any set of motifs curated in the MotifDB package (plus some custom databases in MotifDB format that are included with the MotifBreakR installation)[^1].

We also have some slides explaining how [motifbreakR works](http://goo.gl/PJQb4b) in addition to the paper and vignette in the footnote indicated above.

[^1]: see the [manuscript](https://academic.oup.com/bioinformatics/article/31/23/3847/209440/motifbreakR-an-R-Bioconductor-package-for) and [vignette](https://bioconductor.org/packages/release/bioc/vignettes/motifbreakR/inst/doc/motifbreakR-vignette.html) for details.

## Setup

We need to import some version of the human genome and a compatible set of variants from Bioconductor.

```{r}
library(motifbreakR)
library(BSgenome.Hsapiens.UCSC.hg19)
library(SNPlocs.Hsapiens.dbSNP144.GRCh37)
```

We need to grab the SNP variant objects out of SNPlocs package that correspond to our enhancer proxies.

```{r, warnings = FALSE}
snps.mb.10q26.enhancers <- snps.from.rsid(rsid = names(snps.10q26.enhancers),
                                          dbSNP = SNPlocs.Hsapiens.dbSNP144.GRCh37,
                                          search.genome = BSgenome.Hsapiens.UCSC.hg19)
snps.mb.10q26.enhancers
```

## Analysis

Now a simple call to the motifbreakR function completes the analysis. Here, we specify that we want to filter our results by p-value (rather than scaled motif scores), we'll set that p-value threshold at 10^-4, using information content (set argument `method = "ic"`) and a uniform prior on the GC content.

```{r}
results.10q26.enhancers <- motifbreakR(snpList = snps.mb.10q26.enhancers,
                                       filterp = TRUE,
                                       pwmList = hocomoco,
                                       threshold = 1e-4,
                                       method = "ic",
                                       bkg = c(A=0.25, C=0.25, G=0.25, T=0.25),
                                       BPPARAM = BiocParallel::bpparam())
results.10q26.enhancers
```

MotifbreakR does a quick estimate of the p-value threshold initially. To obtain accurate p-values, we run "calculatePvalue" in a seperate step (beware this can be slow!).

```{r, eval=FALSE}
pvalue.results.10q26.enhancers <- calculatePvalue(results = results.10q26.enhancers)
```

## Visualization

MotifbreakR includes a plotting function to visualize the SNP in its genomic context, aligned with the motif logos it is purported to alter.

```{r, fig.width=3, fig.height=5}
plotMB(results.10q26.enhancers, "rs34354213", effect = "strong")
```

Here we can see that the variant disrupts a Pax-5 motif which is interesting given the results in this paper [*Breast Cancer Malignant Processes are Regulated by Pax-5 Through the Disruption of FAK Signaling Pathways*](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5219892/)