In this vignette we will learn to take a set of variants, arbitrarily defined as any set of GenomicRanges, overlap with multiomics datasets and then calculate enrichment against an appropriately chosen background set.
First we need to import the packages we’ll be using. FunciVar can be accessed directly from our github repositories via biocLite. GenomicRanges and VariantAnnotation packages are available through bioconductor:
Install the required packages, if they are not already availible.
source("https://bioconductor.org/biocLite.R")
biocLite("Simon-Coetzee/StatePaintR", suppressUpdates = TRUE, dependencies = TRUE)
biocLite("Simon-Coetzee/funciVar", suppressUpdates = TRUE, dependencies = TRUE)
biocLite("Simon-Coetzee/motifbreakR", suppressUpdates = TRUE, dependencies = TRUE)
biocLite("SNPlocs.Hsapiens.dbSNP144.GRCh37")
Then load the packages we’ll be using in this workshop.
library(StatePaintR)
library(funciVar)
library(GenomicRanges)
library(VariantAnnotation)
library(UpSetR)
Annotations can come from any source, the key requirement be that they be coerced into GRanges format for use by the funciVar package. For our purposes, We can search on StateHub for all cell types that are annotated in hg19 with segmentations combining at least, H3K4me1, H3K4me3, H3K27ac, and CTCF.
Each segmentation file contains a set of chromatin state calls made by our StatePaintR package.
FunciVar imports these files efficiently by pointing a function called “GetSegmentations” directly. GetSegmentations()
expects files to be in BED format with column 4 (Name) corresponding to the state call. Then it’s just a matter of renaming the state abbreviations to easily recognized, human readable form.
## Base url of StateHub files in our search
statehub.encode.aws <- "http://s3-us-west-2.amazonaws.com/statehub-trackhub/tracks/5813b67f46e0fb06b493ceb0/hg19/ENCODE/"
## A list of StateHub urls
segmentation.files <- c(
paste0(statehub.encode.aws,
"bipolar_spindle_neuron.8mark.segmentation.bed"),
paste0(statehub.encode.aws,
"dohh2.8mark.segmentation.bed"),
paste0(statehub.encode.aws,
"gm12878.11mark.segmentation.bed"),
paste0(statehub.encode.aws,
"hepatocyte.9mark.segmentation.bed"),
paste0(statehub.encode.aws,
"induced_pluripotent_stem_cell.7mark.segmentation.bed"),
paste0(statehub.encode.aws,
"mcf-7.12mark.segmentation.bed"),
paste0(statehub.encode.aws,
"neutrophil.8mark.segmentation.bed"))
## Import States
encode.segmentations <- GetSegmentations(files = segmentation.files)
esegs <- unlist(encode.segmentations)
esegs
GRanges object with 4400992 ranges and 2 metadata columns:
seqnames ranges strand | sample state
<Rle> <IRanges> <Rle> | <character> <character>
bipolar spindle neuron chr1 [ 54343, 54728] * | bipolar spindle neuron SCR
bipolar spindle neuron chr1 [ 83693, 84250] * | bipolar spindle neuron SCR
bipolar spindle neuron chr1 [ 85703, 86097] * | bipolar spindle neuron SCR
bipolar spindle neuron chr1 [ 89735, 90145] * | bipolar spindle neuron SCR
bipolar spindle neuron chr1 [713064, 713592] * | bipolar spindle neuron PWR
... ... ... ... . ... ...
neutrophil chrY [59023293, 59023989] * | neutrophil HET
neutrophil chrY [59026316, 59026627] * | neutrophil HET
neutrophil chrY [59029021, 59029372] * | neutrophil HET
neutrophil chrM [ 37, 405] * | neutrophil AR
neutrophil chrM [ 15931, 16269] * | neutrophil AR
-------
seqinfo: 93 sequences (1 circular) from hg19 genome
## Group States in Human readable collections.
mcols(esegs[esegs$state %in% c("EAR", "AR"), ])$state <- "Active Enhancer"
mcols(esegs[esegs$state %in% c("EPR", "EWR"), ])$state <- "Weak Enhancer"
mcols(esegs[esegs$state %in% c("PAR"), ])$state <- "Active Promoter"
mcols(esegs[esegs$state %in% c("PPR", "PPWR", "PWR"), ])$state <- "Weak Promoter"
mcols(esegs[esegs$state %in% c("CTCF"), ])$state <- "CTCF"
mcols(esegs[esegs$state %in% c("HET", "SCR", "TRS"), ])$state <- "Other"
There are two potential use cases that we envision for funciVar. In the first case, our enrichment calculations are based on a foreground and background set that are predefined and nothing needs to be done outside of importing to R and coercing into GRanges. Such foreground sets could include any kind of variant; common population polymorphisms or differentially methylated CpGs for example.
In the second case, especially common for primary GWAS studies, a single top “hit” has been identified and a set of variants known to be in linkage disequilibrium (LD) should be calculated. We will cover this latter case. The LD calculations can be used to further subdivide regional variants into foreground and background based on some LD threshold (typically at R² = 0.8). In other instances, more creative methods of obtaining a background set might be warranted.
We will use 1000 genomes data for our LD calculations, but in principle we could apply this to any population stored in VCF. First we need to identify the file containing the 1000 and assign all the sample metadata into a variable called “my.samples”.
my.samples <- read.delim("http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/integrated_call_samples_v3.20130502.ALL.panel", stringsAsFactors = FALSE)
my.samples <- my.samples[, c("sample", "pop", "super_pop", "gender")]
my.samples
Next we will identify the file with chromosome 10 variants and define the index region as a filter for the foreground variant set. For simplicity’s sake, we are testing a single region, but we could easily analyze multiple ranges. For this example, we are using a GWAS variant linked to breast cancer on chromosome 10.
chr10.remote.vcf <- "ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.chr10.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz"
index10q26 <- "rs2981579"
pos10q26 <- GRanges("10:123337335-123337335") + 500000
pos10q26
GRanges object with 1 range and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
[1] 10 [122837335, 123837335] *
-------
seqinfo: 1 sequence from an unspecified genome; no seqlengths
Using the 1MB window that we defined for the index variant, we capture the relevant variants from the VCF file for our LD calculations.
vcf10q26 <- GetVariantsInWindow(file = chr10.remote.vcf,
position = pos10q26,
type = "vcf")
EBI throttles downloads from here pretty hard. So We have a local version of the vcf included also, in case you get the above error.
chr10.local.vcf <- system.file("extdata",
"1000genomes.chr10.122836335-123838335.vcf.gz",
package="bioc2017")
vcf10q26 <- GetVariantsInWindow(file = chr10.local.vcf,
position = pos10q26,
type = "vcf")
To ensure that we make our LD calculation from the correct population structure, we set the population using metadata from “my.samples”.
vcf10q26 <- SetPopulation(vcf = vcf10q26, sample_sheet = my.samples)
vcf10q26
class: CollapsedVCF
dim: 31499 2504
rowRanges(vcf):
GRanges with 5 metadata columns: paramRangeID, REF, ALT, QUAL, FILTER
info(vcf):
DataFrame with 27 columns: CIEND, CIPOS, CS, END, IMPRECISE, MC, MEINFO, MEND, MLEN, MSTART, SVLEN, SVTYPE, TSD, AC, AF, NS, AN, ...
info(header(vcf)):
Number Type Description
CIEND 2 Integer Confidence interval around END for imprecise variants
CIPOS 2 Integer Confidence interval around POS for imprecise variants
CS 1 String Source call set.
END 1 Integer End coordinate of this variant
IMPRECISE 0 Flag Imprecise structural variation
MC . String Merged calls.
MEINFO 4 String Mobile element info of the form NAME,START,END<POLARITY; If there is only 5' OR 3' support for this...
MEND 1 Integer Mitochondrial end coordinate of inserted sequence
MLEN 1 Integer Estimated length of mitochondrial insert
MSTART 1 Integer Mitochondrial start coordinate of inserted sequence
SVLEN . Integer SV length. It is only calculated for structural variation MEIs. For other types of SVs; one may cal...
SVTYPE 1 String Type of structural variant
TSD 1 String Precise Target Site Duplication for bases, if unknown, value will be NULL
AC A Integer Total number of alternate alleles in called genotypes
AF A Float Estimated allele frequency in the range (0,1)
NS 1 Integer Number of samples with data
AN 1 Integer Total number of alleles in called genotypes
EAS_AF A Float Allele frequency in the EAS populations calculated from AC and AN, in the range (0,1)
EUR_AF A Float Allele frequency in the EUR populations calculated from AC and AN, in the range (0,1)
AFR_AF A Float Allele frequency in the AFR populations calculated from AC and AN, in the range (0,1)
AMR_AF A Float Allele frequency in the AMR populations calculated from AC and AN, in the range (0,1)
SAS_AF A Float Allele frequency in the SAS populations calculated from AC and AN, in the range (0,1)
DP 1 Integer Total read depth; only low coverage data were counted towards the DP, exome data were not used
AA 1 String Ancestral Allele. Format: AA|REF|ALT|IndelType. AA: Ancestral allele, REF:Reference Allele, ALT:Alt...
VT . String indicates what type of variant the line represents
EX_TARGET 0 Flag indicates whether a variant is within the exon pull down target boundaries
MULTI_ALLELIC 0 Flag indicates whether a site is multi-allelic
geno(vcf):
SimpleList of length 1: GT
geno(header(vcf)):
Number Type Description
GT 1 String Genotype
Now we calculate LD structure relative to our index variant in the european population background. This operation only operates on SNVs and will not return indels or other variants.
You will also notice that the rowRanges of the vcf object has been updated with additional data about each variants.
vcf10q26snps <- CalcLD(vcf = vcf10q26, index = index10q26, population = "EUR")
CalcLD only operates on SNVs some of your variants were dropped at this step, set return = 'all', to override this behaviour
Attaching package: 'Matrix'
The following object is masked from 'package:VariantAnnotation':
expand
The following object is masked from 'package:S4Vectors':
expand
vcf10q26snps
class: CollapsedVCF
dim: 30105 503
rowRanges(vcf):
GRanges with 10 metadata columns: ref, alt, refAlleleFreq, altAlleleFreq, HWEpvalue, indexSNP, population, distanceToIndex, D.prime, R.squared
info(vcf):
DataFrame with 27 columns: CIEND, CIPOS, CS, END, IMPRECISE, MC, MEINFO, MEND, MLEN, MSTART, SVLEN, SVTYPE, TSD, AC, AF, NS, AN, ...
info(header(vcf)):
Number Type Description
CIEND 2 Integer Confidence interval around END for imprecise variants
CIPOS 2 Integer Confidence interval around POS for imprecise variants
CS 1 String Source call set.
END 1 Integer End coordinate of this variant
IMPRECISE 0 Flag Imprecise structural variation
MC . String Merged calls.
MEINFO 4 String Mobile element info of the form NAME,START,END<POLARITY; If there is only 5' OR 3' support for this...
MEND 1 Integer Mitochondrial end coordinate of inserted sequence
MLEN 1 Integer Estimated length of mitochondrial insert
MSTART 1 Integer Mitochondrial start coordinate of inserted sequence
SVLEN . Integer SV length. It is only calculated for structural variation MEIs. For other types of SVs; one may cal...
SVTYPE 1 String Type of structural variant
TSD 1 String Precise Target Site Duplication for bases, if unknown, value will be NULL
AC A Integer Total number of alternate alleles in called genotypes
AF A Float Estimated allele frequency in the range (0,1)
NS 1 Integer Number of samples with data
AN 1 Integer Total number of alleles in called genotypes
EAS_AF A Float Allele frequency in the EAS populations calculated from AC and AN, in the range (0,1)
EUR_AF A Float Allele frequency in the EUR populations calculated from AC and AN, in the range (0,1)
AFR_AF A Float Allele frequency in the AFR populations calculated from AC and AN, in the range (0,1)
AMR_AF A Float Allele frequency in the AMR populations calculated from AC and AN, in the range (0,1)
SAS_AF A Float Allele frequency in the SAS populations calculated from AC and AN, in the range (0,1)
DP 1 Integer Total read depth; only low coverage data were counted towards the DP, exome data were not used
AA 1 String Ancestral Allele. Format: AA|REF|ALT|IndelType. AA: Ancestral allele, REF:Reference Allele, ALT:Alt...
VT . String indicates what type of variant the line represents
EX_TARGET 0 Flag indicates whether a variant is within the exon pull down target boundaries
MULTI_ALLELIC 0 Flag indicates whether a site is multi-allelic
geno(vcf):
SimpleList of length 1: GT
geno(header(vcf)):
Number Type Description
GT 1 String Genotype
rowRanges(vcf10q26snps)
GRanges object with 30105 ranges and 10 metadata columns:
seqnames ranges strand | ref alt refAlleleFreq altAlleleFreq HWEpvalue
<Rle> <IRanges> <Rle> | <DNAStringSet> <CharacterList> <numeric> <numeric> <numeric>
rs78692379 chr10 [122837367, 122837367] * | C T 0.999006 0.0009940358 0.9821958
rs535445408 chr10 [122837368, 122837368] * | G A 1.000000 0.0000000000 <NA>
rs556048793 chr10 [122837392, 122837392] * | T C 1.000000 0.0000000000 <NA>
rs139516184 chr10 [122837424, 122837424] * | C T 1.000000 0.0000000000 <NA>
rs553171117 chr10 [122837426, 122837426] * | C G 0.999006 0.0009940358 0.9821958
... ... ... ... . ... ... ... ... ...
rs530559069 chr10 [123837244, 123837244] * | G T 1.0000000 0.000000000 <NA>
rs550282246 chr10 [123837263, 123837263] * | C T 1.0000000 0.000000000 <NA>
rs192908348 chr10 [123837273, 123837273] * | C T 1.0000000 0.000000000 <NA>
rs539947923 chr10 [123837299, 123837299] * | G A 0.9980119 0.001988072 0.9643651
rs184426385 chr10 [123837319, 123837319] * | G C 1.0000000 0.000000000 <NA>
indexSNP population distanceToIndex D.prime R.squared
<character> <character> <integer> <numeric> <numeric>
rs78692379 rs2981579 EUR 499968 1 0.0008183719
rs535445408 rs2981579 EUR 499967 <NA> <NA>
rs556048793 rs2981579 EUR 499943 <NA> <NA>
rs139516184 rs2981579 EUR 499911 <NA> <NA>
rs553171117 rs2981579 EUR 499909 1 0.0008183719
... ... ... ... ... ...
rs530559069 rs2981579 EUR 499909 <NA> <NA>
rs550282246 rs2981579 EUR 499928 <NA> <NA>
rs192908348 rs2981579 EUR 499938 <NA> <NA>
rs539947923 rs2981579 EUR 499964 1 0.001638374
rs184426385 rs2981579 EUR 499984 <NA> <NA>
-------
seqinfo: 1 sequence from hg19 genome
Next we use the “SplitVcfLd” as a convenience function to define our foreground and background sets based on LD. This is a sensible choice since other variants in the same region are likely to have similar properties relative to genomic features like gene Density, gc bias, epigenetic marks etc. As an alternative, we might consider using a resource like SNPsnap to match these properties genome wide.
vcf10q26snps <- SplitVcfLd(vcf = vcf10q26snps, ld = c(metric = "R.squared",
cutoff = 0.8,
maf = 0.01))
In the next part of our workflow, we want to determine whether the index SNP and its LD proxies, which we defined to be in the “foreground”, are enriched in specific cellular features. As alluded to earlier, the FunciVar package could be used to carry out any analagous operation to find enrichment of one set of features in another. Here we plan to use chromatin state calls from 119 human tissues in Roadmap, which we’ve imported at the beginning of our session. We specify that we are using this type of annotation with the “feature.type” argument (alternative “biofeatures” argument can be used for single element features, e.g. a bed file with H3K27ac peaks). Using the “CalculateEnrichment” function we generate an object containing summary enrichment statistics.
FunciVar by default calculates a likelihood based on the beta-binomial distribution, returning a 95% credible interval (optionally set by the “CI” argument) for the range of differences between the two populations of variants (i.e. foreground and background). Specifically it calculates a distribution of true enrichment (as probability of overlap) for both sets of variants in the genomic features based on the observed number of overlaps:
\[\theta_{fg} ~ Beta(S_{fg} + \alpha, N_{fg} + \beta)\] \[\theta_{bg} ~ Beta(S_{bg} + \alpha, N_{bg} + \beta)\]
for S successes in N trials. FunciVar uses an uninformative Jeffreys prior c(a=0.5, b=0.5) to compare the two distributions directly by subtracting permuted samples to obtain the distribution of differences. The prior can be overidden in special cases (see documentation).
To perform these operations with funciVar, we call the “CalculateEnrichment” function.
enrich10q26 <- CalculateEnrichment(variants = vcf10q26snps,
features = esegs,
feature.type = "segmentations",
CI = 0.8,
return.overlaps = TRUE)
| | 0 % ~calculating
|+++++++++ | 17% ~03s
|+++++++++++++++++ | 33% ~02s
|+++++++++++++++++++++++++ | 50% ~01s
|++++++++++++++++++++++++++++++++++ | 67% ~01s
|++++++++++++++++++++++++++++++++++++++++++ | 83% ~00s
|++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed = 02s
At first perhaps we are interested which cell types have the most overlaps with with snps, say among active enhancers.
snps.10q26.enhancers <- enrich10q26$overlaps$`Active Enhancer`
fg.snps.df <- as.data.frame(mcols(snps.10q26.enhancers$foregound.overlaps))
bg.snps.df <- as.data.frame(mcols(snps.10q26.enhancers$background.overlaps))
upset(fg.snps.df, nsets = 7)
upset(bg.snps.df, nsets = 7)
Foreground Background
While it looks like the snps are largely falling within hepatocyte active enhancers, if we examine the enrichment statistics we can see otherwise. Now we’ll extract the enrichment stats and plot the enrichment profile of the variants.
enrich10q26.enrich <- enrich10q26$enrichment
PlotEnrichment(enrich10q26.enrich,
value = "difference",
block1 = "state",
color.by = "sample",
ncol = 6)
While there is a tendency for SNPs in LD with each other to be enriched in some cellular interest, it is probable that only a subset of them are of functional interest. In our case, we would like to examine the hypothesis that one of our putative functional SNPs disrupts a transcription factor binding site.
Since our index SNP was found in association with breast cancer, and we see enrichment of it’s LD proxies in MCF7, we’ll choose linked variants that overlap Active Enhancer in the MCF7 cell line from ENCODE. Segmentation features live in the overlaps slot of the enrichment object from the previous steps.
snps.10q26.enhancers <- snps.10q26.enhancers$foregound.overlaps[snps.10q26.enhancers$foregound.overlaps$MCF.7 > 0, ]
snps.10q26.enhancers
GRanges object with 20 ranges and 7 metadata columns:
seqnames ranges strand | bipolar.spindle.neuron GM12878 hepatocyte induced.pluripotent.stem.cell
<Rle> <IRanges> <Rle> | <integer> <integer> <integer> <integer>
rs72830627 chr10 [123006362, 123006362] * | 0 0 0 0
rs1896416 chr10 [123084490, 123084490] * | 0 0 1 0
rs1078806 chr10 [123338975, 123338975] * | 1 0 1 0
rs11599804 chr10 [123340664, 123340664] * | 1 0 1 0
rs1219642 chr10 [123348389, 123348389] * | 0 0 0 0
... ... ... ... . ... ... ... ...
rs34354213 chr10 [123794212, 123794212] * | 0 0 0 0
rs11200359 chr10 [123794217, 123794217] * | 0 0 0 0
rs2459091 chr10 [123819539, 123819539] * | 0 0 0 0
rs2459090 chr10 [123819715, 123819715] * | 0 0 1 0
rs2461233 chr10 [123819808, 123819808] * | 0 0 1 0
MCF.7 neutrophil DOHH2
<integer> <integer> <integer>
rs72830627 1 0 0
rs1896416 1 0 0
rs1078806 1 0 0
rs11599804 1 0 0
rs1219642 1 0 0
... ... ... ...
rs34354213 1 0 0
rs11200359 1 0 0
rs2459091 1 0 0
rs2459090 1 0 0
rs2461233 1 0 0
-------
seqinfo: 1 sequence from hg19 genome
Now that we have a SNP of interest from breast tissue enhancer, let’s work with the motifBreakR package to determine whether a putative transcription factor binding site (TFBS) is disrupted. MotifBreakR allows us to do this with nothing more than the rsID or a location and allele information. It enables us to use any genome housed in bioconductor BSgenome packages and any set of motifs curated in the MotifDB package (plus some custom databases in MotifDB format that are included with the MotifBreakR installation)1.
We also have some slides explaining how motifbreakR works in addition to the paper and vignette in the footnote indicated above.
We need to import some version of the human genome and a compatible set of variants from Bioconductor.
library(motifbreakR)
library(BSgenome.Hsapiens.UCSC.hg19)
library(SNPlocs.Hsapiens.dbSNP144.GRCh37)
We need to grab the SNP variant objects out of SNPlocs package that correspond to our enhancer proxies.
snps.mb.10q26.enhancers <- snps.from.rsid(rsid = names(snps.10q26.enhancers),
dbSNP = SNPlocs.Hsapiens.dbSNP144.GRCh37,
search.genome = BSgenome.Hsapiens.UCSC.hg19)
the condition has length > 1 and only the first element will be used
snps.mb.10q26.enhancers
GRanges object with 21 ranges and 4 metadata columns:
seqnames ranges strand | SNP_id alleles_as_ambig REF ALT
<Rle> <IRanges> <Rle> | <character> <DNAStringSet> <DNAStringSet> <DNAStringSet>
rs10736308 chr10 [123601849, 123601849] * | rs10736308 S G C
rs1078806 chr10 [123338975, 123338975] * | rs1078806 R A G
rs11200358 chr10 [123794209, 123794209] * | rs11200358 K G T
rs11200359 chr10 [123794217, 123794217] * | rs11200359 Y T C
rs113012332 chr10 [123747830, 123747830] * | rs113012332 Y C T
... ... ... ... . ... ... ... ...
rs2981584 chr10 [123350216, 123350216] * | rs2981584 M A C
rs34354213 chr10 [123794212, 123794212] * | rs34354213 Y C T
rs3862115 chr10 [123793061, 123793061] * | rs3862115 Y T C
rs72830627 chr10 [123006362, 123006362] * | rs72830627 R A G
rs76133150 chr10 [123747500, 123747500] * | rs76133150 R A G
-------
seqinfo: 93 sequences (1 circular) from hg19 genome
Now a simple call to the motifbreakR function completes the analysis. Here, we specify that we want to filter our results by p-value (rather than scaled motif scores), we’ll set that p-value threshold at 10^-4, using information content (set argument method = "ic"
) and a uniform prior on the GC content.
results.10q26.enhancers <- motifbreakR(snpList = snps.mb.10q26.enhancers,
filterp = TRUE,
pwmList = hocomoco,
threshold = 1e-4,
method = "ic",
bkg = c(A=0.25, C=0.25, G=0.25, T=0.25),
BPPARAM = BiocParallel::bpparam())
'*' ranges were treated as '+''*' ranges were treated as '+''*' ranges were treated as '+''*' ranges were treated as '+''*' ranges were treated as '+''*' ranges were treated as '+'
results.10q26.enhancers
GRanges object with 39 ranges and 18 metadata columns:
seqnames ranges strand | REF ALT snpPos motifPos geneSymbol dataSource
<Rle> <IRanges> <Rle> | <DNAStringSet> <DNAStringSet> <integer> <integer> <character> <character>
rs1219642 chr10 [123348377, 123348389] - | T C 123348389 1 CEBPG HOCOMOCO
rs2459091 chr10 [123819533, 123819546] + | C T 123819539 7 ELF3 HOCOMOCO
rs2459091 chr10 [123819537, 123819548] + | C T 123819539 3 ELK3 HOCOMOCO
rs2981583 chr10 [123350516, 123350528] - | A G 123350523 6 EPAS1 HOCOMOCO
rs2981584 chr10 [123350207, 123350218] - | A C 123350216 3 FUBP1 HOCOMOCO
... ... ... ... . ... ... ... ... ... ...
rs2981584 chr10 [123350202, 123350218] + | A C 123350216 15 TLX1 HOCOMOCO
rs11200358 chr10 [123794197, 123794218] + | G T 123794209 13 ZBTB7B HOCOMOCO
rs11200358 chr10 [123794199, 123794217] + | G T 123794209 11 ZFX HOCOMOCO
rs34354213 chr10 [123794199, 123794217] + | C T 123794212 14 ZFX HOCOMOCO
rs11200358 chr10 [123794198, 123794212] - | G T 123794209 4 ZNF423 HOCOMOCO
providerName providerId seqMatch pctRef pctAlt scoreRef scoreAlt Refpvalue
<character> <character> <character> <numeric> <numeric> <numeric> <numeric> <logical>
rs1219642 CEBPG_si CEBPG_HUMAN acaattgcaccaT 0.9326388 0.9131911 8.110324 7.941205 <NA>
rs2459091 ELF3_f1 ELF3_HUMAN gtgataCaggaact 0.9240643 0.8269081 9.003045 8.056464 <NA>
rs2459091 ELK3_f1 ELK3_HUMAN taCaggaactgc 0.9415539 0.8235314 6.955565 6.083694 <NA>
rs2981583 EPAS1_si EPAS1_HUMAN cacacgcAcactc 0.9410674 0.8898237 7.272689 6.876672 <NA>
rs2981584 FUBP1_f1 FUBP1_HUMAN acaaagccaAga 0.7012659 0.9244287 5.395118 7.111998 <NA>
... ... ... ... ... ... ... ... ...
rs2981584 TLX1_f2 TLX1_HUMAN attgcacaaagccaAga 0.9011307 0.8206667 4.555611 4.148830 <NA>
rs11200358 ZBT7B_si ZBT7B_HUMAN agcactttgggaGgccgaggtg 0.8119717 0.7656583 9.442073 8.903514 <NA>
rs11200358 ZFX_f1 ZFX_HUMAN cactttgggaGgccgaggt 0.9352668 0.7682406 10.638093 8.738271 <NA>
rs34354213 ZFX_f1 ZFX_HUMAN cactttgggaggcCgaggt 0.9352668 0.8312528 10.638093 9.454997 <NA>
rs11200358 ZN423_f1 ZN423_HUMAN gcactttgggaGgcc 0.6739751 0.7503685 14.692071 16.357381 <NA>
Altpvalue alleleRef alleleAlt effect
<logical> <numeric> <numeric> <character>
rs1219642 <NA> 0.60000000 0.1485714 weak
rs2459091 <NA> 0.86507937 0.0000000 strong
rs2459091 <NA> 0.90566038 0.0000000 strong
rs2981583 <NA> 0.73914182 0.1676945 weak
rs2981584 <NA> 0.01904762 0.9809524 strong
... ... ... ... ...
rs2981584 <NA> 0.6653994 0.01631485 weak
rs11200358 <NA> 0.7443182 0.00000000 strong
rs11200358 <NA> 0.9929020 0.00000000 strong
rs34354213 <NA> 0.8850690 0.02484316 strong
rs11200358 <NA> 0.0000000 1.00000000 strong
-------
seqinfo: 93 sequences (1 circular) from hg19 genome
MotifbreakR does a quick estimate of the p-value threshold initially. To obtain accurate p-values, we run “calculatePvalue” in a seperate step (beware this can be slow!).
pvalue.results.10q26.enhancers <- calculatePvalue(results = results.10q26.enhancers)
MotifbreakR includes a plotting function to visualize the SNP in its genomic context, aligned with the motif logos it is purported to alter.
plotMB(results.10q26.enhancers, "rs34354213", effect = "strong")
Here we can see that the variant disrupts a Pax-5 motif which is interesting given the results in this paper Breast Cancer Malignant Processes are Regulated by Pax-5 Through the Disruption of FAK Signaling Pathways
see the manuscript and vignette for details.↩