Cambridge Bioinformatics Hackathon

On 25-27 September 2017, we held our first Cambridge Bioinformatics Hackathon (@CamBioHack / #CamHack17), bringing together programmers to work on bioinformatics or computational biology projects.

We submitted a summary of the projects to F1000Research. This work has now been assigned a digital object identifier (DOI) and is fully citable:

Wingett S, Ayres D, Bagshaw A et al. Cambridge Bioinformatics Hackathon 2017 [version 1; not peer reviewed]. F1000Research 2017, 6:1942 (slides) (doi: 10.7490/f1000research.1115038.1)

For further details, please contact steven.wingett@babraham.ac.uk

Projects Overview

Project Git
Adding functionality to FastQC so that it detects and reports the extent of putative flowcell close-proximity duplicate clusters in FASTQ files.
A web-based tool for integration and visualisation of targeted chromatin interaction datasets with genetic/genomic data. Specifically, I want to add functionality for integrating and querying variant positional data from the BRIDGE dataset of 10k whole genome sequences, both through the frontend and through the Elasticsearch backend using R. See https://www.chicp.org and https://www.ncbi.nlm.nih.gov/pubmed/27153610 for further details. https://github.com/ollyburren/django-chicp
Topologically Associated Domain (TAD) discovery with neural networks. TADs show up on Hi-C contact maps as squares along the diagonal that are symmetric about it. The idea is to train a neural network on known TADs and then try to detect them in other Hi-C contact maps. If time allows, the results will be compared with a current TAD-calling algorithm. https://bitbucket.org/account/user/nncm/projects/NNTAD
Using neural networks for RNA secondary structure prediction.
An interactive R/shiny tool to perform and visualise gene-clustering based on RNAseq data. https://github.com/gogleva/shiny_transcriptomes
Write a walking assembler (SeqWalker) to attempt to resolve the structure of the IgH locus in the 129 mouse strain. This uses iterative alignments with the LAST aligner, followed by sequence extension. So far the software works well with tiled simulated long reads; its performance with real ONT reads will need to be established in the near future. https://github.com/FelixKrueger/CamHack2017
Predicting from sequencing data by which protocol a bisulfite library was generated.
1. Single-cell RNA-Seq cell type prediction: an R package to predict the cell types of single-cell RNA-Seq samples based on a prior collection of annotated gold-standard single-cell RNA-Seq experiments. 2. Automatic P-value thresholding: in bioinformatics we do a lot of multiple hypothesis testing, e.g. in GWAS and differential expression analysis. This introduces many false positives, and currently we perform a multiple-testing correction and choose an arbitrary threshold to limit them. However, this manual thresholding is sub-optimal: the significant results may include unnecessarily many false positives or exclude too many true positives. This project illustrates an idea for an automatic, data-driven way of finding an optimal P-value threshold.
Updating ParticleStats with new features with the plan to release version 2.0 and write up for publication. https://github.com/CTR-BFX/CambridgeHackathon and https://github.com/darogan/ParticleStats
A small number of limited studies have shown that length variation in microsatellite repeats located in introns may influence alternative splicing quite commonly, but genome-wide investigations are lacking (Bagshaw 2017 https://doi.org/10.1093/gbe/evx164). Preliminary data made available to me by Melissa Gymrek at UCSD show that splice STRs are detectable in the gEUVADIS project’s RNA sequencing data. I plan to extend this work by looking at various methods of alternative-splicing detection, and also other datasets including the Genotype-Tissue Expression (GTEx) Project.
Machine learning classification on small peptide prediction using Ribosome Profiling. https://github.com/boboppie/orf-discovery
De novo assembly of the raw Oxford Nanopore reads using PHRAP.
Lightweight, in-browser app for discovering publicly available datasets in relation to biological pathways. https://github.com/OmicsDI
Preliminary work on parallelisation of the stepwise addition method for building starting trees in maximum likelihood phylogenetic inference. https://github.com/beagle-dev/beagle-lib
An interactive web app that uses the 23andMe API to map and visualize SNPs associated with different features (such as traits and genetic risk factors) on the chromosomes, and combines this with ancestry information. https://github.com/pandora2017/genomic_trait_mapper
Lab workbench – a Shiny workbench for easy access and display of my group’s data.
A guided web-based system for statistical power analysis. https://github.com/s-andrews/powerguide
Adding new features to SeqPlots – interactive software for exploratory data analysis, pattern discovery and visualization in genomics. SeqPlots (Stempor and Ahringer 2016, https://wellcomeopenresearch.org/articles/1-14) is genomic data visualization software written in R using the Shiny framework (https://github.com/Przemol/seqplots). SeqPlots provides both an R interface (https://bioconductor.org/packages/seqplots) and an easy-to-use GUI desktop package wrapped in the Electron framework (http://seqplots.ga/). In the year since its release I have collected feedback and suggestions for improvements and additional features. I would like to focus on the following:
– Support for custom-built reference genomes in FASTA format – the current version uses BSgenome packages, which prevents non-R users from using genome builds that are not available in the Bioconductor repository
– Metagene analysis of plus and minus strand reads using plus and minus strand bigWig files – useful for strand-specific RNA-seq and downstream analyses of CRAC data (https://genomebiology.biomedcentral.com/articles/10.1186/gb-2013-14-3-r30)
– Support for narrowPeak formatted files, often used as the output of peak callers
– Automated testing of the GUI using shinytest (https://github.com/rstudio/shinytest) – the current approach using a Selenium server often fails for no apparent reason with continuous integration services (Travis CI and AppVeyor)
https://github.com/Przemol/seqplots
Restarting the EMBOSS suite of sequence analysis applications. Started in Cambridge in 1997, the code has been dormant for 5 years. Building a new code repository, updating the project sites, adding bug and feature trackers, and populating them with the (relatively few) currently known issues. Code in C and Java. emboss.sourceforge.net
Using network analysis to identify cross-species transmissions and recombination in virus genomes. https://github.com/KatyBrown/VirusNetworks
Initial QC and analysis of immunoglobulin repertoire data. https://github.com/LouiseMatheson/analyseVDJ
RNA-seq QC tools. https://github.com/CTR-BFX/CambridgeHackathon
Detecting copy number variation in malaria parasite samples. https://github.com/tnguyensanger/pf_swga_cnv
Machine learning and data mining.
Converting MATLAB scripts to Python while learning Python.
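
For context on the automatic P-value thresholding project listed above: the conventional approach it describes (a multiple-testing correction followed by a manually chosen threshold) can be sketched in Python as below. This is a minimal illustration of the standard Benjamini-Hochberg step-up procedure, not the project's own method; the function name `benjamini_hochberg`, the example P-values and the `alpha` of 0.05 are assumptions made for the example.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return one boolean per input P-value: True if that hypothesis is
    rejected when controlling the false discovery rate at `alpha`.

    `alpha` is the manually chosen threshold (an assumption for this sketch).
    """
    m = len(p_values)
    if m == 0:
        return []

    # Indices of the P-values, sorted from smallest to largest P-value.
    order = sorted(range(m), key=lambda i: p_values[i])

    # Step-up rule: find the largest rank k (1-based) such that
    # p_(k) <= (k / m) * alpha.
    max_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            max_k = rank

    # Reject every hypothesis ranked at or below max_k.
    rejected = [False] * m
    for idx in order[:max_k]:
        rejected[idx] = True
    return rejected


if __name__ == "__main__":
    # Hypothetical P-values, e.g. from a differential expression analysis.
    example = [0.001, 0.008, 0.039, 0.041, 0.042,
               0.06, 0.074, 0.205, 0.212, 0.216]
    calls = benjamini_hochberg(example, alpha=0.05)
    for p, significant in zip(example, calls):
        print(p, "significant" if significant else "not significant")
```

Here `alpha` is exactly the kind of hand-picked cut-off that the project proposes to replace with a data-driven choice.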

Many thanks to Genestack for sponsoring the Cambridge Bioinformatics Hackathon.