Data Analysis

Traditional techniques provide no information about detrimental or neutral mutations; they can only report beneficial mutations that become fixed or nearly fixed in the population after a few rounds of adaptation. The information generated by CirSeq is unique in both density (single-nucleotide resolution) and nature (it covers lethal, neutral and weakly beneficial alleles). We consistently obtained 10^5 to 10^6 reads per nucleotide, from which we calculate the mutation frequency at each nucleotide position in the viral genome. This provides an unprecedented wealth of genetic information.

CirSeq analysis is performed on the T-BioInfo platform. Below is a screenshot of the Virology analysis graph, which includes algorithms for analyzing CirSeq data as well as regular NGS data. Data mining is critical for a researcher who has produced huge quantities of data and needs to understand the complex phenomena involved.
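The per-position mutation frequency mentioned above can be illustrated with a minimal sketch. It assumes that base counts per genome position have already been tallied from mapped consensus reads; the function name `mutation_frequencies`, the input layout, and the toy counts are hypothetical and are not part of the T-BioInfo platform.

```python
from collections import Counter

def mutation_frequencies(reference, base_counts):
    """Return, for each position, the frequency of every non-reference base.

    reference   -- reference genome as a string (hypothetical input)
    base_counts -- one Counter per position, mapping base -> number of
                   consensus reads supporting that base (hypothetical input)
    """
    freqs = []
    for ref_base, counts in zip(reference, base_counts):
        depth = sum(counts.values())
        if depth == 0:
            # No coverage: nothing can be said about this position.
            freqs.append({})
            continue
        freqs.append(
            {base: n / depth for base, n in counts.items() if base != ref_base}
        )
    return freqs

# Toy data at roughly the coverage quoted in the text (made-up numbers).
reference = "ACG"
base_counts = [
    Counter({"A": 199_990, "G": 10}),
    Counter({"C": 500_000}),
    Counter({"G": 99_999, "T": 1}),
]
freqs = mutation_frequencies(reference, base_counts)
```

At a depth of 10^5 reads, a single supporting read corresponds to a frequency of 10^-5, which is why high coverage is needed to observe rare alleles at all.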

Overall quality of the CirSeq libraries

General statistics for CirSeq-generated reads:
  1. Lengths of reads
  2. PHRED quality of nucleotides in the reads
Statistics for slicing CirSeq reads into repeats:
  1. Distribution of repeat sizes
  2. Quality of consensuses
Statistics for mapping the CirSeq consensuses onto:
  1. Virus genome
  2. Host genome
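As an illustration of the general read statistics listed above, the following sketch computes a read-length histogram and the mean PHRED score. It assumes reads are available as (sequence, quality string) pairs with the common Phred+33 ASCII encoding; the function names and the input layout are hypothetical and do not describe the platform's own implementation.

```python
from collections import Counter

def phred_scores(quality_string, offset=33):
    """Decode an ASCII quality string (Phred+33 assumed) into integer scores."""
    return [ord(ch) - offset for ch in quality_string]

def error_probability(q):
    """Convert a PHRED score Q into an error probability, p = 10^(-Q/10)."""
    return 10 ** (-q / 10)

def library_stats(records):
    """Return (read-length histogram, mean PHRED score over all bases).

    records -- iterable of (sequence, quality_string) pairs (hypothetical input)
    """
    length_hist = Counter()
    total_q = 0
    n_bases = 0
    for seq, qual in records:
        length_hist[len(seq)] += 1
        scores = phred_scores(qual)
        total_q += sum(scores)
        n_bases += len(scores)
    mean_q = total_q / n_bases if n_bases else 0.0
    return length_hist, mean_q

# Two made-up reads: "I" encodes Q40 and "5" encodes Q20 under Phred+33.
records = [("ACGT", "IIII"), ("AC", "5I")]
length_hist, mean_q = library_stats(records)
```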

Algorithm to obtain consensus sequences

  1. Slicing the CirSeq reads into repeats based on the most frequent distance between identical k-mers in the read.
  2. Mapping consensuses of the CirSeq repeats onto the virus genome. Counts are calculated as:

- For symmetric data: sum of mapped consensuses per position

- For non-symmetric data: sum of weights of individual consensuses per position

  3. Strict filtering of consensuses, based on matching nucleotides of high PHRED quality in at least two repeats of the consensus.
  4. Soft filtering of consensuses, based on matching nucleotides of medium PHRED quality in special combinations of three repeats. Each accepted consensus combination at a position provides a probability (weight) that the position is a true consensus.
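The repeat detection and strict filtering steps above can be sketched as follows. This is a simplified illustration under stated assumptions, not the platform's algorithm: the k-mer size, the quality threshold of 30, and the function names are hypothetical, the read is assumed to start at a repeat boundary, and the soft (weighted) filtering step is not covered.

```python
from collections import Counter

def repeat_length(read, k=4):
    """Estimate the repeat size as the most frequent distance between
    consecutive occurrences of the same k-mer in the read."""
    last_seen = {}
    distances = Counter()
    for i in range(len(read) - k + 1):
        kmer = read[i:i + k]
        if kmer in last_seen:
            distances[i - last_seen[kmer]] += 1
        last_seen[kmer] = i
    if not distances:
        return None  # no k-mer occurs twice, so no repeat structure was found
    return distances.most_common(1)[0][0]

def strict_consensus(read, quals, period, min_q=30, min_support=2):
    """Build a consensus of length `period`; a base is accepted only if at
    least `min_support` repeats agree on it with quality >= `min_q`.
    Positions that fail the filter are reported as 'N'."""
    consensus = []
    for offset in range(period):
        votes = Counter()
        for i in range(offset, len(read), period):
            if quals[i] >= min_q:
                votes[read[i]] += 1
        ranked = votes.most_common(2)
        unique_top = len(ranked) == 1 or (
            len(ranked) == 2 and ranked[1][1] < ranked[0][1]
        )
        if ranked and ranked[0][1] >= min_support and unique_top:
            consensus.append(ranked[0][0])
        else:
            consensus.append("N")
    return "".join(consensus)

# Toy read: three copies of an 8-nt unit, with one low-quality error
# (T -> A) in the second copy at index 11.
read = "ACGTTGCA" + "ACGATGCA" + "ACGTTGCA"
quals = [40] * len(read)
quals[11] = 10

period = repeat_length(read)
consensus = strict_consensus(read, quals, period)
```

In this toy read the erroneous base is outvoted because the two other repeats agree at high quality. In real data the read does not necessarily start at a repeat boundary, so the consensus is a rotation of the template that the mapping step has to resolve.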

Mutation rate across passages and fitness of variants

A fitness model-based Bayesian regression approach is used to estimate the fitness of individual mutation variants in CirSeq data. The fitness estimate is averaged across many trials of simulated binomial genetic drift. Estimation is performed in both time directions across passages in order to detect reliably beneficial and detrimental mutations, together with confidence intervals for their fitness values.

Model-free fitness estimation relies on the major components of the general behavior of mutation frequency across passages for the entire pool of positions. Detection of beneficial and detrimental mutations is based on Wilcoxon scores for increases or decreases in frequency profiles, coupled with a projection of the given mutation frequency profile onto the TimeFit component of the multi-dimensional distribution of mutation frequency profiles.
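To make the idea of averaging over simulated binomial drift concrete, here is a minimal sketch. It does not reproduce the Bayesian regression described above; it assumes a simple haploid selection model in which a variant with relative fitness w changes frequency from p to wp / (wp + 1 - p) per passage, followed by binomial sampling of a finite population. All function names, the population size, and the parameter values are hypothetical.

```python
import random

def selection_step(p, w):
    """Expected frequency after one passage for relative fitness w
    (assumed haploid selection model, mutation ignored)."""
    return w * p / (w * p + (1.0 - p))

def simulate_passages(p0, w, pop_size, n_passages, rng):
    """Simulate a frequency trajectory with binomial drift at each passage."""
    freqs = [p0]
    p = p0
    for _ in range(n_passages):
        expected = selection_step(p, w)
        # Binomial sampling of pop_size genomes, done as Bernoulli draws.
        carriers = sum(rng.random() < expected for _ in range(pop_size))
        p = carriers / pop_size
        freqs.append(p)
    return freqs

def odds(p):
    return p / (1.0 - p)

def estimate_fitness(freqs):
    """Per-passage fitness from the first and last frequency. Under the
    assumed model, the odds are multiplied by w at each passage.
    Returns None if the variant is absent or fixed at either end."""
    first, last = freqs[0], freqs[-1]
    if not (0.0 < first < 1.0 and 0.0 < last < 1.0):
        return None
    return (odds(last) / odds(first)) ** (1.0 / (len(freqs) - 1))

def mean_fitness_over_trials(p0, w, pop_size, n_passages, n_trials, seed=0):
    """Average the fitness estimate over many simulated drift trials."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_trials):
        trajectory = simulate_passages(p0, w, pop_size, n_passages, rng)
        est = estimate_fitness(trajectory)
        if est is not None:
            estimates.append(est)
    return sum(estimates) / len(estimates) if estimates else None

mean_w = mean_fitness_over_trials(
    p0=0.05, w=1.3, pop_size=1000, n_passages=5, n_trials=50, seed=1
)
```

A single trajectory gives a noisy estimate because of sampling in a finite population; averaging over trials reduces that noise, which is the motivation for the averaging step in the text.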

Detection of quasi-species as sets of epistatic mutations

In a long series of passages, a vector of values is calculated for each mutation from sliding-window fitness estimates across the series of passages. This fitness profile (vector) robustly reflects the specific role of the given mutation in the development of the virus population. A network of mutations (interlinked fitness profiles) is built by the Markov field (Gaussian Graphical Model, GGM) and BiAssociation algorithms. Each module of tightly interconnected mutation variants in this network is interpreted as a group of epistatic mutations, which in turn defines the core set of mutations of a quasi-species. These modules (NetModules) are detected by the platform's P-clustering algorithm.
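The overall flow from fitness profiles to modules can be sketched as below. Note the substitutions: plain Pearson correlation stands in for the GGM and BiAssociation algorithms, and connected components of a thresholded network stand in for P-clustering, since none of those algorithms is specified here. The mutation names, profile values, and threshold are hypothetical.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length vectors (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)

def find_modules(profiles, threshold=0.9):
    """Link mutations whose fitness profiles correlate at or above `threshold`
    and return the connected components containing more than one mutation."""
    names = list(profiles)
    neighbours = {name: set() for name in names}
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if pearson(profiles[a], profiles[b]) >= threshold:
                neighbours[a].add(b)
                neighbours[b].add(a)
    seen, modules = set(), []
    for name in names:
        if name in seen:
            continue
        # Depth-first search collects one connected component.
        stack, component = [name], set()
        while stack:
            current = stack.pop()
            if current in component:
                continue
            component.add(current)
            stack.extend(neighbours[current] - component)
        seen |= component
        if len(component) > 1:
            modules.append(component)
    return modules

# Made-up sliding-window fitness profiles for four hypothetical mutations.
profiles = {
    "A123G": [1.0, 1.1, 1.3, 1.6],
    "C456T": [0.9, 1.0, 1.2, 1.5],
    "G789A": [1.5, 1.2, 1.1, 1.0],
    "T42C": [1.0, 1.0, 1.0, 1.0],
}
modules = find_modules(profiles)
```

A GGM would use partial rather than marginal correlations, so that two mutations are linked only if their profiles remain associated after accounting for all other mutations; the correlation threshold here is a much cruder criterion.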