A research study investigating the links between concussion and the gut microbiome.
This vignette contains all of the code necessary for the microbiome analysis within the NFL-funded concussion and gut microbiome study between Rice University and Houston Methodist Research Institute. The gut microbiome analysis was performed with long-read nanopore metagenomic sequencing carried out by me at HMRI; the raw fastq data will be accessible at a BioProject link that will be available here.
We also included metagenome-assembled genome (MAG) analysis using the program Aviary. To further showcase these data we created an interactive GUI of the results so far, which can be seen here. To use it, download the data found in the GitHub folder data/aviary_bin_info and upload it to the site.
library(dada2)
library(dplyr)
library(janitor)
library(biomformat)
library(Biostrings)
library(tibble)
library(digest)
library(phyloseq)
library(microViz)
library(ggplot2)
library(ggprism)
library(ggsci)
library(forcats)
library(FSA)
library(ggsignif)
library(broom)
library(igraph)
library(visNetwork)
library(SpiecEasi)
library(progressr)
library(furrr); plan(multisession, workers = 72)
This is the script I used to preprocess the fastq files after basecalling. We made a conda env and used NanoPlot v1.43.0, hostile v1.1.0, and chopper v0.9.0.
#! /bin/bash
#SBATCH --time=04-00:00:00
#SBATCH --partition=defq
#SBATCH --mail-user=agmarshall@houstonmethodist.org
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH --ntasks-per-node=36
#SBATCH --mem=192GB
#SBATCH --nodes=1
#SBATCH --job-name=metassem
#SBATCH --comment=metassem
#Load environment
module load mamba
mamba activate metassem
# Loop over all .fastq files in the current directory
for file in *.fastq; do
# Remove the .fastq extension to get the base name
prefix="${file%.fastq}"
echo "Processing $file..."
# First NanoPlot command on the raw file
NanoPlot -t 64 --fastq "$file" --maxlength 25000 --plots dot -o "${prefix}_nanplot1"
# Run hostile clean: outputs are placed into a directory "cleaned_<prefix>"
hostile clean --fastq1 "$file" --threads 64 --index human-t2t-hla --aligner minimap2 --out-dir "cleaned_${prefix}"
# Define the expected output file from hostile clean
cleaned_file="cleaned_${prefix}/${prefix}.clean.fastq.gz"
# Check if the cleaned file exists before proceeding
if [[ -f "$cleaned_file" ]]; then
# Use the hostile clean output as input for chopper
chopper -q 10 -l 250 -t 64 -i "$cleaned_file" > "${prefix}_final.fastq"
else
echo "Error: Cleaned file $cleaned_file not found. Skipping chopper step for $file."
continue
fi
# Second NanoPlot command on the final file
NanoPlot -t 64 --fastq "${prefix}_final.fastq" --maxlength 15000 --plots dot -o "${prefix}_nanplot2"
echo "Finished processing $file."
done
At the beginning of this markdown we perform all the steps needed to classify long-read metagenomic data using lemur. This uses the updated lemur script that produces abundance count files rather than only relative abundances.
We used the final reads output from the step above for the rest of our analyses, starting here with lemur / magnet.
#! /bin/bash
#SBATCH --time=04-00:00:00
#SBATCH --partition=defq
#SBATCH --mail-user=agmarshall@houstonmethodist.org
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH --ntasks-per-node=64
#SBATCH --mem=192GB
#SBATCH --nodes=1
#SBATCH --job-name=lemur
#SBATCH --comment=lemur
module load mamba
conda activate lemur
# LEMUR WITH COUNTS OUTPUT NEW CODE FOR PULL REQUEST (in lemur github dir in scratch)
# Define the directory containing your FASTQ files.
input_dir="/home/tmhagm8/scratch/rice_concuss/preprocessed_reads"
# Loop over all FASTQ files matching the pattern (adjust pattern if needed)
for file in "$input_dir"/*.fastq; do
# Extract the base name (e.g., "mid_B010" from "mid_B010_final.fastq")
base=$(basename "$file" .fastq)
# Run the lemur command.
python /home/tmhagm8/scratch/lemur/lemur_write_abunds \
-i "$file" \
-o "${input_dir}/${base}_lemur" \
-t 64 \
-d /home/tmhagm8/scratch/lemur_db/rv221bacarc-rv222fungi \
--tax-path /home/tmhagm8/scratch/lemur_db/rv221bacarc-rv222fungi/taxonomy.tsv \
--mm2-type map-ont \
-r species
# Run the magnet command.
python /home/tmhagm8/scratch/magnet/magnet.py \
-c "${input_dir}/${base}_lemur/relative_abundance.tsv" \
-i "$file" \
-m ont \
-o "${input_dir}/${base}_magnet" \
--min-abundance 0.02 \
--threads 64
done
mkdir $input_dir/lemur_counts
mv $input_dir/*_lemur/*abundance_counts.tsv $input_dir/lemur_counts
# And this is how I made the phyloseq inputs from the lemur data
python /home/tmhagm8/scratch/rice_concuss/25_02_12_rice_trial/lemur_to_phyloseq.py -i $input_dir/lemur_counts --otu rice_trial_ps_otu_table.tsv --tax rice_trial_ps_tax_table.tsv
The updated Python script is below; it does not need to be included in the final Rmd.
#!/usr/bin/env python3
#
# Created on 11/07/2023 by Nick Sapoval
# Revised to output estimated counts in abundance_counts.tsv and label output files with the input basename
#
import datetime
import argparse
import os
import logging
import sys
import signal
import time
import shutil
import subprocess as subp
from itertools import repeat
from multiprocessing import Pool
import pandas as pd
import numpy as np
import pysam
= "1.0.1"
__version__
def CTRLChandler(signum, frame):
global SIGINT
= True
SIGINT "Caught request to terminate (CTRL+C), exiting...\n")
sys.stderr.write(128)
sys.exit(
signal.signal(signal.SIGINT, CTRLChandler)
class LemurRunEnv():
= ['RpsE', 'RpsG', 'RpoA', 'RplK', 'RpsQ', 'RplA', 'TsaD', 'RpsK', 'LeuS', 'RpsS',
GENES 'RpsM', 'Ffh', 'RplO', 'SecY', 'ArgS', 'CysS', 'PheS', 'RplB', 'RpsB', 'RplP',
'RplV', 'RpsO', 'Gtp1', 'RpoB', 'RplD', 'RpsL', 'RplC', 'RpsI', 'RpsD', 'RplM',
'SerS', 'RplR', 'ValS', 'FtsY', 'RplF', 'RpsH', 'RplN', 'HisS', 'RpsC', 'RplE']
= ["M", "I", "D", "=", "X", "H", "S"]
CIGAR_OPS_CHAR = [0, 1, 2, 7, 8, 5, 4]
CIGAR_OPS_INT = ['species', 'genus', 'family', 'order', 'class', 'phylum', 'clade', 'superkingdom']
TAXONOMY_RANKS = {0: 1.,
FIXED_COSTS 1: 0.005,
2: 0.005,
7: 1.,
8: 0.01,
4: 0.05,
5: 0.001}
def __init__(self):
self.CURRDATE = datetime.date.today().strftime('%Y-%m-%d')
self.CURRTIME = datetime.datetime.now().strftime('%H:%M:%S')
self.args = self.parse_args()
self.logger = self.logging_setup()
self.aln_score = self.args.aln_score
self.by_gene = self.args.aln_score_gene
# Compute the basename of the input file (without extension)
self.input_basename = os.path.splitext(os.path.basename(self.args.input))[0]
if self.args.sam_input:
self.sam_path = self.args.sam_input
else:
self.sam_path = os.path.join(self.args.output, "reads.sam")
if os.path.exists(self.args.output):
self.log(f"Output directory {self.args.output} exists! Results will be overwritten...", logging.WARNING)
self.args.output)
shutil.rmtree(self.args.output)
os.makedirs(
# Set output prefix for relative abundance files with the input basename
self.tsv_output_prefix = os.path.join(self.args.output, f"{self.input_basename}_relative_abundance")
self.lli_threshold = 0.01
self.low_abundance_threshold = 0.0001
self.rank = self.args.rank
def parse_args(self):
= argparse.ArgumentParser(
parser ="""
description Lemur example:
lemur -i <input> -o <output_dir> -t <threads>
This tool will run minimap2, build alignment models, and then perform an EM algorithm to
estimate species abundances. In addition to the original output files, you can now specify
a filename for the estimated abundance counts output using the --abundance-counts option.
"""
)'-v', '--version', action='version', version='%(prog)s ' + __version__)
parser.add_argument(
= parser.add_argument_group(title="Main Mob arguments")
main_args
main_args.add_argument("-i",
"--input",
type=str,
help="Input FASTQ file for the analysis",
=True
required
)
main_args.add_argument("-o",
"--output",
type=str,
=f"mob_out_{self.CURRDATE}_{self.CURRTIME}/",
defaulthelp="Folder where the Mob output will be stored (default: mob_out_<date>_<time>/)"
)
main_args.add_argument("-d",
"--db-prefix",
type=str,
help="Path to the folder with individual Emu DBs for each marker gene"
)
main_args.add_argument("--tax-path",
type=str,
help="Path to the taxonomy.tsv file (common for all DBs)"
)
main_args.add_argument("-t",
"--num-threads",
type=int,
=20,
defaulthelp="Number of threads to use (default: 20)"
)
main_args.add_argument("--aln-score",
=['AS', 'edit', 'markov'], default='AS',
choiceshelp='Alignment scoring method: AS uses SAM AS tag, edit uses edit-type distribution, markov scores CIGAR as a Markov chain (default: AS)',
type=str
)
main_args.add_argument("--aln-score-gene",
="store_true",
actionhelp="Use gene-specific alignment scoring"
)
main_args.add_argument("-r",
"--rank",
type=str,
="species",
defaulthelp="Taxonomic rank for output collapse (default: species)"
)
main_args.add_argument("--min-aln-len-ratio",
type=float,
=0.75,
defaulthelp="Minimum alignment length ratio (default: 0.75)"
)
main_args.add_argument("--min-fidelity",
type=float,
=0.50,
defaulthelp="Minimum fidelity for an alignment (default: 0.50)"
)
main_args.add_argument("--ref-weight",
type=float,
=1.,
defaulthelp="Reference weight for log likelihood adjustments (default: 1.)"
)# New argument for abundance counts output file
main_args.add_argument("--abundance-counts",
type=str,
="abundance_counts.tsv",
defaulthelp="Filename for the estimated abundance counts output (default: abundance_counts.tsv)"
)# Minimap2 specific arguments
= parser.add_argument_group(title="minimap2 arguments")
mm2_args
mm2_args.add_argument('--mm2-N', type=int, default=50,
help='Maximum number of secondary alignments per read for minimap2 (default: 50)'
)
mm2_args.add_argument('--mm2-K', type=int, default=500000000,
help='Minibatch size for minimap2 mapping (default: 500M)'
)
mm2_args.add_argument('--mm2-type', choices=['map-ont', 'map-hifi', 'map-pb', 'sr'], default='map-ont',
help='Mapping type for minimap2: map-ont, map-hifi, map-pb, or sr (default: map-ont)'
)# Verbosity/logging/additional info
= parser.add_argument_group(title="Miscellaneous arguments")
misc_args
misc_args.add_argument("--keep-alignments",
="store_true",
actionhelp="Keep SAM files after mapping (can use a lot of disk space)"
)
misc_args.add_argument("-e",
"--log-file",
type=str,
="stdout",
defaulthelp="Log file destination (default: stdout)"
)
misc_args.add_argument("--sam-input",
type=str,
help="Provide a SAM file as input instead of running minimap2"
)
misc_args.add_argument("--verbose",
='store_true',
actionhelp="Enable DEBUG level logging"
)
misc_args.add_argument("--save-intermediate-profile",
="store_true",
actionhelp="Save the abundance profile at every EM step"
)
misc_args.add_argument("--width-filter",
="store_true",
actionhelp="Apply width filter"
)
misc_args.add_argument("--gid-name",
="store_true",
actionhelp="Use gene id in reference name"
)
return parser.parse_args()
def init_taxonomy(self):
self.df_taxonomy = pd.read_csv(self.args.tax_path, sep='\t', index_col='tax_id', dtype=str)
self.log(self.df_taxonomy)
self.log(f"Loaded taxonomy file {self.args.tax_path}")
def init_F(self):
'''Initializes the frequency vector F representing the relative abundance of every species in the reference database.'''
self.F = pd.Series(data=[1. / len(self.df_taxonomy)] * len(self.df_taxonomy),
=self.df_taxonomy.index,
index="F")
nameself.log("Initialized abundance vector")
def run_minimap2(self):
'''
Runs minimap2 with the following command:
minimap2 -ax <mm2_type> -N <mm2-N> -K <mm2-K> -p .9 -f 0 --sam-hit-only --eqx -t <num_threads> <db>/species_taxid.fasta <input> -o <sam_path>
'''
= os.path.join(self.args.db_prefix, 'species_taxid.fasta')
db_sequence_file = f"minimap2 -ax {self.args.mm2_type} -N {self.args.mm2_N} -K {self.args.mm2_K} -p .9 -f 0 --sam-hit-only --eqx -t {self.args.num_threads} {db_sequence_file} {self.args.input} -o {self.sam_path}"
cmd_str = subp.Popen(cmd_str,
p =True,
shell=None,
stdin=subp.PIPE,
stdout=subp.PIPE,
stderr=True,
close_fds=True)
text= p.communicate()
fstdout, fstderr = p.returncode
rc
if rc != 0:
self.log(f"minimap2 failed executing:\n{cmd_str}\n\nSTDOUT:{fstdout}\n\nSTDERR:{fstderr}", logging.CRITICAL)
sys.exit(rc)else:
self.log(fstdout, logging.DEBUG)
self.log(fstderr, logging.DEBUG)
def build_alignment_model(self):
if self.args.aln_score == "AS":
return
elif self.args.aln_score == "markov":
self.build_transition_mats()
elif self.args.aln_score == "edit":
self.edit_cigar = self.build_edit_cost()
else:
self.fixed_cigar = {0: 1.,
1: 0.005,
2: 0.005,
7: 1.,
8: 0.01,
4: 0.05,
5: 0.01}
def build_edit_cost(self):
'''
Constructs a dictionary with the operation costs for each value in the CIGAR string.
Returns:
op_costs (dict): Dictionary mapping CIGAR op indices to cost.
'''
= {0: 0., 1: 0., 2: 0., 7: 0., 8: 0., 4: 0., 5: 0.}
op_costs = [1, 2, 8, 4, 5]
ops_w_cost
if self.by_gene:
pass
else:
= 0
total = self.extract_cigars_all(self.sam_path)
cigars for cigar in cigars:
for t in cigar:
= t
OP, L if OP in ops_w_cost:
+= L
op_costs[OP] += L
total for OP in op_costs:
if OP in ops_w_cost:
= op_costs[OP] / total
op_costs[OP] else:
= 1.
op_costs[OP] return op_costs
def build_transition_mats(self):
if self.by_gene:
= self.extract_cigars_per_gene(self.sam_path)
gene_cigars self.gene_transition_mats = {self.GENES[i]: mat
for i, mat in enumerate(map(self.build_transition_mat, gene_cigars.values()))
}else:
= self.extract_cigars_all(self.sam_path)
cigars self.transition_mat = self.build_transition_mat(cigars)
def extract_cigars_per_gene(self, sam_path):
= pysam.AlignmentFile(sam_path)
samfile = {gene: [] for gene in self.GENES}
gene_cigars for read in samfile.fetch():
if not read.is_secondary and not read.is_supplementary:
= read.reference_name.split(":")[1].split("/")[-1]
gene
gene_cigars[gene].append(read.cigartuples)return gene_cigars
def extract_cigars_all(self, sam_path):
= pysam.AlignmentFile(sam_path)
samfile = []
cigars for read in samfile.fetch():
if not read.is_secondary and not read.is_supplementary:
cigars.append(read.cigartuples)return cigars
def build_transition_mat(self, cigars):
= pd.DataFrame(
transition_mat =[[0] * (len(self.CIGAR_OPS_INT) + 1) for _ in range(len(self.CIGAR_OPS_INT) + 1)],
data=self.CIGAR_OPS_INT + [-1],
index=self.CIGAR_OPS_INT + [-1]
columns
)for cigar in cigars:
= len(cigar)
N = None
prev_OP for i, t in enumerate(cigar):
= t
OP, L if not ((i == 0 or i == N - 1) and (OP == self.CIGAR_OPS_INT[-1])):
if OP in self.CIGAR_OPS_INT:
+= L - 1
transition_mat.at[OP, OP] if prev_OP is not None:
+= 1
transition_mat.at[prev_OP, OP] = OP
prev_OP = transition_mat.div(transition_mat.sum(axis=1), axis=0).fillna(0.)
transition_mat self.log(transition_mat, logging.DEBUG)
return transition_mat
@staticmethod
def __get_aln_len(aln):
'''Gets the alignment length from the CIGAR stats.'''
= aln.get_cigar_stats()[0]
_, I, _, _, _, _, _, E, X, _, _ return I + E + X
def build_P_rgs_df(self):
'''
Builds a pandas DataFrame containing the likelihood values for each Read-Target mapping.
'''
= pysam.AlignmentFile(self.sam_path)
samfile = {"Read_ID": [], "Target_ID": [], "log_P": [], "cigar": [], "Gene": [], "Reference": [], "aln_len": []}
P_rgs_data
for i, aln in enumerate(samfile.fetch()):
= aln.query_name
qname = aln.get_tag("AS")
aln_score if aln_score > 0:
if self.args.gid_name:
= aln.reference_name.rsplit("_", maxsplit=1)[0]
species_tid else:
= int(aln.reference_name.split(":")[0])
species_tid = aln.reference_name.split(":")[1].split("/")[-1]
gene = aln.cigartuples
cigar
if self.args.aln_score == "AS":
= np.log(aln_score / (2 * self.__get_aln_len(aln)))
log_P_score "log_P"].append(log_P_score)
P_rgs_data[
"Read_ID"].append(qname)
P_rgs_data["Target_ID"].append(species_tid)
P_rgs_data[if not self.args.aln_score == "AS":
"cigar"].append(cigar)
P_rgs_data["Gene"].append(gene)
P_rgs_data["Reference"].append(aln.reference_name)
P_rgs_data["aln_len"].append(self.__get_aln_len(aln))
P_rgs_data[
if (i + 1) % 100000 == 0:
self.log(f"build_P_rgs_df extracted {i+1} reads", logging.DEBUG)
if not self.args.aln_score == "AS":
with Pool(self.args.num_threads) as pool:
if self.aln_score == "markov":
if self.by_gene:
= [(cigar, self.gene_transition_mats[P_rgs_data["Gene"][i]])
cigar_gene_mat_tuples for i, cigar in enumerate(P_rgs_data["cigar"])]
"log_P"] = pool.starmap(self.score_cigar_markov, cigar_gene_mat_tuples)
P_rgs_data[else:
"log_P"] = pool.starmap(self.score_cigar_markov,
P_rgs_data[zip(P_rgs_data["cigars"], repeat(self.transition_mat)))
elif self.aln_score == "edit":
if self.by_gene:
= self.score_cigar_fixed(P_rgs_data["cigar"][i], self.gene_edit_cigars[gene])
log_P_func else:
"log_P"] = pool.starmap(self.score_cigar_fixed,
P_rgs_data[zip(P_rgs_data["cigar"], repeat(self.edit_cigar)))
else:
if self.by_gene:
= self.score_cigar_fixed(P_rgs_data["cigar"][i], self.gene_fixed_cigars[gene])
log_P_func else:
"log_P"] = pool.starmap(self.score_cigar_fixed,
P_rgs_data[zip(P_rgs_data["cigar"], repeat(self.fixed_cigar)))
if "cigar" in P_rgs_data:
del P_rgs_data["cigar"]
self.P_rgs_df = pd.DataFrame(data=P_rgs_data)
self.P_rgs_df["max_aln_len"] = self.P_rgs_df.groupby("Read_ID")["aln_len"].transform('max')
self.P_rgs_df["log_P"] = self.P_rgs_df["log_P"] * self.P_rgs_df["max_aln_len"] / self.P_rgs_df["aln_len"]
self.P_rgs_df["max_log_P"] = self.P_rgs_df.groupby("Read_ID")["log_P"].transform('max')
self.P_rgs_df.to_csv(os.path.join(self.args.output, "P_rgs_df_raw.tsv"), sep='\t', index=False)
self.gene_stats_df = pd.read_csv(os.path.join(self.args.db_prefix, "gene2len.tsv"), sep='\t')
self.gene_stats_df["type"] = self.gene_stats_df["#id"].str.split("_").str[-1].str.split(":").str[0]
self.gene_stats_df["Target_ID"] = self.gene_stats_df["#id"].str.split(":").str[0].astype(int)
= self.P_rgs_df.merge(self.gene_stats_df,
gene_p_rgs_len_df ="left", right_on="#id", left_on="Reference")
how"aln_len_ratio"] = gene_p_rgs_len_df["aln_len"] / gene_p_rgs_len_df["length"]
gene_p_rgs_len_df["fidelity"] = gene_p_rgs_len_df["log_P"] / gene_p_rgs_len_df["aln_len"]
gene_p_rgs_len_df[
"ref_len_weighted_log_P"] = gene_p_rgs_len_df["log_P"] + \
gene_p_rgs_len_df[self.args.ref_weight * np.log(gene_p_rgs_len_df["aln_len_ratio"])
self.args.output, "gene_P_rgs_df_raw.tsv"), sep='\t', index=False)
gene_p_rgs_len_df.to_csv(os.path.join(if self.args.ref_weight != 0:
self.P_rgs_df["log_P"] = gene_p_rgs_len_df["ref_len_weighted_log_P"]
if self.args.aln_score == "AS":
self.P_rgs_df = self.P_rgs_df[(gene_p_rgs_len_df["aln_len_ratio"] >= self.args.min_aln_len_ratio) &
"log_P"] >= 1.1 * gene_p_rgs_len_df["max_log_P"]) &
(gene_p_rgs_len_df["log_P"] >= np.log(self.args.min_fidelity))]
(gene_p_rgs_len_df[= gene_p_rgs_len_df[(gene_p_rgs_len_df["aln_len_ratio"] >= self.args.min_aln_len_ratio) &
gene_p_rgs_len_df "log_P"] >= 1.1 * gene_p_rgs_len_df["max_log_P"]) &
(gene_p_rgs_len_df["log_P"] >= np.log(self.args.min_fidelity))]
(gene_p_rgs_len_df[else:
self.P_rgs_df = self.P_rgs_df[(gene_p_rgs_len_df["aln_len_ratio"] >= self.args.min_aln_len_ratio) &
"fidelity"] >= self.args.min_fidelity)]
(gene_p_rgs_len_df[= gene_p_rgs_len_df[(gene_p_rgs_len_df["aln_len_ratio"] >= self.args.min_aln_len_ratio) &
gene_p_rgs_len_df "fidelity"] >= self.args.min_fidelity)]
(gene_p_rgs_len_df[
= pd.read_csv(os.path.join(self.args.db_prefix, "reference2genome.tsv"), sep='\t',
ref2genome_df =["Name", "Genome"])
namesself.P_rgs_df = self.P_rgs_df.merge(ref2genome_df, how="left", left_on="Reference", right_on="Name")
self.P_rgs_df = self.P_rgs_df.groupby(by=["Read_ID", "Target_ID"]).max()
= self.P_rgs_df.reset_index().groupby("Read_ID").count().reset_index()[["Read_ID", "Target_ID"]]
__df_id_count = ["Read_ID", "Map_Count"]
__df_id_count.columns = self.P_rgs_df.reset_index().merge(__df_id_count, how="left", on="Read_ID")
__df self.unique_mapping_support_tids = set(__df[__df["Map_Count"] == 1]["Target_ID"])
self.log(self.P_rgs_df, logging.DEBUG)
self.P_rgs_df.reset_index().to_csv(os.path.join(self.args.output, "P_rgs_df.tsv"), sep='\t', index=False)
return self.P_rgs_df
def logging_setup(self):
= logging.getLogger("Mob")
logger
logger.setLevel(logging.DEBUG)
if self.args.log_file == "stdout":
= logging.StreamHandler(stream=sys.stdout)
handler elif self.args.log_file == "stderr":
= logging.StreamHandler(stream=sys.stderr)
handler else:
= logging.FileHandler(self.args.log_file, mode='w')
handler
= logging.StreamHandler(stream=sys.stderr)
error_handler
error_handler.setLevel(logging.ERROR)
if self.args.verbose:
handler.setLevel(logging.DEBUG)else:
handler.setLevel(logging.INFO)
= logging.Formatter("%(asctime)s %(message)s", datefmt="%Y-%m-%d %I:%M:%S %p")
formatter
handler.setFormatter(formatter)
error_handler.setFormatter(formatter)
logger.addHandler(handler)
logger.addHandler(error_handler)return logger
def log(self, msg, level=logging.DEBUG):
if level == logging.DEBUG:
self.logger.debug(f"DEBUG:\t{msg}")
elif level == logging.INFO:
self.logger.info(f"INFO:\t{msg}")
elif level == logging.WARNING:
self.logger.warning(f"WARNING:\t{msg}")
elif level == logging.ERROR:
self.logger.error(f"ERROR:\t{msg}")
elif level == logging.CRITICAL:
self.logger.critical(f"CRITICAL:\t{msg}")
@staticmethod
def score_cigar_markov(cigar, T):
= len(cigar)
N = None
prev_OP = 0.
log_P for i, t in enumerate(cigar):
= t
OP, L if not ((i == 0 or i == N - 1) and (OP == LemurRunEnv.CIGAR_OPS_INT[-1])):
if OP in [1, 2, 7, 8, 5, 4]:
if T.at[OP, OP] != 0:
+= (L - 1) * np.log(T.at[OP, OP])
log_P else:
+= (L - 1) * np.log(LemurRunEnv.FIXED_COSTS[OP])
log_P if prev_OP is not None:
if T.at[prev_OP, OP] != 0:
+= np.log(T.at[prev_OP, OP])
log_P = OP
prev_OP return log_P
@staticmethod
def score_cigar_fixed(cigar, fixed_op_cost):
= len(cigar)
N = [ct for i, ct in enumerate(cigar)]
aln_cigar = sum(map(lambda OP_L: np.log(fixed_op_cost[OP_L[0]]) * OP_L[1], aln_cigar))
log_P return log_P
@staticmethod
def logSumExp(ns):
= np.max(ns)
__max if not np.isfinite(__max):
= 0
__max = ns - __max
ds with np.errstate(divide='ignore'):
= np.exp(ds).sum()
sumOfExp return __max + np.log(sumOfExp)
def EM_step(self, final=False):
if final:
self.F = self.final_F
self.P_tgr = self.P_rgs_df.reset_index().merge(self.F, how="inner", left_on="Target_ID", right_index=True)
self.P_tgr["P(r|t)*F(t)"] = self.P_tgr.log_P + np.log(self.P_tgr.F)
self.P_tgr_sum = EM_get_Prgt_Ft_sums(self.P_tgr, self.args.num_threads)
self.P_tgr = self.P_tgr.merge(self.P_tgr_sum,
="left",
how="Read_ID",
left_on=True,
right_index=["", "_sum"])
suffixesself.P_tgr["P(t|r)"] = self.P_tgr["P(r|t)*F(t)"] - self.P_tgr["P(r|t)*F(t)_sum"]
= len(self.P_rgs_df.reset_index()["Read_ID"].unique())
n_reads self.F = self.P_tgr[["Target_ID", "P(t|r)"]].groupby("Target_ID") \
lambda x: np.exp(LemurRunEnv.logSumExp(x) - np.log(n_reads)))["P(t|r)"]
.agg(self.F.name = "F"
self.F = self.F.loc[self.F != 0]
self.log(self.F.sum(), logging.DEBUG)
if final:
self.final_F = self.F
def compute_loglikelihood(self):
return self.P_tgr["P(r|t)*F(t)_sum"].sum()
def EM_complete(self):
= len(set(self.P_rgs_df.reset_index()["Read_ID"]))
n_reads if n_reads == 0:
self.log(f"No alignments available for EM algorithm.\nVerify SAM file, {self.args.output}/P_rgs_df_raw.tsv, and consider lowering --min-aln-len-ratio or --min-fidelity flags.", logging.ERROR)
1)
exit(else:
self.low_abundance_threshold = 1. / n_reads
if self.args.width_filter:
= self.P_rgs_df.reset_index()
__P_rgs_df = list(self.F.index)
tids = []
filter_pass for tid in tids:
= __P_rgs_df[__P_rgs_df["Target_ID"] == tid]
tid_df = len(set(tid_df["Gene"]))
N_genes_hit = len(set(self.gene_stats_df[self.gene_stats_df["Target_ID"] == tid]["type"]))
N_genes = len(set(tid_df["Read_ID"]))
N_reads if N_reads > 10:
= LemurRunEnv.get_expected_gene_hits(N_genes, N_reads)
E_N_genes_hit, variance if (N_genes_hit / E_N_genes_hit > 0.7 or E_N_genes_hit - N_genes_hit <= 3 * variance) and N_genes_hit > 1:
True)
filter_pass.append(else:
False)
filter_pass.append(else:
True if N_reads > 0 else False)
filter_pass.append(self.F = self.F.loc[filter_pass]
= -np.inf
lli = 1
i while True:
self.log(f"Starting EM iteration {i}", logging.DEBUG)
self.EM_step()
= self.compute_loglikelihood()
new_lli = new_lli - lli
lli_delta self.log(f"Current log likelihood {new_lli}", logging.DEBUG)
self.log(f"LLI delta is {lli_delta}", logging.DEBUG)
= new_lli
lli
self.first_EM_df = self.df_taxonomy.merge(self.F, how='right', left_index=True, right_index=True).reset_index()
if self.args.save_intermediate_profile:
= self.df_taxonomy.merge(self.F, how='right', left_index=True, right_index=True).reset_index()
intermediate_df f"{self.tsv_output_prefix}-EM-{i}.tsv", sep='\t', index=False)
intermediate_df.to_csv(
if lli_delta < self.lli_threshold:
self.log(f"Low abundance threshold: {self.low_abundance_threshold:.8f}", logging.INFO)
self.final_F = self.F.loc[self.F >= self.low_abundance_threshold]
self.EM_step(final=True)
self.freq_to_lineage_df()
self.collapse_rank()
return
+= 1
i
@staticmethod
def get_expected_gene_hits(N_genes, N_reads):
= N_genes * (1 - np.power(1 - 1 / N_genes, N_reads))
E = N_genes * np.power(1 - 1 / N_genes, N_reads) + \
V * N_genes * (1 - 1 / N_genes) * np.power(1 - 2 / N_genes, N_reads) - \
N_genes * N_genes * np.power((1 - 1 / N_genes), 2 * N_reads)
N_genes return E, V
def freq_to_lineage_df(self):
# Original relative abundance output (with F values)
= self.df_taxonomy.merge(self.final_F, how='right', left_index=True, right_index=True).reset_index()
results_df = f"{self.tsv_output_prefix}.tsv"
out_path ='\t', index=False)
results_df.to_csv(out_path, sepself.log(f"Relative abundance file generated: {out_path}", logging.INFO)
return results_df
def collapse_rank(self):
= pd.read_csv(f"{self.tsv_output_prefix}.tsv", sep='\t')
df_emu if self.rank not in self.TAXONOMY_RANKS:
raise ValueError("Specified rank must be in list: {}".format(self.TAXONOMY_RANKS))
= self.TAXONOMY_RANKS[self.TAXONOMY_RANKS.index(self.rank):]
keep_ranks for keep_rank in keep_ranks.copy():
if keep_rank not in df_emu.columns:
keep_ranks.remove(keep_rank)if "estimated counts" in df_emu.columns:
= df_emu[['F', 'estimated counts'] + keep_ranks]
df_emu_copy = df_emu_copy.replace({'-': 0})
df_emu_copy = df_emu_copy.astype({'F': 'float', 'estimated counts': 'float'})
df_emu_copy else:
= df_emu[['F'] + keep_ranks]
df_emu_copy = df_emu_copy.replace({'-': 0})
df_emu_copy = df_emu_copy.astype({'F': 'float'})
df_emu_copy = df_emu_copy.groupby(keep_ranks, dropna=False).sum()
df_emu_copy = f"{self.tsv_output_prefix}-{self.rank}.tsv"
output_path ='\t')
df_emu_copy.to_csv(output_path, sepself.log(df_emu_copy.nlargest(30, ["F"]), logging.DEBUG)
self.log(f"Collapsed rank file generated: {output_path}\n", logging.DEBUG)
def write_abundance_counts(self):
# New function: produce estimated counts in a new output file.
= self.df_taxonomy.merge(self.final_F, how='right', left_index=True, right_index=True).reset_index()
results_df = len(set(self.P_rgs_df.reset_index()["Read_ID"]))
n_reads "estimated counts"] = (results_df["F"] * n_reads).round().astype(int)
results_df[= results_df.drop(columns=["F"])
results_df = os.path.join(self.args.output, f"{self.input_basename}_{self.args.abundance_counts}")
out_path ='\t', index=False)
results_df.to_csv(out_path, sepself.log(f"Abundance counts file generated: {out_path}", logging.INFO)
return results_df
def logSumExp_ReadId(readid, ns):
= np.max(ns)
__max if not np.isfinite(__max):
= 0
__max = ns - __max
ds with np.errstate(divide='ignore'):
= np.exp(ds).sum()
sumOfExp return readid, __max + np.log(sumOfExp)
def EM_get_Prgt_Ft_sums(P_tgr, nthreads):
= P_tgr["Read_ID"].to_numpy().astype(np.str_)
Read_ID = P_tgr["P(r|t)*F(t)"].to_numpy()
Prgt_Ft = Read_ID.argsort()
Read_ID_as = Read_ID[Read_ID_as]
Read_ID = Prgt_Ft[Read_ID_as]
Prgt_Ft = np.unique(Read_ID, return_index=True)
Read_ID_unq, Read_ID_Idx = list(zip(Read_ID_unq.tolist(), np.split(Prgt_Ft, Read_ID_Idx[1:])))
map_args = Pool(nthreads)
thread_pool = thread_pool.starmap(logSumExp_ReadId, map_args)
map_res
thread_pool.close()
thread_pool.join()= pd.DataFrame(map_res, columns=["Read_ID", "P(r|t)*F(t)"], dtype='O')
P_tgr_sum 'P(r|t)*F(t)'] = P_tgr_sum['P(r|t)*F(t)'].astype(float)
P_tgr_sum[= P_tgr_sum["Read_ID"].to_numpy()
rid if not np.all(rid[:-1] < rid[1:]):
"Read_ID"], inplace=True)
P_tgr_sum.sort_values(['Read_ID'], inplace=True)
P_tgr_sum.set_index([return P_tgr_sum
def main():
= LemurRunEnv()
run
if not run.args.sam_input:
f"Starting run of minimap2 at {datetime.datetime.now()}", logging.INFO)
run.log(= time.time_ns()
ts
run.run_minimap2()= time.time_ns()
t0 f"Finished running minimap2 in {(t0 - ts) / 1000000.0:.3f} ms", logging.INFO)
run.log(
= time.time_ns()
t0
run.init_taxonomy()= time.time_ns()
t1 f"Finished loading taxonomy in {(t1 - t0) / 1000000.0:.3f} ms", logging.INFO)
run.log(
run.init_F()= time.time_ns()
t2 f"Finished initializing F in {(t2 - t1) / 1000000.0:.3f} ms", logging.INFO)
run.log(
run.build_alignment_model()= time.time_ns()
t3 f"Finished building alignment model in {(t3 - t2) / 1000000.0:.3f} ms", logging.INFO)
run.log(
run.build_P_rgs_df()= time.time_ns()
t4 f"Finished constructing P(r|t) in {(t4 - t3) / 1000000.0:.3f} ms", logging.INFO)
run.log(
run.EM_complete()= time.time_ns()
t5 f"Finished EM in {(t5 - t4) / 1000000.0:.3f} ms", logging.INFO)
run.log(
# Write original outputs:
run.freq_to_lineage_df()
run.collapse_rank()# New output: abundance_counts.tsv (or filename as specified by --abundance-counts) with input basename label
run.write_abundance_counts()
if not run.args.keep_alignments:
os.remove(run.sam_path)
0)
exit(
if __name__ == "__main__":
main()
Move all your abundance_counts.tsv files to a single directory and run this script pointed at that directory. This will make the OTU and TAX tables needed for phyloseq.
#!/usr/bin/env python3
import pandas as pd
import glob
import os
import argparse
def main():
    parser = argparse.ArgumentParser(
        description="Combine Lemur classification outputs into OTU and Taxonomy tables for phyloseq using Target_ID as the OTU identifier."
    )
    parser.add_argument("-i", "--input", required=True,
                        help="Input directory containing the Lemur output files (tab-delimited)."
                        )
    parser.add_argument("--otu", default="otu_table.tsv",
                        help="Output filename for the OTU table (default: otu_table.tsv)"
                        )
    parser.add_argument("--tax", default="tax_table.tsv",
                        help="Output filename for the Taxonomy table (default: tax_table.tsv)"
                        )
    parser.add_argument("--pattern", default="*.tsv",
                        help="File pattern to match input files (default: *.tsv)"
                        )
    args = parser.parse_args()

    # Dictionaries to hold OTU abundance data and taxonomy info.
    otu_dict = {}      # Mapping Target_ID -> {sample: summed estimated counts}
    tax_dict = {}      # Mapping Target_ID -> taxonomy information (dictionary)
    sample_names = []

    file_list = glob.glob(os.path.join(args.input, args.pattern))
    if not file_list:
        print("No files found matching the pattern in the input directory.")
        return

    # Define the required taxonomy columns that are expected in each input file.
    # (These names should match the headers in your Lemur outputs.)
    required_tax_cols = ["species", "genus", "family", "order", "class", "phylum", "superkingdom"]

    # Define the desired output taxonomy column names and which input column provides each.
    desired_columns = {
        "Kingdom": "superkingdom",
        "Phylum": "phylum",
        "Class": "class",
        "Order": "order",
        "Family": "family",
        "Genus": "genus",
        "Species": "species"
    }

    for file in file_list:
        try:
            df = pd.read_csv(file, sep="\t", header=0)
        except Exception as e:
            print(f"Error reading {file}: {e}")
            continue

        # Verify required columns.
        if "Target_ID" not in df.columns:
            print(f"Column 'Target_ID' not found in {file}. Skipping this file.")
            continue
        if "estimated counts" not in df.columns:
            print(f"Column 'estimated counts' not found in {file}. Skipping this file.")
            continue

        # Use the file's basename (without extension) as the sample name.
        sample_name = os.path.splitext(os.path.basename(file))[0]
        sample_names.append(sample_name)

        # Check if all required taxonomy columns exist in the file.
        missing = [col for col in required_tax_cols if col not in df.columns]
        if missing:
            print(f"Missing taxonomy columns {missing} in file {file}. Skipping taxonomy info for this file.")
            continue

        # Group by Target_ID to sum the estimated counts for this sample.
        grouped = df.groupby("Target_ID", as_index=False)["estimated counts"].sum()

        # Get taxonomy info for each OTU by taking the first occurrence for the required taxonomy columns.
        tax_info = df.groupby("Target_ID", as_index=False)[required_tax_cols].first().set_index("Target_ID")

        # Reformat taxonomy info to the desired columns (and order).
        tax_info_new = pd.DataFrame(index=tax_info.index)
        for new_col, orig_col in desired_columns.items():
            tax_info_new[new_col] = tax_info[orig_col]

        # Update dictionaries with the data from this file.
        for _, row in grouped.iterrows():
            otu = row["Target_ID"]
            count_val = row["estimated counts"]
            if otu not in otu_dict:
                otu_dict[otu] = {}
            otu_dict[otu][sample_name] = count_val

            # Save taxonomy info only the first time this OTU is encountered.
            if otu not in tax_dict and otu in tax_info_new.index:
                tax_dict[otu] = tax_info_new.loc[otu].to_dict()

    # Create OTU table DataFrame.
    otu_df = pd.DataFrame.from_dict(otu_dict, orient="index")
    # Reindex columns to ensure a consistent sample order.
    otu_df = otu_df.reindex(columns=sample_names).fillna(0)
    otu_df.index.name = "Target_ID"

    # Create taxonomy table DataFrame.
    tax_df = pd.DataFrame.from_dict(tax_dict, orient="index")
    tax_df.index.name = "Target_ID"

    # Write the tables to tab-delimited files.
    otu_df.to_csv(args.otu, sep="\t")
    tax_df.to_csv(args.tax, sep="\t")
    print(f"OTU table written to {args.otu}")
    print(f"Taxonomy table written to {args.tax}")


if __name__ == "__main__":
    main()
For building the sample metadata sheet I usually just use Excel or whatever is easiest.
The phyloseq construction below uses the OTU, TAX, and metadata files made prior.
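For reference, a minimal sketch of the metadata layout this expects; the sample names and Disease values here are hypothetical placeholders, the real file simply needs row names that match the OTU table column names plus whatever covariate columns you use later (e.g. Disease).
# Hypothetical illustration only: row names must match the OTU table column names
example_meta <- data.frame(
  row.names = c("pre_B010_final_abundance_counts", "post_B010_final_abundance_counts"),
  Disease   = c("Pre", "Post")
)
example_meta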
# Read in the OTU table
# Assuming your OTU table is tab-delimited and the first column contains OTU IDs
otu_df <- read.delim("/home/tmhagm8/scratch/rice_hmri_concussion/data/otu_table.tsv",
header = TRUE, sep = "\t", row.names = 1)
# Convert the data frame to a matrix
otu_mat <- as.matrix(otu_df)
# Create the OTU table object (here rows = OTUs)
OTU <- otu_table(otu_mat, taxa_are_rows = TRUE)
# Read in the taxonomy table
# Again, assuming a tab-delimited file with OTU IDs as the first column
tax_df <- read.delim("/home/tmhagm8/scratch/rice_hmri_concussion/data/tax_table.tsv",
header = TRUE, sep = "\t", row.names = 1)
# Convert the taxonomy data to a character matrix
tax_mat <- as.matrix(tax_df)
# Create the tax_table object
TAX <- tax_table(tax_mat)
# Read in the sample metadata
# Here the metadata file is a CSV with a header and a column "SampleID" as the row names
meta <- read.csv("/home/tmhagm8/scratch/rice_hmri_concussion/data/metadata_phyloseq.csv",
header = TRUE, row.names = 1)
# Create the sample_data object
SAMPLES <- sample_data(meta)
# Combine all parts into a phyloseq object
physeq <- phyloseq(OTU, TAX, SAMPLES)
# Optional: print the phyloseq object to inspect it
print(physeq)
Check your samples and make sure everything looks okay; in our case post_B023 had very few counts for anything, so we'll remove it before saving the phyloseq object.
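One quick way to spot such low-count samples is to look at the per-sample totals; a minimal sketch using phyloseq's sample_sums():
# Total classified reads per sample; unusually small totals flag samples to drop
sort(sample_sums(physeq))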
physeq <- physeq |>
tax_fix()
#View(tax_table(physeq))
#View(otu_table(physeq))
#View(sample_data(physeq))
physeq_filt = subset_samples(physeq, sample_names(physeq)!="post_B023_final_abundance_counts")
#View(otu_table(physeq_filt))
# And save the physeq as an RDS so we can use it in ancombc2 analysis locally on my macbook
saveRDS(physeq_filt, file = "25_02_21_lemur_physeq.rds")
ps = physeq_filt
Up to this point our work is solid; now let's analyze the diversity. I used microViz for CLR-transformed relative abundance plots, ordination, and diversity estimation, with ANCOM-BC2 for differential abundance estimates.
For all microbiome analyses: how do you pick a filtering threshold? It depends on which analysis method you are filtering for.
If you're unsure, check out the microViz documentation or read some of Amy Willis' statdivlab papers!
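As a minimal illustration of what such filtering can look like with microViz (the cut-offs below are placeholders, not the values used in this study):
# Illustrative thresholds only: keep taxa seen in at least 2 samples
# with at least 100 total counts, then check how many taxa remain
ps |>
  tax_filter(min_prevalence = 2, min_total_abundance = 100) |>
  ntaxa()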
Shannon Diversity Calculation
We took the shannon_Species value for each covariate and plotted the values in our Prism document.
# Calculate Shannon
shan_data <- ps |>
ps_calc_diversity(rank = "Species", index = "shannon") |>
samdat_tbl()
shan_filt <- shan_data |>
select(Disease, shannon_Species)
write.csv(shan_filt, file = "/condo/neurobiome/tmhagm8/rice_hmri_concussion/data/rice_concuss_25_shannon.csv")
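Since these values were exported for Prism, the equivalent nonparametric test can also be run directly in R; a minimal sketch using base R and the FSA package loaded above, assuming Disease holds the group labels:
# Kruskal-Wallis across Disease groups, then Dunn's post hoc test with Holm correction
shan_filt$Disease <- factor(shan_filt$Disease)
kruskal.test(shannon_Species ~ Disease, data = shan_filt)
FSA::dunnTest(shannon_Species ~ Disease, data = shan_filt, method = "holm")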
Chao1 Richness Calculation
The same as above, this time using the Chao1 index.
# Calculate Chao1 richness
chao1_data <- ps |>
ps_calc_richness(rank = "Species", index = "chao1") |>
samdat_tbl()
chao1_filt <- chao1_data |>
select(Disease, chao1_Species)
write.csv(chao1_filt, file = "/condo/neurobiome/tmhagm8/rice_hmri_concussion/data/rice_concuss_25_chao1.csv")
Intro to beta diversity
This is useful for seeing where we want to filter our data for beta diversity (only light filtering).
# gotta start off with read filtering
ps |>
ps_mutate(reads = sample_sums(ps)) |> #this will get you read depth!
samdat_tbl() |>
ggplot(aes(x = reads)) +
geom_freqpoly(bins = 100) +
geom_rug(alpha = 0.5) +
scale_x_log10(labels = scales::label_number()) +
labs(x = "Number of classified reads", y = NULL) +
theme_bw()
# lets find where we should filter our data
ps_Stats <- tibble(
taxon = taxa_names(ps),
prevalence = microbiome::prevalence(ps),
total_abundance = taxa_sums(ps)
)
# plot
p <- ps_Stats |>
ggplot(aes(total_abundance, prevalence)) +
geom_point(alpha = 0.5) +
geom_rug(alpha = 0.1) +
scale_x_continuous(
labels = scales::label_number(), name = "Total Abundance"
) +
scale_y_continuous(
labels = scales::label_percent(), breaks = scales::breaks_pretty(n = 9),
name = "Prevalence (%)",
sec.axis = sec_axis(
trans = ~ . * nsamples(ps), breaks = scales::breaks_pretty(n = 9),
name = "Prevalence (N samples)"
)
) +
theme_bw()
# and add the taxa labels
p + ggrepel::geom_text_repel(
data = function(df) filter(df, total_abundance > 1e9 | prevalence > 0.6),
mapping = aes(label = taxon), size = 2.5, min.segment.length = 0, force = 15)
#ggsave("div_stats.png", height = 5, width = 7, units = "in")
Now that we have an idea of how to filter our data for beta diversity analyses, we plot a species-level PCoA using Aitchison distance, coloured by group. We also computed PERMANOVA values between groups, which were added to the figure later using CorelDRAW.
ps %>%
tax_transform("identity", rank = "Species") %>%
dist_calc(dist = "aitchison") %>%
ord_calc("PCoA") %>%
ord_plot(color = "Disease", size = 2) +
scale_colour_brewer(palette = "Dark2", aesthetics = c("fill", "colour"), name = "State") +
theme_bw() +
ggside::geom_xsidedensity(aes(fill = Disease), alpha = 0.5, show.legend = FALSE) +
ggside::geom_ysidedensity(aes(fill = Disease), alpha = 0.5, show.legend = FALSE) +
ggside::theme_ggside_void()
ggsave("species_aitch_pcoa.png", height = 5, width = 7, units = "in")
It is super useful to make your own palette and assign it to certain taxa; this helps your viewers understand trends in the data. I altered this figure in CorelDRAW to widen it and turn it vertical (with the facet labels).
myPal <- tax_palette(
data = ps, rank = "Genus", n = 25, pal = "greenArmytage",
add = c(Other = "gray")
)
tax_palette_plot(myPal)
ps |>
comp_barplot(
tax_level = "Genus", n_taxa = 10,
taxa_order = sum,
sample_order = "bray", bar_outline_colour = NA,
) +
facet_grid(
rows = vars(Disease),
scales = "free", space = "free" # these options are critically important!
) +
coord_flip() +
theme(axis.text.y = element_blank(), axis.ticks.y = element_blank())
ggsave("relabund_genus.png", height = 5, width = 7, units = "in")
Heatmaps of count data
Heatmap of abundance by treatment group; this is a heavily filtered subset, but that is needed for this type of figure. I prettied this up in CorelDRAW by removing the legend and changing the order of the samples. It was not used in our manuscript but is still nice to have.
# I do not know why this doesn't print when ggsaving, so to save the figure I run this in a separate R script and save it as a pdf that way.
# Assign colors for Injury states
cols_disease <- distinct_palette(n = length(unique(samdat_tbl(ps)$Disease)), add = NA)
names(cols_disease) <- unique(samdat_tbl(ps)$Disease)
# Create the heatmap with both Injury and Treatment annotations
ps |>
# Sort samples by similarity
ps_seriate(rank = "Genus", tax_transform = "compositional", dist = "bray") |>
# Arrange samples by Injury state
ps_arrange(Disease) |>
# Apply CLR transformation
tax_transform("clr", rank = "Genus") |>
# Filter taxa by prevalence and abundance
tax_filter(min_prevalence = 0.2, verbose = FALSE, min_total_abundance = 100) |>
# Generate compositional heatmap
comp_heatmap(
tax_anno = taxAnnotation(
Prev. = anno_tax_prev(bar_width = 0.3, size = grid::unit(1, "cm"))
),
sample_anno = sampleAnnotation(
InjuryState = anno_sample("Disease"),
col = list(InjuryState = cols_disease),
border = FALSE
),
colors = heat_palette(palette = "RdBu", rev = TRUE),
sample_seriation = "Identity" # Keep original sample order
)
ggsave("nfl_htmp_genus.png", height = 5, width = 7, units = "in")
Differential abundance analysis for the three severity levels we investigated. For this analysis I wrote the outputs to a CSV file and visualized them using Prism, but this is entirely possible in R; I recommend the ANCOM-BC2 Bioconductor tutorial if you want to do it that way.
For some reason ANCOM-BC2 does not work on my institute's R server, so I just run this step on my MacBook.
# round otu counts
round_ps <- round(otu_table(ps))
# Rebuild the phyloseq object with the rounded counts
good_ps <- phyloseq(round_ps, ps@tax_table, ps@sam_data)
good_ps_filt <- good_ps |>
subset_samples(Injury == "TBI_2X" & Treatment == "VEH")
# Read in phyloseq for ancombc
tse <- mia::makeTreeSummarizedExperimentFromPhyloseq(good_ps_filt)
# Subset to the three timepoints of interest
tse <- tse[, tse$Timepoint %in% c("Baseline", "After_Treatment", "5dpi")]
# To manually change the reference level, for instance setting `Baseline`
# as the reference level, use:
tse$Timepoint <- factor(tse$Timepoint, levels = c("Baseline", "After_Treatment", "5dpi"))
# set seed for ancombc
set.seed(123)
# run ancombc
output3 <- ancombc2(data = tse, assay_name = "counts", tax_level = "Genus",
fix_formula = "Timepoint", rand_formula = NULL,
p_adj_method = "holm", pseudo_sens = TRUE,
prv_cut = 0.1, lib_cut = 10000, s0_perc = 0.05,
group = "Timepoint", struc_zero = TRUE, neg_lb = TRUE,
alpha = 0.05, n_cl = 2, verbose = TRUE,
global = TRUE, pairwise = TRUE, dunnet = TRUE, trend = TRUE,
iter_control = list(tol = 1e-2, max_iter = 20,
verbose = TRUE),
em_control = list(tol = 1e-5, max_iter = 100),
lme_control = lme4::lmerControl(),
mdfdr_control = list(fwer_ctrl_method = "holm", B = 100),
trend_control = list(contrast = list(matrix(c(1, 0, -1, 1),
nrow = 2,
byrow = TRUE),
matrix(c(-1, 0, 1, -1),
nrow = 2,
byrow = TRUE),
matrix(c(1, 0, 1, -1),
nrow = 2,
byrow = TRUE)),
node = list(2, 2, 1),
solver = "ECOS",
B = 100))
# How many structural zeroes?
tab_zero <- output3$zero_ind
tab_zero |>
datatable(caption = "The detection of structural zeros")
#Print output to dataframe
res_prim3 <- output3$res
#Save ancombc stats for each subset
#write.csv(res_prim3, file = "25_01_03_depl_2XTBI_VEH_timepoint_ancom.csv", row.names = FALSE)
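If you would rather stay in R instead of Prism, here is a hedged sketch of plotting the primary results; the lfc_/diff_ column names follow ANCOM-BC2's naming convention and depend on your factor levels, so adjust them to your own output.
library(dplyr)
library(tidyr)
library(ggplot2)
# Keep taxa flagged as differentially abundant in any contrast, then plot their log fold changes
res_prim3 |>
  filter(if_any(starts_with("diff_"), ~ .x)) |>
  select(taxon, starts_with("lfc_")) |>
  pivot_longer(-taxon, names_to = "contrast", values_to = "lfc") |>
  ggplot(aes(x = reorder(taxon, lfc), y = lfc, fill = contrast)) +
  geom_col(position = "dodge") +
  coord_flip() +
  labs(x = NULL, y = "Log fold change (ANCOM-BC2)") +
  theme_bw()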
Here is an example of a plot using this data that was made in GraphPad Prism
We used the same preprocessed fastq data that went into lemur as input to the aviary v0.10.0 pipeline, which handles the rest of the processing (assembly, binning, mapping, classification, etc.).
#! /bin/bash
#SBATCH --time=04-00:00:00
#SBATCH --partition=defq
#SBATCH --mail-user=agmarshall@houstonmethodist.org
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH --ntasks-per-node=64
#SBATCH --mem=192GB
#SBATCH --nodes=1
#SBATCH --job-name=veh_mags
#SBATCH --comment=veh_mags
module load mamba
mamba activate aviary
aviary complete \
--longreads /path/to/veh_depl_final.fastq \
--skip-binners metabat \
--min-mean-q 6 \
--long_read_type ont_hq \
-t 32 \
-n 64 \
--tmp /path/to/aviary_veh_tmp \
--semibin-model mouse_gut \
--medaka-model r941_min_sup_g507 \
-o aviary_veh \
-w annotate
I like aviary because, in my experience, it produces MAGs that are much more complete than those from other pipelines like SqueezeMeta, but use whatever works for you. P.S. nf-core/mag is a nice pipeline as well but does not have a nanopore-only (long-read only) option.
Aviary does have an annotation pipeline that I ran; however, to get more accurate annotations I also opted to use my group's software, SeqScreen. Note that to run SeqScreen you must first convert your fastq data to fasta.
Running Bakta on high quality MAGs
#! /bin/bash
#SBATCH --time=04-00:00:00
#SBATCH --partition=defq
#SBATCH --mail-user=myemail@myemail.org
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH --ntasks-per-node=64
#SBATCH --mem=192GB
#SBATCH --nodes=1
#SBATCH --job-name=bakta
#SBATCH --comment=bakta
module load mamba
mamba activate bakta
# Set your directories and parameters
INPUT_DIR="/path/to/abx_bins"
DB_PATH="/path/to/db"
EXT="fna" # file extension of your genome files
# Loop over all genome files in the input directory
for genome in "${INPUT_DIR}"/*.${EXT}; do
# Extract the sample name by stripping the directory and file extension
sample=$(basename "${genome}" .${EXT})
# Run bakta on the current genome, output to a directory named after the sample
bakta \
--db "${DB_PATH}" \
--output "${sample}" \
--meta \
--species "${sample}" \
"${genome}"
echo "Processed sample: ${sample}"
done
To better understand the functions present within our metagenomic datasets we used SeqScreen, following the script below:
#! /bin/bash
#SBATCH --time=04-00:00:00
#SBATCH --partition=bigmemq
#SBATCH --mem=512GB
#SBATCH --mail-user=agmarshall@houstonmethodist.org
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH --ntasks-per-node=110
#SBATCH --nodes=1
#SBATCH --job-name=seqscreen
#SBATCH --comment=seqscreen
module load mamba
mamba activate 16s
seqkit fq2fa /path/to/veh_depl_final.fastq -o /path/to/veh_depl_final.fasta
mamba activate seqscreen
seqscreen \
--ont \
--fasta /path/to/veh_depl_final.fasta \
--databases /path/to/SeqScreenDB_23.3 \
--working veh_depl_seqscreen \
--threads 110
Now, to parse the SeqScreen results I used an R script, as the functional annotation output was a 1.4 GB .txt file. This is a very useful script because it can pull out whatever GO terms, PATRIC IDs, or UniProt IDs you may be looking for and compare them between two SeqScreen dataframes. Here we used it to compare the number of GO term hits related to acetate, propionate, butyrate, and broad-range SCFAs between our vehicle and antibiotic groups (5 dpi mice).
To use this R script you'll need to have processed the two groups you want to compare; in this document I only show the processing of one group, so just repeat it for the other.
# seqscreen scraping and plotting to percentage
library(dplyr)
library(tidyr)
library(ggplot2)
library(ggprism)
library(data.table)
#----------------------------
# 1. Read data
#----------------------------
df_veh <- fread("/path/to/functional_results.txt", check.names = FALSE, header = TRUE)
df_abx <- fread("/path/to/functional_results.txt", check.names = FALSE, header = TRUE)
scfa_go_terms <- c(
# Acetate-related
"GO:0006083", # acetate metabolic process
"GO:0006846", # acetate transport
"GO:0019427", # acetyl-CoA biosynthetic process from acetate
# Propionate-related
"GO:0019679", # propionate metabolic process
"GO:0019629", # propionate catabolic process, 2-methylcitrate cycle
"GO:0019543", # propionate catabolic process
# Butyrate-related
"GO:0019605", # butyrate metabolic process
"GO:0047761", # butyrate kinase activity
"GO:0046359", # butyrate catabolic process
# Others
"GO:0046459", # short-chain fatty acid metabolic process
"GO:0051790", # short-chain fatty acid biosynthetic process
"GO:0019626" # short-chain fatty acid catabolic process
)
# Pattern for grep search
scfa_pattern <- paste(scfa_go_terms, collapse = "|")
# Filter for SCFA-related GO terms
df_veh_filt <- df_veh[grepl(scfa_pattern, df_veh$go, ignore.case = TRUE)]
df_abx_filt <- df_abx[grepl(scfa_pattern, df_abx$go, ignore.case = TRUE)]
# Separate rows for multiple GO terms in a single row
df_veh_sep <- df_veh_filt |>
separate_rows(go, sep = ";")
df_abx_sep <- df_abx_filt |>
separate_rows(go, sep = ";")
# Define SCFA groups
acetate <- c("GO:0006083", "GO:0006846", "GO:0019427")
propionate <- c("GO:0019679", "GO:0019629", "GO:0019543")
butyrate <- c("GO:0019605", "GO:0047761", "GO:0046359")
others <- c("GO:0046459", "GO:0051790", "GO:0019626")
scfa_groups <- data.frame(
go = c(acetate, propionate, butyrate, others),
group = c(
rep("acetate", length(acetate)),
rep("propionate", length(propionate)),
rep("butyrate", length(butyrate)),
rep("others", length(others))
)
)
# Tag samples
df_veh_sep$sample <- "vehicle"
df_abx_sep$sample <- "antibiotics"
# Combine
df_combined <- rbind(df_veh_sep, df_abx_sep)
# Left join with the group definitions
df_annotated <- df_combined |> left_join(scfa_groups, by = "go")
#----------------------------
# 2. Count hits by group
#----------------------------
df_counts <- df_annotated |>
filter(!is.na(group)) |>
group_by(sample, group) |>
summarize(num_hits = n(), .groups = "drop")
# Define factor order
df_counts$group <- factor(
df_counts$group,
levels = c("acetate", "butyrate", "propionate", "others")
)
df_counts$sample <- factor(
df_counts$sample,
levels = c("vehicle", "antibiotics")
)
#----------------------------
# 3. Normalization to Relative Abundance (%)
#----------------------------
# Add the read counts calculated from coverm_abundances.log here
read_depths <- data.frame(
sample = c("vehicle", "antibiotics"),
total_reads = c(2768561, 4024478)
)
# Join read-depth info to your df_counts and calculate normalized counts per million reads
df_counts_norm <- df_counts |>
left_join(read_depths, by = "sample") |>
mutate(num_hits_norm = (num_hits / total_reads) * 1e6)
# Normalize counts so that each sample's groups sum to 100%
df_counts_norm <- df_counts_norm %>%
group_by(sample) %>%
mutate(percentage = (num_hits_norm / sum(num_hits_norm)) * 100) %>%
ungroup()
#----------------------------
# 4. Plot normalized relative abundances
#----------------------------
# Fix the order of samples
df_counts_norm$sample <- factor(
df_counts_norm$sample,
levels = c("vehicle", "antibiotics") # vehicle first, then antibiotics
)
# Define custom colors for SCFA groups
custom_colors_groups <- c(
"acetate" = "#0072B2", # blue
"propionate" = "#E69F00", # orange
"butyrate" = "#009E73", # green
"others" = "#CC79A7" # reddish-purple
)
# Create a stacked bar plot of relative abundances
ggplot(df_counts_norm, aes(x = sample, y = percentage, fill = group)) +
geom_col() + # Stacked bars by group
labs(
x = "Treatment Group",
y = "Relative Abundance (%)",
fill = "SCFA Group",
title = "Normalized SCFA-related SeqScreen GO Term Hits"
) +
scale_fill_manual(values = custom_colors_groups) +
theme_prism()
#ggsave("ss_goterm_counts_stacked.png", height = 8, width = 6, units = "in")
I also ran a few other programs to analyze these data; they probably won't make it into the paper, but the commands below are good references on how to use them.
Making a sankey plot from the Lemur output
I made this python script to plot the taxa present in our treatment groups after minimum abundance filtering.
#!/usr/bin/env python3
"""
Compare two LEMUR-like TSV files (e.g., Vehicle and Treatment),
filter by abundance (F), and build a Sankey diagram with an extra
column showing whether a taxon is unique or shared.
Usage:
python script.py <vehicle_input.tsv> <treatment_input.tsv> [output.html] [output.pdf] [output.svg]
Requires:
Make a python venv and run:
pip install plotly pandas kaleido
Will take two Lemur outputs for input and filter by threshold (default 1.5% species abundance)
and show shared taxa between files in sankey
"""
import sys
import pandas as pd
import plotly.graph_objects as go
def main():
    # 1) Parse arguments
    if len(sys.argv) < 3:
        print(f"Usage: {sys.argv[0]} <vehicle_input.tsv> <treatment_input.tsv> [output.html] [output.pdf] [output.svg]")
        sys.exit(1)

    vehicle_tsv = sys.argv[1]
    treatment_tsv = sys.argv[2]
    output_html = sys.argv[3] if len(sys.argv) > 3 else "sankey.html"
    output_pdf = sys.argv[4] if len(sys.argv) > 4 else None
    output_svg = sys.argv[5] if len(sys.argv) > 5 else None

    # 2) Read the two TSV files
    df_vehicle = pd.read_csv(vehicle_tsv, sep='\t', header=0, dtype=str)
    df_treatment = pd.read_csv(treatment_tsv, sep='\t', header=0, dtype=str)

    # Convert the numeric column `F` to float and fill NaNs with 0.0
    df_vehicle["F"] = pd.to_numeric(df_vehicle["F"], errors="coerce").fillna(0.0)
    df_treatment["F"] = pd.to_numeric(df_treatment["F"], errors="coerce").fillna(0.0)

    # 3) Filter rows in both files by the same threshold (e.g., F >= 1.5% abundant)
    threshold = 0.015
    df_vehicle = df_vehicle[df_vehicle["F"] >= threshold].copy()
    df_treatment = df_treatment[df_treatment["F"] >= threshold].copy()

    # 4) Add a new column to mark which dataset each row belongs to
    df_vehicle["Condition"] = "Vehicle"
    df_treatment["Condition"] = "Treatment"

    # 5) Define the taxonomic columns we want to track
    tax_levels = ["superkingdom", "phylum", "class", "order",
                  "family", "genus", "species"]

    # Ensure the taxonomic columns exist, fill missing with ""
    for col in tax_levels:
        if col not in df_vehicle.columns:
            df_vehicle[col] = ""
        if col not in df_treatment.columns:
            df_treatment[col] = ""

    # 6) Combine both dataframes
    df_all = pd.concat([df_vehicle, df_treatment], ignore_index=True)

    # 7) Identify which taxa (based on the full path) are in each set
    def get_tax_id(row):
        return tuple(row[level] for level in tax_levels)

    vehicle_taxa = set(get_tax_id(row) for _, row in df_vehicle.iterrows())
    treatment_taxa = set(get_tax_id(row) for _, row in df_treatment.iterrows())

    # 8) Determine Overlap: "Vehicle only", "Treatment only", or "Both"
    def overlap_label(condition, tax_id):
        in_vehicle = (tax_id in vehicle_taxa)
        in_treatment = (tax_id in treatment_taxa)
        if in_vehicle and in_treatment:
            return "Both"
        elif in_vehicle and not in_treatment:
            return "Vehicle only"
        elif not in_vehicle and in_treatment:
            return "Treatment only"
        else:
            # Shouldn't happen if we're only dealing with filtered data in df_all
            return "None"

    df_all["Overlap"] = df_all.apply(
        lambda row: overlap_label(row["Condition"], get_tax_id(row)), axis=1
    )

    # 9) Build the path for the Sankey. We'll add "Condition" at the start,
    #    then the standard taxonomic columns, then "Overlap" at the end.
    sankey_levels = ["Condition"] + tax_levels + ["Overlap"]

    link_abundance = {}
    for _, row in df_all.iterrows():
        abundance = row["F"]
        path = [row[level] for level in sankey_levels]

        for i in range(len(path) - 1):
            src = path[i]
            tgt = path[i + 1]

            if src == "" or tgt == "":
                continue

            link_abundance[(src, tgt)] = (
                link_abundance.get((src, tgt), 0.0) + abundance
            )

    # 10) Collect all nodes and build index maps
    all_nodes = set()
    for (src, tgt) in link_abundance.keys():
        all_nodes.add(src)
        all_nodes.add(tgt)
    all_nodes = list(all_nodes)
    node_index = {n: i for i, n in enumerate(all_nodes)}

    sources = []
    targets = []
    values = []

    for (src, tgt), val in link_abundance.items():
        sources.append(node_index[src])
        targets.append(node_index[tgt])
        values.append(val)

    # 11) Create the Sankey figure
    fig = go.Figure(
        data=[go.Sankey(
            arrangement="snap",
            node=dict(
                pad=15,
                thickness=20,
                label=all_nodes,
            ),
            link=dict(
                source=sources,
                target=targets,
                value=values
            )
        )]
    )

    # Adjust margins to help avoid label clipping
    fig.update_layout(
        title_text="Vehicle vs. Treatment Sankey (with Overlap)",
        font_size=12,
        margin=dict(l=250, r=50, t=50, b=50)  # increase left margin for longer labels
    )

    # 12) Save HTML
    fig.write_html(output_html)
    print(f"Saved Sankey plot to {output_html}")

    # 13) (Optional) Save PDF - requires kaleido installed
    if output_pdf:
        fig.write_image(output_pdf, format="pdf")
        print(f"Saved Sankey plot to {output_pdf}")

    # 14) (Optional) Save SVG - also requires kaleido
    if output_svg:
        fig.write_image(output_svg, format="svg", width=1600, height=1200)
        print(f"Saved Sankey plot to {output_svg}")
if __name__ == "__main__":
main()
And to run this script, save it as lemur_sankey.py:
python -m venv lemur_sankey
source lemur_sankey/bin/activate
pip install plotly pandas kaleido
chmod +x lemur_sankey.py
./lemur_sankey.py vehicle_input.tsv treatment_input.tsv output.html output.pdf output.svg
I also ran rhea to see whether there are any differences in the assembly graphs relating to SNPs in genomes found in common between treatment groups.
To get rhea working, I installed the dependencies and made a virtual env using the code below
git clone https://github.com/treangenlab/rhea.git
git clone https://github.com/fenderglass/Flye
cd Flye
make
git clone https://github.com/lh3/minigraph
cd minigraph && make
python -m venv rhea_env
source rhea_env/bin/activate
pip install numpy pandas seaborn networkx
cd rhea
python rhea.py --help
And I ran rhea using the code below:
#! /bin/bash
#SBATCH --time=04-00:00:00
#SBATCH --partition=bigmemq
#SBATCH --mem=192GB
#SBATCH --mail-user=agmarshall@houstonmethodist.org
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH --ntasks-per-node=64
#SBATCH --nodes=1
#SBATCH --job-name=rhea
#SBATCH --comment=rhea
source /home/tmhagm8/scratch/rhea_env/bin/activate
python /path/to/rhea/rhea.py \
--type nano-hq \
--flye-exec /path/to/Flye/bin/flye \
--minigraph-exec /path/to/minigraph/minigraph \
--threads 64 \
--output-dir /path/to/rhea_outdir \
/path/to/veh_depl_final.fastq /path/to/abx_depl_final.fastq
Let's take a peek at the assembly graph for the ABX group; this was done with Bandage. We see a circular genome, yay!