Developer information

This section collects information for developers and will be extended in the future.

Reading output formats

Although TomoTwin writes files with various extensions (.temb, .tmap, .tloc, .tumap), they are all pickled pandas dataframes and can be read the same way:

import pandas as pd
df = pd.read_pickle("path/to/a/tomotwin/output/file")

If you modify the dataframe, also check its df.attrs dictionary and copy it to the modified dataframe if necessary. It contains important metadata that TomoTwin relies on.
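A minimal sketch of modifying an output file while keeping its metadata. The helper filter_keep_attrs, the column name metric_best, the threshold and the file names are assumptions chosen for illustration and are not part of TomoTwin; inspect df.columns and df.attrs of your own file first.

```python
import copy

import pandas as pd


def filter_keep_attrs(df: pd.DataFrame, column: str, minimum: float) -> pd.DataFrame:
    """Keep rows where df[column] >= minimum and carry df.attrs over explicitly."""
    filtered = df[df[column] >= minimum].reset_index(drop=True)
    # Not every pandas operation propagates attrs, so copy them explicitly.
    filtered.attrs = copy.deepcopy(df.attrs)
    return filtered


# Usage (hypothetical file and column names):
# df = pd.read_pickle("located.tloc")
# print(df.columns, df.attrs.keys())
# filter_keep_attrs(df, "metric_best", 0.5).to_pickle("located_filtered.tloc")
```

Copying the attributes explicitly is the conservative choice, since it does not depend on which pandas operations happen to preserve them.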

Implementing new architectures

Adding new CNN architectures is straightforward in TomoTwin.

1. Add a class for your network in modules/networks/ and implement the interface defined by modules/networks/torchmodel.py.
2. Add your new network to the network_identifier_map dictionary in the modules/networks/networkmanager.py module.
3. Create a network configuration file like resources/configs/config_siamese.json. The network_config entry should match the arguments of the __init__ method of your new network.
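The relationship between the configuration file and the __init__ method can be illustrated with a small standalone sketch. This is not TomoTwin's code: MyNet, its parameters, the build_network helper and the "identifier" key are hypothetical stand-ins, and the dictionary below only mimics the role of network_identifier_map. The point is that every key in network_config is passed as a keyword argument, so a mismatch can be caught before training starts.

```python
import inspect


class MyNet:
    """Stand-in for a new network class; the parameters are hypothetical."""

    def __init__(self, output_channels: int, dropout: float = 0.2):
        self.output_channels = output_channels
        self.dropout = dropout


# Hypothetical analogue of the identifier-to-class dictionary.
network_identifier_map = {"MyNet": MyNet}


def build_network(config: dict):
    """Instantiate the class named in the config after checking its arguments."""
    cls = network_identifier_map[config["identifier"]]
    params = {
        name: p
        for name, p in inspect.signature(cls.__init__).parameters.items()
        if name != "self"
        and p.kind in (p.POSITIONAL_OR_KEYWORD, p.KEYWORD_ONLY)
    }
    given = set(config["network_config"])
    unknown = given - set(params)
    missing = {
        name for name, p in params.items() if p.default is p.empty
    } - given
    if unknown or missing:
        raise ValueError(
            f"network_config does not match {cls.__name__}.__init__: "
            f"unknown={sorted(unknown)}, missing={sorted(missing)}"
        )
    return cls(**config["network_config"])


# Usage: the dictionary mirrors what a JSON configuration file might contain.
net = build_network({"identifier": "MyNet", "network_config": {"output_channels": 32}})
```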

You are now, in principle, ready to train your network (see How to train TomoTwin).

How to train TomoTwin

Here we describe how to train the SiameseNet (a misnomer, as it is actually a triplet network). In terms of hardware, 12 GB of GPU memory should be enough.

1. Download training and validation data

The training and validation sets can be found here:

https://zenodo.org/record/6637456

Download and untar the training and validation data.

2. Download siamese network config

You can find the configuration file here:

https://github.com/MPI-Dortmund/tomotwin-cryoet/blob/main/resources/configs/config_siamese.json

3. Start the training

To train on one GPU for 300 epochs, run:

CUDA_VISIBLE_DEVICES=0 tomotwin_train.py -v path/train/volumes/ --validvolumes path/valid/volumes/ -o out_train -nc path/to/siamese_network.json --epochs 300

How to evaluate TomoTwin

To evaluate TomoTwin you’ll need a tomogram to pick on and ground truth coordinates of particles to compare TomoTwin’s picks against. You can find an example tomogram with references and ground truth positions here:

https://zenodo.org/records/15631632

You can download it with:

mkdir eval
cd eval
wget https://zenodo.org/api/records/15631632/files-archive -O eval.zip
unzip eval.zip
mkdir refs
unzip references.zip
mv gen01_t*.mrc refs/

Note: this tomogram contains proteins that were not in the training/validation data used to train TomoTwin. Therefore, it is useful to assess the generalization of any models trained using our publicly available training data.

You can also download a JSON file containing predetermined box sizes for all of the proteins in the training/validation/evaluation data with:

wget https://raw.githubusercontent.com/MPI-Dortmund/tomotwin-cryoet/main/resources/boxsizes.json
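Because a download from a web page URL can silently save an HTML page instead of the data, it may be worth checking the file before using it. The sketch below only assumes that the file is a JSON object at the top level (for example a mapping from protein name to box size); that structure and the helper name load_boxsizes are assumptions, not something TomoTwin provides.

```python
import json
from pathlib import Path


def load_boxsizes(path):
    """Load the box size file and fail early if it is not a JSON object."""
    text = Path(path).read_text(encoding="utf-8")
    try:
        data = json.loads(text)
    except json.JSONDecodeError as err:
        raise ValueError(
            f"{path} is not valid JSON (was an HTML page saved instead?)"
        ) from err
    # Assumption: the top level is an object, e.g. protein name -> box size.
    if not isinstance(data, dict):
        raise ValueError(f"{path} does not contain a JSON object")
    return data


# Usage:
# boxsizes = load_boxsizes("boxsizes.json")
# print(len(boxsizes), "entries")
```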

To run the evaluation, first use the reference-based workflow to generate a locate file (replace /path/to/model.pth with the path to your model):

CUDA_VISIBLE_DEVICES=0,1 tomotwin_embed.py tomogram -m /path/to/model.pth -v tiltseries_rec.mrc -o ./ -b 256
CUDA_VISIBLE_DEVICES=0,1 tomotwin_embed.py subvolumes -m /path/to/model.pth -v refs/ -b 8 -o ./
tomotwin_map.py distance -r embeddings.temb -v tiltseries_rec_embeddings.temb --refine -o ./
tomotwin_locate.py findmax -m map.tmap -o ./ --write_heatmaps

Then run the evaluation with:

tomotwin_scripts_evaluate.py positions -p particle_positions.txt -l located.tloc -s boxsizes.json --optim --stepsize_optim_similarity 0.01

The script will report the picking statistics for each protein in the locate file. The --optim flag enables metric and size threshold optimization for each protein, and --stepsize_optim_similarity controls the step size of the metric threshold optimization (default 0.05). Increasing the step size makes the script run faster, but at the cost of a coarser, less well optimized threshold.
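The trade-off behind the step size can be illustrated with a toy threshold sweep. This is not the implementation used by tomotwin_scripts_evaluate.py: the function best_threshold, the choice of F1 as the score to maximize and the example data are all assumptions made for illustration. A smaller step means more thresholds are tried, which costs more time but can find a better cut-off.

```python
def best_threshold(scores, is_true, n_truth, step=0.05):
    """Sweep similarity thresholds in [0, 1]; return (threshold, F1) with the highest F1."""
    best = (0.0, 0.0)
    n_steps = int(round(1.0 / step))
    for i in range(n_steps + 1):
        threshold = i * step
        # Keep only picks whose similarity reaches the current threshold.
        picked = [ok for s, ok in zip(scores, is_true) if s >= threshold]
        true_positives = sum(picked)
        if true_positives == 0:
            continue
        precision = true_positives / len(picked)
        recall = true_positives / n_truth
        f1 = 2 * precision * recall / (precision + recall)
        if f1 > best[1]:
            best = (threshold, f1)
    return best


# Toy data: similarity of each pick and whether it matches a ground truth particle.
scores = [0.91, 0.88, 0.62, 0.58, 0.30]
is_true = [True, True, False, True, False]

fine = best_threshold(scores, is_true, n_truth=3, step=0.01)
coarse = best_threshold(scores, is_true, n_truth=3, step=1.0)
print("fine sweep:", fine, "coarse sweep:", coarse)
```

In this toy example the coarse sweep only tries the thresholds 0 and 1, so it cannot exclude the low-scoring false pick that the fine sweep removes.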