Full Interface

The API equivalent of the command line interface is shown in the notebook API Demo.ipynb.

GPSeer: software to infer missing data in sparsely sampled genotype-phenotype maps.

usage: gpseer [-h] {fetch-example,estimate-ml,cross-validate} ...

Sub-commands:

fetch-example

Fetch the example directory from GitHub.

gpseer fetch-example [-h] [--output-dir OUTPUT_DIR]

Named Arguments

--output-dir

Folder to download the contents to.

Default: “examples”

estimate-ml

estimate-ml: GPSeer’s maximum likelihood calculator. Predicts the maximum-likelihood estimates for missing phenotypes in a sparsely sampled genotype-phenotype map.

gpseer estimate-ml [-h] [--wildtype WILDTYPE] [--threshold THRESHOLD]
                   [--spline_order SPLINE_ORDER]
                   [--spline_smoothness SPLINE_SMOOTHNESS]
                   [--epistasis_order EPISTASIS_ORDER] [--alpha ALPHA]
                   [--overwrite] [--output_root OUTPUT_ROOT]
                   [--genotype_file GENOTYPE_FILE]
                   input_file

Positional Arguments

input_file: A CSV file containing the observed/measured genotype-phenotype map data.

Named Arguments

--wildtype: The reference/wildtype genotype. If this is not specified, GPSeer will use the first sequence in the input_file.
--threshold: The minimum quantitative phenotype value that is detectable or measurable. GPSeer will treat any phenotypes below this threshold as a separate class of data points.
--spline_order: The order of the spline used to estimate the nonlinearity in the genotype-phenotype map. (k in scipy.interpolate.UnivariateSpline)
--spline_smoothness: The ‘smoothness’ parameter used to smooth the spline when estimating the nonlinearity in a genotype-phenotype map (s in scipy.interpolate.UnivariateSpline). Increase this value if the software reports “Try raising the value of s when initializing the spline model”.

Default: 10
--epistasis_order: The order of epistasis to include in the linear, high-order epistasis model. We highly recommend leaving this value at 1, as the addition of pairwise and higher-order epistasis leads to overfitting and poor predictive power.

Default: 1
--alpha: Control parameter for Lasso regression. Multiplies the L1 term. See https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html for details. (In our experience, leaving this at the default works well.)

Default: 1
--overwrite: Overwrite existing output.

Default: False
--output_root: Root for all output files (e.g. {root}_predictions.csv, {root}_spline-fit.pdf, etc.). If none, this will be made from the input file name.
--genotype_file: A text file with a list of genotypes to predict given the input_file and epistasis model.
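
As a rough illustration of how the flags above combine, the following sketch assembles an estimate-ml command line in Python. The helper build_estimate_ml_command, the file name phenotypes.csv and the option values are hypothetical and chosen only for illustration; the flag names are taken from the usage text above. The sketch only builds the argument list and does not run GPSeer.

```python
import shlex


def build_estimate_ml_command(input_file, **options):
    """Assemble a `gpseer estimate-ml` argument list from keyword options.

    Hypothetical helper for illustration; it is not part of GPSeer.
    Option names must match the documented flags without the leading `--`.
    """
    allowed = {
        "wildtype", "threshold", "spline_order", "spline_smoothness",
        "epistasis_order", "alpha", "overwrite", "output_root",
        "genotype_file",
    }
    command = ["gpseer", "estimate-ml"]
    for name, value in options.items():
        if name not in allowed:
            raise ValueError(f"unknown estimate-ml option: {name}")
        if name == "overwrite":
            # --overwrite is a switch: it is either present or absent.
            if value:
                command.append("--overwrite")
            continue
        command.extend([f"--{name}", str(value)])
    # The positional input_file comes last, as in the usage line.
    command.append(input_file)
    return command


command = build_estimate_ml_command(
    "phenotypes.csv",        # hypothetical input file name
    threshold=1.0,           # illustrative value only
    spline_order=2,          # illustrative value only
    spline_smoothness=100,   # raised from the default of 10
    overwrite=True,
)
print(shlex.join(command))
```

On a machine where GPSeer is installed, the resulting list could be passed to subprocess.run(command, check=True). Building a list rather than a single string avoids shell quoting problems when file names contain spaces.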

cross-validate

cross-validate: Estimate the predictive power of a given model by generating multiple samples of the training and test subsets of the data and then calculating a training and a test Pearson coefficient for each sample.

gpseer cross-validate [-h] [--wildtype WILDTYPE] [--threshold THRESHOLD]
                      [--spline_order SPLINE_ORDER]
                      [--spline_smoothness SPLINE_SMOOTHNESS]
                      [--epistasis_order EPISTASIS_ORDER] [--alpha ALPHA]
                      [--overwrite] [--n_samples N_SAMPLES]
                      [--output_root OUTPUT_ROOT]
                      [--train_fraction TRAIN_FRACTION]
                      input_file

Positional Arguments

input_file: A CSV file containing the observed/measured genotype-phenotype map data.

Named Arguments

--wildtype: The reference/wildtype genotype. If this is not specified, GPSeer will use the first sequence in the input_file.
--threshold: The minimum quantitative phenotype value that is detectable or measurable. GPSeer will treat any phenotypes below this threshold as a separate class of data points.
--spline_order: The order of the spline used to estimate the nonlinearity in the genotype-phenotype map. (k in scipy.interpolate.UnivariateSpline)
--spline_smoothness: The ‘smoothness’ parameter used to smooth the spline when estimating the nonlinearity in a genotype-phenotype map (s in scipy.interpolate.UnivariateSpline). Increase this value if the software reports “Try raising the value of s when initializing the spline model”.

Default: 10
--epistasis_order: The order of epistasis to include in the linear, high-order epistasis model. We highly recommend leaving this value at 1, as the addition of pairwise and higher-order epistasis leads to overfitting and poor predictive power.

Default: 1
--alpha: Control parameter for Lasso regression. Multiplies the L1 term. See https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html for details. (In our experience, leaving this at the default works well.)

Default: 1
--overwrite: Overwrite existing output.

Default: False
--n_samples: The number of training/test samples to generate for cross-validation.

Default: 100
--output_root: Root for all output files (e.g. {root}_predictions.csv, {root}_spline-fit.pdf, etc.). If none, this will be made from the input file name.
--train_fraction: Fraction of data to include in training set.

Default: 0.8
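
The procedure described for cross-validate (repeatedly split the data, fit on the training subset, then score both subsets with a Pearson coefficient) can be sketched as below. This is a minimal illustration under stated assumptions, not GPSeer's implementation: a simple least-squares line stands in for the spline and epistasis model, the toy data are synthetic, and the function names are hypothetical. Only the n_samples and train_fraction defaults mirror the documented ones.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def fit_line(xs, ys):
    """Least-squares line: a stand-in for the real model (assumption)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept


def cross_validate(xs, ys, n_samples=100, train_fraction=0.8, seed=0):
    """Return one (train_r, test_r) pair per random train/test split."""
    rng = random.Random(seed)
    indices = list(range(len(xs)))
    n_train = int(round(train_fraction * len(indices)))
    scores = []
    for _ in range(n_samples):
        rng.shuffle(indices)
        train, test = indices[:n_train], indices[n_train:]
        # Fit on the training subset only.
        model = fit_line([xs[i] for i in train], [ys[i] for i in train])
        # Score observed vs. predicted values on each subset.
        train_r = pearson([ys[i] for i in train], [model(xs[i]) for i in train])
        test_r = pearson([ys[i] for i in test], [model(xs[i]) for i in test])
        scores.append((train_r, test_r))
    return scores


# Synthetic toy data: a noisy linear relationship (illustration only).
data_rng = random.Random(1)
xs = [float(i) for i in range(50)]
ys = [2.0 * x + data_rng.gauss(0.0, 5.0) for x in xs]

scores = cross_validate(xs, ys)
mean_test_r = sum(test_r for _, test_r in scores) / len(scores)
print(f"{len(scores)} samples, mean test Pearson r = {mean_test_r:.3f}")
```

A test coefficient that is consistently much lower than the training coefficient suggests overfitting, which is the pattern the epistasis_order recommendation above warns about.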
