Full Interface

The API equivalent of the command-line interface is shown in the API Demo.ipynb notebook.

GPSeer: software to infer missing data in sparsely sampled genotype-phenotype maps.

usage: gpseer [-h] {fetch-example,estimate-ml,cross-validate} ...

Sub-commands:

fetch-example

Fetch the example directory from GitHub.

gpseer fetch-example [-h] [--output-dir OUTPUT_DIR]

Named Arguments

--output-dir

Folder to download the contents to.

Default: “examples”

estimate-ml

estimate-ml: GPSeer’s maximum likelihood calculator. Predicts the maximum-likelihood estimates for missing phenotypes in a sparsely sampled genotype-phenotype map.

gpseer estimate-ml [-h] [--wildtype WILDTYPE] [--threshold THRESHOLD]
                   [--spline_order SPLINE_ORDER]
                   [--spline_smoothness SPLINE_SMOOTHNESS]
                   [--epistasis_order EPISTASIS_ORDER] [--alpha ALPHA]
                   [--overwrite] [--output_root OUTPUT_ROOT]
                   [--genotype_file GENOTYPE_FILE]
                   input_file

Positional Arguments

input_file

A CSV file containing the observed/measured genotype-phenotype map data.

Named Arguments

--wildtype

The reference/wildtype genotype. If this is not specified, GPSeer will use the first sequence in the input_file.

--threshold

The minimum quantitative phenotype value that is detectable or measurable. GPSeer will treat any phenotypes below this threshold as a separate class of data-points.
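One way to picture the threshold is as a partition of the observed phenotypes into two classes. The sketch below is not GPSeer's implementation: the helper name `split_by_threshold` and the measurements are made up for illustration, and it assumes that a value exactly equal to the threshold counts as detectable.

```python
def split_by_threshold(phenotypes, threshold):
    """Partition genotype -> phenotype measurements around a detection threshold.

    Hypothetical helper for illustration only; not part of GPSeer.
    Assumption: values equal to the threshold are treated as detectable.
    """
    detectable = {g: p for g, p in phenotypes.items() if p >= threshold}
    below = {g: p for g, p in phenotypes.items() if p < threshold}
    return detectable, below


# Made-up measurements for four two-site genotypes.
observed = {"AA": 1.20, "AB": 0.05, "BA": 0.90, "BB": 0.02}
detectable, below = split_by_threshold(observed, threshold=0.1)

print(sorted(detectable))  # genotypes at or above the threshold
print(sorted(below))       # genotypes below the threshold
```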

--spline_order

The order of the spline used to estimate the nonlinearity in the genotype-phenotype map. (k in scipy.interpolate.UnivariateSpline)

--spline_smoothness

The ‘smoothness’ parameter used to smooth the spline when estimating the nonlinearity in a genotype-phenotype map (s in scipy.interpolate.UnivariateSpline). Increase this value if the software reports “Try raising the value of s when initializing the spline model”.

Default: 10
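To see what the two spline arguments control, the sketch below calls scipy.interpolate.UnivariateSpline directly on synthetic data. It only illustrates the meaning of `k` and `s`; the data are made up, and how GPSeer prepares its inputs before fitting the spline is not shown here.

```python
import numpy as np
from scipy.interpolate import UnivariateSpline

# Synthetic data: a saturating curve plus noise (made up for illustration).
rng = np.random.default_rng(0)
x = np.linspace(0.0, 5.0, 50)  # x must be increasing for UnivariateSpline
y = 1.0 - np.exp(-x) + rng.normal(scale=0.05, size=x.size)

# k is the degree of the spline (1 <= k <= 5); s is the smoothing factor.
# s=0 forces the spline through every point; a larger s allows a smoother
# curve with fewer knots, at the cost of a larger residual.
interpolating = UnivariateSpline(x, y, k=3, s=0)
smoothed = UnivariateSpline(x, y, k=3, s=10)

# Sum of squared residuals and number of knots for each fit.
print(interpolating.get_residual(), len(interpolating.get_knots()))
print(smoothed.get_residual(), len(smoothed.get_knots()))
```

Raising `s` loosens the fit, which is why it is the suggested remedy when the spline fails to initialize.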

--epistasis_order

The order of epistasis to include in the linear, high-order epistasis model. We strongly recommend leaving this value at 1, as the addition of pairwise and higher-order epistasis leads to overfitting and poor predictive power.

Default: 1

--alpha

Control parameter for Lasso regression; multiplies the L1 penalty term. See https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html for details. In our experience, leaving this at the default works well.

Default: 1

--overwrite

Overwrite existing output.

Default: False

--output_root

Root for all output files (e.g. {root}_predictions.csv, {root}_spline-fit.pdf, etc.). If none, this will be made from the input file name.

--genotype_file

A text file with a list of genotypes to predict given the input_file and epistasis model.
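When calling estimate-ml from a script, it can help to assemble the argument list programmatically. The sketch below uses only flags that appear in the usage text above; the helper name `build_estimate_ml_command` and the file names are hypothetical.

```python
import shlex


def build_estimate_ml_command(input_file, output_root=None, threshold=None,
                              spline_order=None, spline_smoothness=None,
                              overwrite=False):
    """Assemble an argument list for `gpseer estimate-ml`.

    Hypothetical helper; it only uses flags listed in the usage text.
    """
    cmd = ["gpseer", "estimate-ml"]
    optional = {
        "--output_root": output_root,
        "--threshold": threshold,
        "--spline_order": spline_order,
        "--spline_smoothness": spline_smoothness,
    }
    for flag, value in optional.items():
        if value is not None:  # omitted flags fall back to GPSeer's defaults
            cmd += [flag, str(value)]
    if overwrite:
        cmd.append("--overwrite")
    cmd.append(input_file)  # positional argument goes last, as in the usage
    return cmd


# "measurements.csv" and "run1" are placeholder names.
cmd = build_estimate_ml_command("measurements.csv", output_root="run1",
                                threshold=0.1, overwrite=True)
print(shlex.join(cmd))
```

With GPSeer installed, the resulting list can be passed to `subprocess.run(cmd, check=True)`.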

cross-validate

cross-validate: Estimate the predictive power of a given model by generating multiple samples of training and test subsets of the data and then calculating a training and a test Pearson coefficient for each sample.

gpseer cross-validate [-h] [--wildtype WILDTYPE] [--threshold THRESHOLD]
                      [--spline_order SPLINE_ORDER]
                      [--spline_smoothness SPLINE_SMOOTHNESS]
                      [--epistasis_order EPISTASIS_ORDER] [--alpha ALPHA]
                      [--overwrite] [--n_samples N_SAMPLES]
                      [--output_root OUTPUT_ROOT]
                      [--train_fraction TRAIN_FRACTION]
                      input_file

Positional Arguments

input_file

A CSV file containing the observed/measured genotype-phenotype map data.

Named Arguments

--wildtype

The reference/wildtype genotype. If this is not specified, GPSeer will use the first sequence in the input_file.

--threshold

The minimum quantitative phenotype value that is detectable or measurable. GPSeer will treat any phenotypes below this threshold as a separate class of data-points.

--spline_order

The order of the spline used to estimate the nonlinearity in the genotype-phenotype map. (k in scipy.interpolate.UnivariateSpline)

--spline_smoothness

The ‘smoothness’ parameter used to smooth the spline when estimating the nonlinearity in a genotype-phenotype map (s in scipy.interpolate.UnivariateSpline). Increase this value if the software reports “Try raising the value of s when initializing the spline model”.

Default: 10

--epistasis_order

The order of epistasis to include in the linear, high-order epistasis model. We strongly recommend leaving this value at 1, as the addition of pairwise and higher-order epistasis leads to overfitting and poor predictive power.

Default: 1

--alpha

Control parameter for Lasso regression; multiplies the L1 penalty term. See https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html for details. In our experience, leaving this at the default works well.

Default: 1

--overwrite

Overwrite existing output.

Default: False

--n_samples

Number of training/test samples to generate.

Default: 100

--output_root

Root for all output files (e.g. {root}_predictions.csv, {root}_spline-fit.pdf, etc.). If none, this will be made from the input file name.

--train_fraction

Fraction of data to include in training set.

Default: 0.8
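The procedure described for cross-validate (repeated random training/test splits, each scored with a Pearson coefficient) can be sketched as follows. This is a simplified illustration, not GPSeer's code: a straight-line fit stands in for the spline and epistasis model, the function name `cross_validate` is hypothetical, and the data are synthetic. The `n_samples` and `train_fraction` parameters mirror the flags and defaults listed above.

```python
import numpy as np


def cross_validate(x, y, n_samples=100, train_fraction=0.8, seed=0):
    """Repeated random train/test splits, scored with Pearson's r.

    Hypothetical sketch: a straight-line fit stands in for the model.
    Returns an array of shape (n_samples, 2): (training r, test r) per split.
    """
    rng = np.random.default_rng(seed)
    n_train = int(round(train_fraction * len(x)))
    scores = []
    for _ in range(n_samples):
        order = rng.permutation(len(x))
        train, test = order[:n_train], order[n_train:]
        # Fit on the training subset only.
        slope, intercept = np.polyfit(x[train], y[train], deg=1)
        predicted = slope * x + intercept
        # Pearson correlation between predictions and observations.
        train_r = np.corrcoef(predicted[train], y[train])[0, 1]
        test_r = np.corrcoef(predicted[test], y[test])[0, 1]
        scores.append((train_r, test_r))
    return np.array(scores)


# Synthetic data, made up for illustration.
rng = np.random.default_rng(1)
x = np.linspace(0.0, 1.0, 50)
y = 2.0 * x + rng.normal(scale=0.1, size=x.size)

scores = cross_validate(x, y)
print(scores.mean(axis=0))  # mean training and test Pearson coefficients
```

A test coefficient that is much lower than the training coefficient across samples is the usual sign of overfitting.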