The API equivalent of the command-line interface is shown in the API Demo.ipynb notebook.
GPSeer: software to infer missing data in sparsely sampled genotype-phenotype maps.
usage: gpseer [-h] {fetch-example,estimate-ml,cross-validate} ...
fetch-example: Fetch the example directory from GitHub.
gpseer fetch-example [-h] [--output-dir OUTPUT_DIR]
--output-dir: Folder to download the contents to.
Default: “examples”
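The fetch-example call above can also be scripted. The following is a minimal sketch, assuming the gpseer executable is installed and on PATH; the helper names fetch_example_command and run_if_installed are illustrative and are not part of GPSeer.

```python
import shlex
import shutil
import subprocess


def fetch_example_command(output_dir="examples"):
    """Build the argument list for `gpseer fetch-example`."""
    return ["gpseer", "fetch-example", "--output-dir", output_dir]


def run_if_installed(cmd):
    """Run cmd only if its executable is on PATH.

    Returns the process exit code, or None when the executable is missing.
    """
    if shutil.which(cmd[0]) is None:
        return None
    return subprocess.run(cmd, check=False).returncode


cmd = fetch_example_command()
# Show the shell-quoted command without executing it.
print(shlex.join(cmd))
```

Calling run_if_installed(cmd) would then perform the download into the chosen folder.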
estimate-ml: GPSeer’s maximum likelihood calculator. Predicts the maximum-likelihood estimates for missing phenotypes in a sparsely sampled genotype-phenotype map.
gpseer estimate-ml [-h] [--wildtype WILDTYPE] [--threshold THRESHOLD] [--spline_order SPLINE_ORDER] [--spline_smoothness SPLINE_SMOOTHNESS] [--epistasis_order EPISTASIS_ORDER] [--alpha ALPHA] [--overwrite] [--output_root OUTPUT_ROOT] [--genotype_file GENOTYPE_FILE] input_file
input_file: A CSV file containing the observed/measured genotype-phenotype map data.
--wildtype: The reference/wildtype genotype. If this is not specified, GPSeer will use the first sequence in the input_file.
--threshold: The minimum quantitative phenotype value that is detectable or measurable. GPSeer will treat any phenotypes below this threshold as a separate class of data points.
--spline_order: The order of the spline used to estimate the nonlinearity in the genotype-phenotype map (k in scipy.interpolate.UnivariateSpline).
--spline_smoothness: The ‘smoothness’ parameter used to smooth the spline when estimating the nonlinearity in a genotype-phenotype map (s in scipy.interpolate.UnivariateSpline). Increase this value if the software throws “Try raising the value of s when initializing the spline model”.
Default: 10
--epistasis_order: The order of epistasis to include in the linear, high-order epistasis model. We highly recommend leaving this value at 1, as the addition of pairwise and higher-order epistasis leads to overfitting and poor predictive power.
Default: 1
--alpha: Control parameter for Lasso regression; it multiplies the L1 term. See https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html for details. In our experience, leaving this at its default works well.
--overwrite: Overwrite existing output.
Default: False
--output_root: Root for all output files (e.g. {root}_predictions.csv, {root}_spline-fit.pdf, etc.). If none is given, the root is derived from the input file name.
--genotype_file: A text file with a list of genotypes to predict given the input_file and epistasis model.
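The estimate-ml options listed above can be assembled into a command line programmatically. The following is a minimal sketch using only the flags shown in the usage line. The helper estimate_ml_command is illustrative and not part of GPSeer, and the input file name phenotypes.csv, the threshold of 1.0, and the output root run1 are hypothetical values.

```python
import shlex


def estimate_ml_command(input_file, **options):
    """Build the argument list for `gpseer estimate-ml`.

    Keyword names map directly onto the documented flags, for example
    spline_order=2 becomes `--spline_order 2`. A value of True adds a bare
    switch (as for --overwrite); None or False omits the flag entirely.
    """
    cmd = ["gpseer", "estimate-ml"]
    for name, value in options.items():
        flag = f"--{name}"
        if value is True:
            cmd.append(flag)
        elif value is not None and value is not False:
            cmd.extend([flag, str(value)])
    cmd.append(input_file)  # the positional argument goes last
    return cmd


cmd = estimate_ml_command(
    "phenotypes.csv",        # hypothetical input CSV
    threshold=1.0,           # hypothetical detection limit
    spline_smoothness=10,    # documented default
    epistasis_order=1,       # documented default and recommended value
    overwrite=True,
    output_root="run1",      # hypothetical output root
)
# Show the shell-quoted command without executing it.
print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.run once gpseer is installed. Building a list rather than a single string avoids quoting problems with file names that contain spaces.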
cross-validate: Estimate the predictive power of a given model by generating multiple samples of training and test subsets of the data, then calculating a training and a test Pearson coefficient for each sample.
gpseer cross-validate [-h] [--wildtype WILDTYPE] [--threshold THRESHOLD] [--spline_order SPLINE_ORDER] [--spline_smoothness SPLINE_SMOOTHNESS] [--epistasis_order EPISTASIS_ORDER] [--alpha ALPHA] [--overwrite] [--n_samples N_SAMPLES] [--output_root OUTPUT_ROOT] [--train_fraction TRAIN_FRACTION] input_file
input_file: A CSV file containing the observed/measured genotype-phenotype map data.
--n_samples: Number of training/test samples to generate.
Default: 100
--output_root: Root for all output files (e.g. {root}_predictions.csv, {root}_spline-fit.pdf, etc.). If none is given, the root is derived from the input file name.
--train_fraction: Fraction of data to include in the training set.
Default: 0.8
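The resampling procedure that cross-validate describes can be illustrated in pure Python. The following is a conceptual sketch only, not GPSeer's implementation. It substitutes an ordinary least-squares line for the spline-plus-epistasis model and runs on synthetic data. The function names, the seed arguments, and the data are all assumptions made for illustration.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def fit_line(xs, ys):
    """Least-squares fit of y = a*x + b (a stand-in for the real model)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    b = my - a * mx
    return lambda x: a * x + b


def cross_validate(xs, ys, n_samples=100, train_fraction=0.8, seed=0):
    """Return a list of (train_r, test_r) pairs, one per random split."""
    rng = random.Random(seed)
    indices = list(range(len(xs)))
    n_train = int(round(train_fraction * len(xs)))
    scores = []
    for _ in range(n_samples):
        rng.shuffle(indices)
        train, test = indices[:n_train], indices[n_train:]
        # Fit on the training subset only.
        model = fit_line([xs[i] for i in train], [ys[i] for i in train])
        # Correlate predictions with observations on each subset.
        train_r = pearson([model(xs[i]) for i in train], [ys[i] for i in train])
        test_r = pearson([model(xs[i]) for i in test], [ys[i] for i in test])
        scores.append((train_r, test_r))
    return scores


# Synthetic data: a noisy straight line.
data_rng = random.Random(1)
xs = [i / 10 for i in range(50)]
ys = [2.0 * x + 1.0 + data_rng.gauss(0, 0.5) for x in xs]

scores = cross_validate(xs, ys)
mean_test_r = sum(test_r for _, test_r in scores) / len(scores)
print(f"{len(scores)} splits, mean test Pearson r = {mean_test_r:.3f}")
```

A test coefficient that sits well below the training coefficient across samples is the usual sign of overfitting. That is the concern raised in the recommendation to keep --epistasis_order at 1.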