scikit-genome package¶
Module skgenome contents¶
Tabular file I/O (tabio)¶
tabio¶
I/O for tabular formats of genomic data (regions or features).
-
skgenome.tabio.read(infile, fmt='tab', into=None, sample_id=None, meta=None, **kwargs)[source]¶ Read tabular data from a file or stream into a genome object.
Supported formats: see READERS
If a format supports multiple samples, return the sample specified by sample_id, or if unspecified, return the first sample and warn if there were other samples present in the file.
- Parameters
infile (handle or string) – Filename or opened file-like object to read.
fmt (string) – File format.
into (class) – GenomicArray class or subclass to instantiate, overriding the default for the target file format.
sample_id (string) – Sample identifier.
meta (dict) – Metadata, as arbitrary key-value pairs.
**kwargs – Additional keyword arguments to the format-specific reader function.
- Returns
The data from the given file instantiated as into, if specified, or the default base class for the given file format (usually GenomicArray).
- Return type
GenomicArray or subclass
-
skgenome.tabio.read_auto(infile)[source]¶ Auto-detect a file’s format and use an appropriate parser to read it.
-
skgenome.tabio.safe_write(outfile, verbose=True)[source]¶ Write to a filename or file-like object with error handling.
If given a file name, open it. If the path includes directories that don’t exist yet, create them. If given a file-like object, just pass it through.
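The behavior described above can be sketched with the standard library alone. This is a minimal illustration, not the library's implementation: the name `open_for_write` is hypothetical, and the context-manager interface and the message printed when `verbose` is set are assumptions.

```python
import contextlib
import os
import sys


@contextlib.contextmanager
def open_for_write(outfile, verbose=True):
    """Hypothetical stand-in for the behavior described above."""
    if isinstance(outfile, (str, os.PathLike)):
        dirname = os.path.dirname(os.fspath(outfile))
        if dirname:
            # Create any missing parent directories before opening the file.
            os.makedirs(dirname, exist_ok=True)
        with open(outfile, "w") as handle:
            yield handle
        if verbose:
            # Assumption: report the written path on stderr.
            print("Wrote", outfile, file=sys.stderr)
    else:
        # Already a file-like object: pass it through and leave it open.
        yield outfile
```

A caller would then write `with open_for_write("results/sample.tsv") as handle: ...` without first checking whether `results/` exists.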
Base class: GenomicArray¶
The base class of the core objects used throughout CNVkit and scikit-genome is
GenomicArray. It wraps a pandas DataFrame
instance, which is accessible through the .data attribute and can be used
for any manipulations that aren’t already provided by methods in the wrapper
class.
gary¶
Base class for an array of annotated genomic regions.
-
class
skgenome.gary.GenomicArray(data_table, meta_dict=None)[source]¶ Bases: object
An array of genomic intervals. Base class for genomic data structures.
Can represent most BED-like tabular formats with arbitrary additional columns.
-
add(other)[source]¶ Combine this array’s data with another GenomicArray (in-place).
Any optional columns must match between both arrays.
-
add_columns(**columns)[source]¶ Add the given columns to a copy of this GenomicArray.
- Parameters
**columns (array) – Keyword arguments where the key is the new column’s name and the value is an array of the same length as self which will be the new column’s values.
- Returns
A new instance of self with the given columns included in the underlying dataframe.
- Return type
GenomicArray or subclass
-
as_dataframe(dframe, reset_index=False)[source]¶ Wrap the given pandas DataFrame in this instance’s metadata.
-
by_arm(min_gap_size=100000.0, min_arm_bins=50)[source]¶ Iterate over bins grouped by chromosome arm (inferred).
-
by_ranges(other, mode='outer', keep_empty=True)[source]¶ Group rows by another GenomicArray’s bin coordinate ranges.
For example, this can be used to group SNVs by CNV segments.
Bins in this array that fall outside the other array’s bins are skipped.
- Parameters
other (GenomicArray) – Another GA instance.
mode (string) –
Determines what to do with bins that overlap a boundary of the selection. Possible values are:
inner: Drop the bins on the selection boundary, don’t emit them.
outer: Keep/emit those bins as they are.
trim: Emit those bins but alter their boundaries to match the selection; the bin start or end position is replaced with the selection boundary position.
keep_empty (bool) – Whether to also yield other bins with no overlapping bins in self, or to skip them when iterating.
- Yields
tuple – (other bin, GenomicArray of overlapping rows in self)
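The grouping and the effect of `keep_empty` can be illustrated on plain tuples. This sketch is an assumption-laden stand-in: the function name `group_by_ranges` is hypothetical, only `outer`-style overlap is shown, and coordinates are assumed to be 0-based and half-open. The real method works on GenomicArray objects.

```python
def group_by_ranges(bins, ranges, keep_empty=True):
    """Yield (range, overlapping bins) pairs using 'outer'-style overlap.

    Both inputs are lists of (chromosome, start, end) tuples with
    0-based, half-open coordinates (an assumption of this sketch).
    """
    for chrom, r_start, r_end in ranges:
        hits = [
            (c, s, e)
            for c, s, e in bins
            if c == chrom and s < r_end and e > r_start
        ]
        # Ranges with no overlapping bins are emitted only if keep_empty.
        if hits or keep_empty:
            yield (chrom, r_start, r_end), hits


snvs = [("chr1", 150, 151), ("chr1", 900, 901), ("chr2", 10, 11)]
segments = [("chr1", 100, 500), ("chr1", 500, 800), ("chr2", 0, 50)]
grouped = list(group_by_ranges(snvs, segments, keep_empty=False))
```

Here the SNV at `chr1:900` lies outside every segment and is skipped, and the empty segment `chr1:500-800` is omitted because `keep_empty=False`.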
-
property
chromosome¶
-
concat(others)[source]¶ Concatenate several GenomicArrays, keeping this array’s metadata.
This array’s data table is not implicitly included in the result.
-
coords(also=())[source]¶ Iterate over plain coordinates of each bin: chromosome, start, end.
- Parameters
also (str, or iterable of strings) – Also include these columns from self, in addition to chromosome, start, and end.
Example, yielding rows in BED format:
>>> probes.coords(also=["gene", "strand"])
-
drop_extra_columns()[source]¶ Remove any optional columns from this GenomicArray.
- Returns
A new copy with only the minimal set of columns required by the class (e.g. chromosome, start, end for GenomicArray; may be more for subclasses).
- Return type
GenomicArray or subclass
-
property
end¶
-
filter(func=None, **kwargs)[source]¶ Take a subset of rows where the given condition is true.
- Parameters
func (callable) – A boolean function which will be applied to each row to keep rows where the result is True.
**kwargs (string) – Keyword arguments like chromosome="chr7" or gene="Antitarget", which will keep rows where the keyed field equals the specified value.
- Returns
Subset of self where the specified condition is True.
- Return type
GenomicArray or subclass
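The two ways of selecting rows can be sketched on a list of dicts. The name `filter_rows` is hypothetical, and combining `func` with the keyword conditions by logical AND is an assumption of this sketch rather than something the description above states.

```python
def filter_rows(rows, func=None, **kwargs):
    """Keep rows (dicts) that pass func and match every keyword exactly."""
    kept = []
    for row in rows:
        if func is not None and not func(row):
            continue
        # Each keyword names a field that must equal the given value.
        if all(row[key] == value for key, value in kwargs.items()):
            kept.append(row)
    return kept


rows = [
    {"chromosome": "chr7", "start": 0, "end": 100, "gene": "EGFR"},
    {"chromosome": "chr7", "start": 100, "end": 300, "gene": "Antitarget"},
    {"chromosome": "chr8", "start": 0, "end": 50, "gene": "MYC"},
]
on_chr7 = filter_rows(rows, chromosome="chr7")
wide = filter_rows(rows, func=lambda r: r["end"] - r["start"] >= 100)
```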
-
classmethod
from_columns(columns, meta_dict=None)[source]¶ Create a new instance from column arrays, given as a dict.
-
classmethod
from_rows(rows, columns=None, meta_dict=None)[source]¶ Create a new instance from a list of rows, as tuples or arrays.
-
in_range(chrom=None, start=None, end=None, mode='outer')[source]¶ Get the GenomicArray portion within the given genomic range.
- Parameters
chrom (str or None) – Chromosome name to select. Use None if self has only one chromosome.
start (int or None) – Start coordinate of range to select, in 0-based coordinates. If None, start from 0.
end (int or None) – End coordinate of range to select. If None, select to the end of the chromosome.
mode (str) – As in by_ranges:
outer includes bins straddling the range boundaries, trim additionally alters the straddling bins’ endpoints to match the range boundaries, and inner excludes those bins.
- Returns
The subset of self enclosed by the specified range.
- Return type
GenomicArray or subclass
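The difference between the three `mode` values is easiest to see on a small example. The following sketch uses plain `(start, end)` tuples on a single chromosome; `select_in_range` is a hypothetical name, and half-open coordinates are assumed.

```python
def select_in_range(bins, start, end, mode="outer"):
    """Select (start, end) bins on one chromosome overlapping [start, end)."""
    if mode not in ("inner", "outer", "trim"):
        raise ValueError(f"Unknown mode: {mode!r}")
    selected = []
    for b_start, b_end in bins:
        if b_end <= start or b_start >= end:
            continue  # No overlap with the requested range
        straddles = b_start < start or b_end > end
        if straddles and mode == "inner":
            continue  # Drop bins that cross a range boundary
        if straddles and mode == "trim":
            # Clip the bin to the range boundaries
            b_start, b_end = max(b_start, start), min(b_end, end)
        selected.append((b_start, b_end))
    return selected


bins = [(0, 100), (100, 200), (200, 300), (300, 400)]
outer = select_in_range(bins, 150, 350, mode="outer")
inner = select_in_range(bins, 150, 350, mode="inner")
trim = select_in_range(bins, 150, 350, mode="trim")
```

With the range 150-350, `outer` keeps the three overlapping bins unchanged, `inner` keeps only `(200, 300)`, and `trim` clips the first and last bins to `(150, 200)` and `(300, 350)`.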
-
in_ranges(chrom=None, starts=None, ends=None, mode='outer')[source]¶ Get the GenomicArray portion within the specified ranges.
Similar to in_range, but concatenating the selections of all the regions specified by the starts and ends arrays.
- Parameters
chrom (str or None) – Chromosome name to select. Use None if self has only one chromosome.
starts (int array, or None) – Start coordinates of ranges to select, in 0-based coordinates. If None, start from 0.
ends (int array, or None) – End coordinates of ranges to select. If None, select to the end of the chromosome. If starts and ends are both specified, they must be arrays of equal length.
mode (str) – As in by_ranges:
outer includes bins straddling the range boundaries, trim additionally alters the straddling bins’ endpoints to match the range boundaries, and inner excludes those bins.
- Returns
Concatenation of all the subsets of self enclosed by the specified ranges.
- Return type
-
intersection(other, mode='outer')[source]¶ Select the bins in self that overlap the regions in other.
The extra fields of self, but not other, are retained in the output.
-
into_ranges(other, column, default, summary_func=None)[source]¶ Re-bin values from column into the corresponding ranges in other.
Match overlapping/intersecting rows from other to each row in self. Then, within each range in other, extract the value(s) from column in self, using the function summary_func to produce a single value if multiple bins in self map to a single range in other.
For example, group SNVs (self) by CNV segments (other) and calculate the median (summary_func) of each SNV group’s allele frequencies.
- Parameters
other (GenomicArray) – Ranges into which the overlapping values of self will be summarized.
column (string) – Column name in self to extract values from.
default – Value to assign to indices in other that do not overlap any bins in self. Type should be the same as or compatible with the output field specified by column, or the output of summary_func.
summary_func (callable, dict of string-to-callable, or None) –
Specify how to reduce 1 or more rows of self into a single value for the corresponding row in other.
If callable, apply it to the column field of each group of rows in self.
If a single-element dict of column name to callable, apply the callable to that field in self instead of column.
If None, use an appropriate summarizing function for the datatype of column in self (e.g. median of numbers, concatenation of strings).
If some other value, assign that value to other wherever there is an overlap.
- Returns
The extracted and summarized values from self corresponding to other’s genomic ranges, the same length as other.
- Return type
pd.Series
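The SNV-by-segment example above can be sketched without pandas. The name `summarize_into_ranges` is hypothetical, only the callable form of `summary_func` is covered, and the result is a plain list rather than a `pd.Series`. Coordinates are assumed to be 0-based and half-open.

```python
from statistics import median


def summarize_into_ranges(source, dest, default, summary_func=median):
    """Return one summarized value per dest range, in dest order.

    source: list of (chromosome, start, end, value) rows
    dest:   list of (chromosome, start, end) ranges
    """
    out = []
    for chrom, d_start, d_end in dest:
        values = [
            value
            for c, s, e, value in source
            if c == chrom and s < d_end and e > d_start
        ]
        # Ranges with no overlapping source rows receive the default.
        out.append(summary_func(values) if values else default)
    return out


snvs = [
    ("chr1", 120, 121, 0.48),
    ("chr1", 300, 301, 0.52),
    ("chr1", 450, 451, 0.25),
    ("chr1", 700, 701, 0.9),
]
segments = [("chr1", 100, 500), ("chr1", 500, 1000), ("chr2", 0, 1000)]
freqs = summarize_into_ranges(snvs, segments, default=float("nan"))
```

The output has the same length as `segments`: the first segment gets the median of its three allele frequencies, the second gets its single value, and the third gets the default.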
-
iter_ranges_of(other, column, mode='outer', keep_empty=True)[source]¶ Group rows by another GenomicArray’s bin coordinate ranges.
For example, this can be used to group SNVs by CNV segments.
Bins in this array that fall outside the other array’s bins are skipped.
- Parameters
other (GenomicArray) – Another GA instance.
column (string) – Column name in self to extract values from.
mode (string) –
Determines what to do with bins that overlap a boundary of the selection. Possible values are:
inner: Drop the bins on the selection boundary, don’t emit them.
outer: Keep/emit those bins as they are.
trim: Emit those bins but alter their boundaries to match the selection; the bin start or end position is replaced with the selection boundary position.
keep_empty (bool) – Whether to also yield other bins with no overlapping bins in self, or to skip them when iterating.
- Yields
tuple – (other bin, GenomicArray of overlapping rows in self)
-
merge(bp=0, stranded=False, combine=None)[source]¶ Merge adjacent or overlapping regions into single rows.
Similar to ‘bedtools merge’.
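A minimal sketch of the merging idea follows. It assumes that `bp` is the maximum gap allowed between regions that are still merged (as with the `-d` option of `bedtools merge`). The name `merge_regions` is hypothetical, and the `stranded` and `combine` options are not modeled.

```python
def merge_regions(regions, bp=0):
    """Merge (chromosome, start, end) regions that overlap or lie within bp bases."""
    merged = []
    # Plain tuple sorting orders chromosomes lexicographically, which is
    # enough here because only neighbors on the same chromosome are merged.
    for chrom, start, end in sorted(regions):
        if merged and merged[-1][0] == chrom and start <= merged[-1][2] + bp:
            # Close enough to the previous region: extend it.
            prev = merged[-1]
            merged[-1] = (chrom, prev[1], max(prev[2], end))
        else:
            merged.append((chrom, start, end))
    return merged


regions = [("chr1", 0, 100), ("chr1", 100, 200), ("chr1", 250, 300), ("chr2", 0, 50)]
adjacent_only = merge_regions(regions)
within_50 = merge_regions(regions, bp=50)
```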
-
resize_ranges(bp, chrom_sizes=None)[source]¶ Resize each genomic bin by a fixed number of bases at each end.
Bin ‘start’ values have a minimum of 0, and chrom_sizes can specify each chromosome’s maximum ‘end’ value.
Similar to ‘bedtools slop’.
- Parameters
bp (int) – Number of bases in each direction to expand or shrink each bin. Applies to ‘start’ and ‘end’ values symmetrically, and may be positive (expand) or negative (shrink).
chrom_sizes (dict of string-to-int) – Chromosome name to length in base pairs. If given, all chromosomes in self must be included.
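The clamping rules can be sketched on plain tuples. The name `resize_regions` is hypothetical. Discarding bins that shrink to zero or negative width is an assumption of this sketch, since the description above does not say how such bins are handled.

```python
def resize_regions(regions, bp, chrom_sizes=None):
    """Expand (bp > 0) or shrink (bp < 0) each (chromosome, start, end) region."""
    resized = []
    for chrom, start, end in regions:
        new_start = max(0, start - bp)  # 'start' never drops below 0
        new_end = end + bp
        if chrom_sizes is not None:
            # Every chromosome must be present; a missing one raises KeyError.
            new_end = min(new_end, chrom_sizes[chrom])
        if new_end > new_start:
            # Assumption: bins shrunk to nothing are dropped.
            resized.append((chrom, new_start, new_end))
    return resized


regions = [("chr1", 50, 150), ("chr1", 900, 1000)]
expanded = resize_regions(regions, 100, chrom_sizes={"chr1": 1050})
shrunk = resize_regions(regions, -20)
```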
-
property
sample_id¶
-
property
start¶
-
Genomic interval arithmetic¶
intersect¶
DataFrame-level intersection operations.
Calculate overlapping regions, similar to bedtools intersect.
The functions here operate on pandas DataFrame and Series instances, not GenomicArray types.
-
skgenome.intersect.by_ranges(table, other, mode, keep_empty)[source]¶ Group rows by another GenomicArray’s bin coordinate ranges.
-
skgenome.intersect.into_ranges(source, dest, src_col, default, summary_func)[source]¶ Group a column in source by regions in dest and summarize.
-
skgenome.intersect.iter_ranges(table, chrom, starts, ends, mode)[source]¶ Iterate through sub-ranges.
-
skgenome.intersect.iter_slices(table, other, mode, keep_empty)[source]¶ Yields indices to extract ranges from table.
Returns an iterable of integer arrays that can apply to Series objects, i.e. columns of table. These indices are of the DataFrame/Series’ Index, not array coordinates – so be sure to use DataFrame.loc, Series.loc, or Series getitem, as opposed to .iloc or indexing directly into Numpy arrays.
merge¶
DataFrame-level merging operations.
Merge overlapping regions into single rows, similar to bedtools merge.
The functions here operate on pandas DataFrame and Series instances, not GenomicArray types.
subdivide¶
DataFrame-level subdivide operation.
Split each region into similar-sized sub-regions.
The functions here operate on pandas DataFrame and Series instances, not GenomicArray types.
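One way to split a region into similar-sized sub-regions is shown below. The function name `subdivide_region` and its `avg_size` parameter are assumptions made for illustration. The module's actual function names and parameters are not listed in this section.

```python
def subdivide_region(start, end, avg_size):
    """Split [start, end) into near-equal pieces of roughly avg_size bases."""
    length = end - start
    # At least one piece, even when the region is shorter than avg_size.
    n_pieces = max(1, round(length / avg_size))
    # Evenly spaced integer boundaries; the last boundary is always `end`.
    edges = [start + (length * i) // n_pieces for i in range(n_pieces + 1)]
    return list(zip(edges[:-1], edges[1:]))


pieces = subdivide_region(0, 1000, avg_size=300)
small = subdivide_region(100, 150, avg_size=300)
```

Rounding the piece count, rather than cutting fixed-width windows, avoids leaving a short remainder bin at the end of the region.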
Helper modules¶
chromsort¶
Operations on chromosome/contig/sequence names.
-
skgenome.chromsort.detect_big_chroms(sizes)[source]¶ Determine the number of “big” chromosomes from their lengths.
In the human genome, this returns 24, where the canonical chromosomes 1-22, X, and Y are considered “big”, while mitochondria and the alternative contigs are not. This allows us to exclude the non-canonical chromosomes from an analysis where they’re not relevant.
- Returns
n_big (int) – Number of “big” chromosomes in the genome.
thresh (int) – Length of the smallest “big” chromosome.
combiners¶
Combiner functions for Python list-like input.
-
skgenome.combiners.get_combiners(table, stranded=False, combine=None)[source]¶ Get a combine lookup suitable for table.
- Parameters
table (DataFrame) –
stranded (bool) –
combine (dict or None) – Column names to their value-combining functions, replacing or in addition to the defaults.
- Returns
Column names to their value-combining functions.
- Return type
dict
rangelabel¶
Handle text genomic ranges as named tuples.
A range specification should look like chromosome:start-end, e.g.
chr1:1234-5678, with 1-indexed integer coordinates. We also allow
chr1:1234- or chr1:-5678, where missing start becomes 0 and missing end
becomes None.
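The range syntax described above can be parsed with a short regular expression. In this sketch the name `parse_label` is hypothetical, gene labels are not handled, and the `Region` tuple only mirrors the documented field names. Converting the 1-indexed start to a 0-based start is an assumption, suggested by the missing start becoming 0.

```python
import re
from collections import namedtuple

Region = namedtuple("Region", "chromosome start end")

# chromosome, then ':', then optional start, '-', optional end
LABEL_RE = re.compile(r"^(?P<chrom>[^:]+):(?P<start>\d*)-(?P<end>\d*)$")


def parse_label(text):
    """Parse 'chr1:1234-5678', 'chr1:1234-' or 'chr1:-5678' into a Region."""
    match = LABEL_RE.match(text)
    if match is None:
        raise ValueError(f"Invalid range specification: {text!r}")
    start, end = match["start"], match["end"]
    # Assumption: shift the 1-indexed start to a 0-based, half-open start.
    start = int(start) - 1 if start else 0
    end = int(end) if end else None
    return Region(match["chrom"], start, end)


full = parse_label("chr1:1234-5678")
open_end = parse_label("chr1:1234-")
open_start = parse_label("chr1:-5678")
```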
-
class
skgenome.rangelabel.NamedRegion(chromosome, start, end, gene)¶ Bases: tuple
-
property
chromosome¶ Alias for field number 0
-
property
end¶ Alias for field number 2
-
property
gene¶ Alias for field number 3
-
property
start¶ Alias for field number 1
-
class
skgenome.rangelabel.Region(chromosome, start, end)¶ Bases: tuple
-
property
chromosome¶ Alias for field number 0
-
property
end¶ Alias for field number 2
-
property
start¶ Alias for field number 1
-
skgenome.rangelabel.from_label(text, keep_gene=True)[source]¶ Parse a chromosomal range specification.
- Parameters
text (string) – Range specification, which should look like chr1:1234-5678 or chr1:1234- or chr1:-5678, where missing start becomes 0 and missing end becomes None.