The dnaio API¶
The open function¶
- dnaio.open(_file: str | ~os.PathLike | ~typing.BinaryIO, *, fileformat: str | None = None, interleaved: ~typing.Literal[False] = False, mode: ~typing.Literal['r'] = 'r', qualities: bool | None = None, opener=<function xopen>, compression_level: int = 1, open_threads: int = 0) SingleEndReader ¶
- dnaio.open(_file1: str | ~os.PathLike | ~typing.BinaryIO, _file2: str | ~os.PathLike | ~typing.BinaryIO, *, fileformat: str | None = None, interleaved: ~typing.Literal[False] = False, mode: ~typing.Literal['r'] = 'r', qualities: bool | None = None, opener=<function xopen>, compression_level: int = 1, open_threads: int = 0) PairedEndReader
- dnaio.open(_file: str | ~os.PathLike | ~typing.BinaryIO, *, interleaved: ~typing.Literal[True], fileformat: str | None = None, mode: ~typing.Literal['r'] = 'r', qualities: bool | None = None, opener=<function xopen>, compression_level: int = 1, open_threads: int = 0) PairedEndReader
- dnaio.open(_file1: str | ~os.PathLike | ~typing.BinaryIO, _file2: str | ~os.PathLike | ~typing.BinaryIO, _file3: str | ~os.PathLike | ~typing.BinaryIO, *files: str | ~os.PathLike | ~typing.BinaryIO, fileformat: str | None = None, mode: ~typing.Literal['r'] = 'r', qualities: bool | None = None, opener=<function xopen>, compression_level: int = 1, open_threads: int = 0) MultipleFileReader
- dnaio.open(_file: str | ~os.PathLike | ~typing.BinaryIO, *, mode: ~typing.Literal['w', 'a'], fileformat: str | None = None, interleaved: ~typing.Literal[False] = False, qualities: bool | None = None, opener=<function xopen>, compression_level: int = 1, open_threads: int = 0) SingleEndWriter
- dnaio.open(_file1: str | ~os.PathLike | ~typing.BinaryIO, _file2: str | ~os.PathLike | ~typing.BinaryIO, *, mode: ~typing.Literal['w', 'a'], fileformat: str | None = None, interleaved: ~typing.Literal[False] = False, qualities: bool | None = None, opener=<function xopen>, compression_level: int = 1, open_threads: int = 0) PairedEndWriter
- dnaio.open(_file: str | ~os.PathLike | ~typing.BinaryIO, *, mode: ~typing.Literal['w', 'a'], interleaved: ~typing.Literal[True], fileformat: str | None = None, qualities: bool | None = None, opener=<function xopen>, compression_level: int = 1, open_threads: int = 0) PairedEndWriter
- dnaio.open(_file1: str | ~os.PathLike | ~typing.BinaryIO, _file2: str | ~os.PathLike | ~typing.BinaryIO, _file3: str | ~os.PathLike | ~typing.BinaryIO, *files: str | ~os.PathLike | ~typing.BinaryIO, mode: ~typing.Literal['w', 'a'], fileformat: str | None = None, interleaved: ~typing.Literal[False] = False, qualities: bool | None = None, opener=<function xopen>, compression_level: int = 1, open_threads: int = 0) MultipleFileWriter
Open one or more FASTQ or FASTA files for reading or writing, or open one (unaligned) BAM file for reading.
- Parameters:
files – one or more Path or open file-like objects. One for single-end reads, two for paired-end reads etc. More than two files are also supported. At least one file is required.
file1 – Deprecated keyword argument for the first file.
file2 – Deprecated keyword argument for the second file.
mode – Set to 'r' for reading, 'w' for writing or 'a' for appending. For BAM files, only reading is supported.
interleaved – If True, then there must be only one file argument that contains interleaved paired-end data.
fileformat – If None, the file format is autodetected from the file name extension. Set to 'fasta', 'fastq' or 'bam' to disable autodetection.
qualities – When mode is 'w' and fileformat is None, this can be set to True or False to specify whether the written sequences will have quality values. This is used in two ways:
- If the output format cannot be determined (unrecognized extension etc.), no exception is raised, but FASTA or FASTQ format is chosen appropriately.
- When False (no qualities available), an exception is raised when the auto-detected output format is FASTQ.
opener – A function that is used to open the files if they are not already open file-like objects. By default, xopen is used, which can also open compressed file formats.
open_threads – By default, dnaio opens files in the main thread. When open_threads is greater than 0, external processes are opened for compressing and decompressing files. This decreases wall clock time at the cost of a little extra overhead. This parameter does not work when a custom opener is set.
compression_level – By default dnaio uses compression level 1 for writing gzipped files as this is the fastest. A higher level can be set using this parameter. This parameter does not work when a custom opener is set.
The SequenceRecord class¶
- class dnaio.SequenceRecord¶
A named sequence with optional quality values. This typically represents a record from a FASTA or FASTQ file. The readers returned by dnaio.open yield objects of this type when mode is set to "r".
- name¶
The read header
- Type:
str
- sequence¶
The nucleotide (or amino acid) sequence
- Type:
str
- qualities¶
None if no quality values are available (such as when the record comes from a FASTA file). If quality values are available, this is a string that contains the Phred-scaled qualities encoded as ASCII(qual+33) (as in FASTQ).
- Type:
str
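The ASCII(qual+33) encoding can be decoded with plain Python. The following is a small sketch of that arithmetic; the function names phred_scores and encode_phred are hypothetical helpers, not part of dnaio.

```python
def phred_scores(qualities):
    """Decode a Phred+33 quality string into a list of integer scores."""
    return [ord(char) - 33 for char in qualities]


def encode_phred(scores):
    """Encode integer Phred scores as a Phred+33 quality string."""
    return "".join(chr(score + 33) for score in scores)


scores = phred_scores("!5I")
```

Here "!" (ASCII 33) decodes to 0, "5" (ASCII 53) to 20 and "I" (ASCII 73) to 40.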
- Raises:
ValueError – One of the provided attributes is not ASCII or the lengths of sequence and qualities differ
- __init__(name: str, sequence: str, qualities: str | None = None)¶
- __getitem__()¶
Slice this SequenceRecord. If the qualities attribute is not None, it is sliced accordingly. The read name is copied unchanged.
- Returns:
A new
SequenceRecord
object representing the sliced sequence.
- __len__()¶
- Returns:
The number of characters in the sequence
- comment¶
The header part after the first whitespace. This is usually used to store metadata. It may be empty in which case the attribute is None.
- fastq_bytes(two_headers=False)¶
Format this record in FASTQ format
- Parameters:
two_headers (bool) – If True, repeat the header (after the @) on the third line (after the +)
- Returns:
A bytes object with the formatted record. This can be written directly to a file.
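The four-line FASTQ layout and the effect of two_headers can be illustrated with a pure-Python sketch. This is not how dnaio implements fastq_bytes; the function format_fastq is a hypothetical stand-in that only shows the expected byte layout.

```python
def format_fastq(name, sequence, qualities, two_headers=False):
    """Return one FASTQ record as bytes (illustrative, not dnaio's code)."""
    # The third line is "+" alone, or "+" followed by the repeated header.
    third_line = "+" + (name if two_headers else "")
    text = f"@{name}\n{sequence}\n{third_line}\n{qualities}\n"
    return text.encode("ascii")


plain = format_fastq("r1", "ACGT", "IIII")
repeated = format_fastq("r1", "ACGT", "IIII", two_headers=True)
```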
- id¶
The header part before any whitespace. This is the unique identifier for the sequence.
- is_mate(other)¶
Check whether this instance and another are part of the same read pair
Checking is done by comparing IDs. The ID is the part of the name before the first whitespace. Any 1, 2 or 3 at the end of the IDs is excluded from the check as forward reads may have a 1 appended to their ID and reverse reads a 2 etc.
- Parameters:
other (SequenceRecord) – The object to compare to
- Returns:
Whether this and other are part of the same read pair.
- Return type:
bool
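The comparison rule described for is_mate can be sketched in pure Python. This is a simplified reading of the description above, not dnaio's actual implementation; the helpers read_id and ids_match are hypothetical.

```python
def read_id(name):
    """Return the part of the header before the first whitespace,
    with one trailing 1, 2 or 3 removed (simplified sketch)."""
    parts = name.split(maxsplit=1)
    identifier = parts[0] if parts else ""
    if identifier and identifier[-1] in "123":
        identifier = identifier[:-1]
    return identifier


def ids_match(name1, name2):
    """Compare two read names the way the is_mate description suggests."""
    return read_id(name1) == read_id(name2)


same_pair = ids_match("read7/1 extra comment", "read7/2")
different = ids_match("readA", "readB")
```

Under this rule, IDs that differ only in a final 1, 2 or 3 (for example "read1" and "read2") are also treated as matching.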
- qualities_as_bytes()¶
Return the qualities as a bytes object.
This is a faster version of record.qualities.encode('ascii').
- reverse_complement()¶
Return a reverse-complemented version of this record.
The name remains unchanged.
The sequence is reverse complemented.
If quality values exist, their order is reversed.
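The three rules above can be sketched without dnaio. This is an illustrative version that only complements A, C, G and T (upper and lower case) and leaves every other character unchanged; how the library treats ambiguity codes is not covered here, and the function name is hypothetical.

```python
# Translation table for the four unambiguous nucleotides.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(sequence, qualities=None):
    """Return (reverse-complemented sequence, reversed qualities or None)."""
    rc_sequence = sequence.translate(_COMPLEMENT)[::-1]
    # Qualities are only reversed, never complemented.
    rc_qualities = qualities[::-1] if qualities is not None else None
    return rc_sequence, rc_qualities


with_qualities = reverse_complement("AACG", "ABCD")
without_qualities = reverse_complement("ACGT")
```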
Reader and writer interfaces¶
- class dnaio.SingleEndReader¶
- abstract __iter__()¶
Iterate over an input containing sequence records
- Yields:
SequenceRecord objects
- Raises:
FileFormatError – if there was a parse error
- Return type:
Iterator[SequenceRecord]
- class dnaio.PairedEndReader¶
- abstract __iter__()¶
Iterate over an input containing paired-end records
- Yields:
Pairs of SequenceRecord objects
- Raises:
FileFormatError – if there was a parse error or if reads are improperly paired, that is, if there are more reads in one file than the other or if the record IDs do not match (according to SequenceRecord.is_mate).
- Return type:
Iterator[Tuple[SequenceRecord, SequenceRecord]]
- class dnaio.SingleEndWriter¶
- abstract write(record)¶
Write a SequenceRecord to the output.
- Parameters:
record (SequenceRecord)
- Return type:
None
- class dnaio.PairedEndWriter¶
- abstract write(record1, record2)¶
Write a pair of SequenceRecord objects to the paired-end output.
This method does not verify that both records have matching IDs because this was already done at parsing time. If it is possible that the record IDs no longer match, check that record1.is_mate(record2) returns True before calling this method.
- Parameters:
record1 (SequenceRecord)
record2 (SequenceRecord)
- Return type:
None
- class dnaio.MultipleFileWriter¶
- abstract write(*records)¶
Write N SequenceRecords to the output. N must be equal to the number of files the MultipleFileWriter was initialized with.
This method does not check whether the records are properly paired.
- Parameters:
records (SequenceRecord)
- Return type:
None
- abstract write_iterable(list_of_records)¶
Iterate over the list (or other iterable container) and write all N-tuples of SequenceRecord to disk. N must be equal to the number of files the MultipleFileWriter was initialized with.
This method does not check whether the records are properly paired. This method may provide a speed boost over calling write for each tuple of SequenceRecords individually.
- Parameters:
list_of_records (Iterable[Tuple[SequenceRecord, ...]])
Reader and writer classes¶
The dnaio.open function returns an instance of one of the following classes. They can also be used directly if needed.
- class dnaio.FastaReader(file, *, keep_linebreaks=False, sequence_class=<class 'dnaio._core.SequenceRecord'>, opener=<function xopen>, _close_file=None)¶
Bases: BinaryFileReader, SingleEndReader
Reader for FASTA files
While this class can be instantiated directly, the recommended way is to use dnaio.open with appropriate arguments.
- Parameters:
file (PathLike | str | BinaryIO)
keep_linebreaks (bool)
_close_file (bool | None)
- class dnaio.FastaWriter(file, *, line_length=None, opener=<function xopen>, _close_file=None)¶
Bases: FileWriter, SingleEndWriter
Write FASTA-formatted sequences to a file
While this class can be instantiated directly, the recommended way is to use dnaio.open with appropriate arguments unless you need to set the line_length argument.
- Parameters:
line_length (int | None) – Wrap sequence lines after this many characters (None disables wrapping)
file (PathLike | str | BinaryIO)
_close_file (bool | None)
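What wrapping after line_length characters means can be shown with a short sketch. It assumes line_length is either None or a positive integer; wrap_sequence is a hypothetical helper and not the FastaWriter implementation.

```python
def wrap_sequence(sequence, line_length=None):
    """Split a sequence into lines of at most `line_length` characters.

    None disables wrapping and returns the sequence as a single line.
    """
    if line_length is None:
        return [sequence]
    return [
        sequence[start:start + line_length]
        for start in range(0, len(sequence), line_length)
    ]


wrapped = wrap_sequence("ACGTACGTAC", line_length=4)
unwrapped = wrap_sequence("ACGTACGTAC")
```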
- class dnaio.FastqReader(file, *, sequence_class=<class 'dnaio._core.SequenceRecord'>, buffer_size=131072, opener=<function xopen>, _close_file=None)¶
Bases: BinaryFileReader, SingleEndReader
Reader for FASTQ files. Does not support multi-line FASTQ files.
While this class can be instantiated directly, the recommended way is to use dnaio.open with appropriate arguments.
- Parameters:
file (PathLike | str | BinaryIO)
buffer_size (int)
_close_file (bool | None)
- class dnaio.FastqWriter(file, *, two_headers=False, opener=<function xopen>, _close_file=None)¶
Bases: FileWriter, SingleEndWriter
Write records in FASTQ format
While this class can be instantiated directly, the recommended way is to use dnaio.open with appropriate arguments unless you need to set two_headers to True.
- Parameters:
two_headers (bool) – If True, the header is repeated on the third line of each record after the “+”.
file (PathLike | str | BinaryIO)
_close_file (bool | None)
- class dnaio.BamReader(file, *, sequence_class=<class 'dnaio._core.SequenceRecord'>, buffer_size=131072, opener=<function xopen>, _close_file=None, with_header=True)¶
Bases: BinaryFileReader, SingleEndReader
Reader for BAM files.
All records in the input BAM must be unmapped single-end reads (with a flag value of 4).
While this class can be instantiated directly, the recommended way is to use dnaio.open with appropriate arguments.
- Parameters:
file (PathLike | str | BinaryIO)
buffer_size (int)
_close_file (bool | None)
with_header (bool)
- class dnaio.TwoFilePairedEndReader(file1, file2, *, mode='r', fileformat=None, opener=<function xopen>)¶
Bases: PairedEndReader
Read paired-end reads from two files (not interleaved)
While this class can be instantiated directly, the recommended way is to use dnaio.open with appropriate arguments.
- Parameters:
file1 (str | PathLike | BinaryIO)
file2 (str | PathLike | BinaryIO)
fileformat (str | None)
- class dnaio.TwoFilePairedEndWriter(file1, file2, *, fileformat='fastq', qualities=None, opener=<function xopen>, append=False)¶
Bases: PairedEndWriter
Write paired-end reads to two files (not interleaved)
While this class can be instantiated directly, the recommended way is to use dnaio.open with appropriate arguments.
- Parameters:
file1 (str | PathLike | BinaryIO)
file2 (str | PathLike | BinaryIO)
fileformat (str | None)
qualities (bool | None)
append (bool)
- class dnaio.InterleavedPairedEndReader(file, *, mode='r', fileformat=None, opener=<function xopen>)¶
Bases: PairedEndReader
Read paired-end reads from an interleaved FASTQ file
While this class can be instantiated directly, the recommended way is to use dnaio.open with appropriate arguments.
- Parameters:
file (str | PathLike | BinaryIO)
fileformat (str | None)
- class dnaio.InterleavedPairedEndWriter(file, *, fileformat='fastq', qualities=None, opener=<function xopen>, append=False)¶
Bases: PairedEndWriter
Write paired-end reads to an interleaved FASTA or FASTQ file
While this class can be instantiated directly, the recommended way is to use dnaio.open with appropriate arguments.
- Parameters:
file (str | PathLike | BinaryIO)
fileformat (str | None)
qualities (bool | None)
append (bool)
- class dnaio.MultipleFileReader(*files, fileformat=None, opener=<function xopen>)¶
Read multiple FASTA/FASTQ files simultaneously. Useful when additional FASTQ files with extra information are supplied (UMIs, indices etc.).
While this class can be instantiated directly, the recommended way is to use dnaio.open with appropriate arguments.
- Parameters:
files (str | PathLike | BinaryIO)
fileformat (str | None)
- __iter__()¶
Iterate over multiple inputs containing records
- Yields:
N-tuples of SequenceRecord objects where N is equal to the number of files.
- Raises:
FileFormatError – if there was a parse error or if reads are improperly paired, that is, if there are more reads in one file than the others or if the record IDs do not match (according to records_are_mates).
- Return type:
Iterator[Tuple[SequenceRecord, …]]
- class dnaio.MultipleFastaWriter(*files, opener=<function xopen>, append=False)¶
Bases: MultipleFileWriter
Write multiple FASTA files simultaneously.
While this class can be instantiated directly, the recommended way is to use dnaio.open with appropriate arguments.
- Parameters:
files (str | PathLike | BinaryIO)
append (bool)
- class dnaio.MultipleFastqWriter(*files, opener=<function xopen>, append=False)¶
Bases: MultipleFileWriter
Write multiple FASTQ files simultaneously.
While this class can be instantiated directly, the recommended way is to use dnaio.open with appropriate arguments.
- Parameters:
files (str | PathLike | BinaryIO)
append (bool)
Chunked reading of sequence records¶
The following functions can be used to very quickly split up the input file(s) into similarly-sized chunks without actually parsing the records. The chunks can then be distributed to worker threads or subprocesses and be parsed and processed there.
- dnaio.read_chunks(f, buffer_size=4194304)¶
Read chunks of complete FASTA or FASTQ records from a file. If the format is detected to be FASTQ, all chunks except possibly the last contain an even number of records such that interleaved paired-end reads remain in sync. The yielded memoryview objects are only valid for one iteration because the internal buffer is re-used in the next iteration.
- Parameters:
f (BufferedIOBase) – File with FASTA or FASTQ reads; must have been opened in binary mode
buffer_size (int) – Largest allowed chunk size
- Yields:
memoryview representing the chunk. This becomes invalid on the next iteration.
- Raises:
ValueError – A FASTQ record was encountered that is larger than buffer_size.
UnknownFileFormat – The file format could not be detected (the first byte must be “@”, “>” or “#”)
- Return type:
Iterator[memoryview]
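Why a yielded memoryview must be copied before the next iteration can be shown with the standard library alone. This sketch only simulates buffer reuse with a bytearray; it does not call read_chunks, and the variable names are illustrative.

```python
# A reusable buffer holding one FASTQ record, and a view into it.
buffer = bytearray(b"@r1\nAC\n+\nII\n")
chunk = memoryview(buffer)

# bytes() makes an independent copy that survives later buffer changes.
saved = bytes(chunk)

# Simulate the next iteration overwriting the same buffer in place
# (same length, so the exported view stays valid but its content changes).
buffer[:] = b"@r2\nGT\n+\n##\n"

stale = bytes(chunk)  # the view now shows the new data
```

When chunks are handed to worker threads or subprocesses, copying each chunk first (for example with bytes(chunk)) avoids reading data that has already been replaced.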
- dnaio.read_paired_chunks(f, f2, buffer_size=4194304)¶
Read chunks of paired-end FASTA or FASTQ records from two files. A pair of chunks (memoryview objects) is yielded on each iteration, and both chunks are guaranteed to have the same number of sequences. That is, the paired-end reads will stay in sync.
The memoryviews are only valid for one iteration because the internal buffer is re-used in the next iteration.
This is similar to read_chunks, but for paired-end data.
- Parameters:
f (BufferedIOBase) – File with R1 reads; must have been opened in binary mode
f2 (BufferedIOBase) – File with R2 reads; must have been opened in binary mode
buffer_size (int) – Largest allowed chunk size
- Yields:
Pairs of memoryview objects.
- Raises:
ValueError – A FASTA or FASTQ record was encountered that is larger than buffer_size.
- Return type:
Iterator[Tuple[memoryview, memoryview]]
Functions¶
- dnaio.records_are_mates(*args)¶
Check if the provided SequenceRecord objects are all mates of each other by comparing their record IDs. Accepts two or more SequenceRecord objects.
This is the same as SequenceRecord.is_mate in the case of only two records, but allows for cases where information is split into three records or more (such as UMI, R1, R2 or index, R1, R2).
If there are only two records to check, prefer SequenceRecord.is_mate.
Example usage:
    for records in zip(*all_my_fastq_readers):
        if not records_are_mates(*records):
            raise MateError(f"IDs do not match for {records}")
- Parameters:
*args – two or more SequenceRecord objects
- Returns:
True or False
- Return type:
bool
Exceptions¶
- exception dnaio.UnknownFileFormat¶
The file format could not be automatically detected
- exception dnaio.FileFormatError(msg, line)¶
The file is not formatted correctly
- Parameters:
msg (str)
line (int | None)
- line¶
If available, the number of the line at which the error occurred or None if not. The first line has index 0.
- exception dnaio.FastaFormatError(msg, line)¶
Bases:
FileFormatError
The FASTA file is not formatted correctly
- Parameters:
msg (str)
line (int | None)
- exception dnaio.FastqFormatError(msg, line)¶
Bases:
FileFormatError
The FASTQ file is not formatted correctly
- Parameters:
msg (str)
line (int | None)