The dnaio API#

The open function#

dnaio.open(_file: str | os.PathLike | typing.BinaryIO, *, fileformat: str | None = None, interleaved: typing.Literal[False] = False, mode: typing.Literal['r'] = 'r', qualities: bool | None = None, opener=<function xopen>, compression_level: int = 1, open_threads: int = 0) → SingleEndReader

dnaio.open(_file1: str | os.PathLike | typing.BinaryIO, _file2: str | os.PathLike | typing.BinaryIO, *, fileformat: str | None = None, interleaved: typing.Literal[False] = False, mode: typing.Literal['r'] = 'r', qualities: bool | None = None, opener=<function xopen>, compression_level: int = 1, open_threads: int = 0) → PairedEndReader

dnaio.open(_file: str | os.PathLike | typing.BinaryIO, *, interleaved: typing.Literal[True], fileformat: str | None = None, mode: typing.Literal['r'] = 'r', qualities: bool | None = None, opener=<function xopen>, compression_level: int = 1, open_threads: int = 0) → PairedEndReader

dnaio.open(_file: str | os.PathLike | typing.BinaryIO, *, mode: typing.Literal['w', 'a'], fileformat: str | None = None, interleaved: typing.Literal[False] = False, qualities: bool | None = None, opener=<function xopen>, compression_level: int = 1, open_threads: int = 0) → SingleEndWriter

dnaio.open(_file1: str | os.PathLike | typing.BinaryIO, _file2: str | os.PathLike | typing.BinaryIO, *, mode: typing.Literal['w', 'a'], fileformat: str | None = None, interleaved: typing.Literal[False] = False, qualities: bool | None = None, opener=<function xopen>, compression_level: int = 1, open_threads: int = 0) → PairedEndWriter

dnaio.open(_file: str | os.PathLike | typing.BinaryIO, *, mode: typing.Literal['w', 'a'], interleaved: typing.Literal[True], fileformat: str | None = None, qualities: bool | None = None, opener=<function xopen>, compression_level: int = 1, open_threads: int = 0) → PairedEndWriter

Open one or more FASTQ or FASTA files for reading or writing, or open one (unaligned) BAM file for reading.

Parameters:

files – One or more paths or open file-like objects: one for single-end reads, two for paired-end reads, and so on. More than two files are also supported. At least one file is required.
file1 – Deprecated keyword argument for the first file.
file2 – Deprecated keyword argument for the second file.
mode – Set to 'r' for reading, 'w' for writing or 'a' for appending. For BAM files, only reading is supported.
interleaved – If True, then there must be only one file argument that contains interleaved paired-end data.
fileformat – If None, the file format is autodetected from the file name extension. Set to 'fasta', 'fastq' or 'bam' to disable auto-detection.
qualities –
When mode is 'w' and fileformat is None, this can be set to True or False to specify whether the written sequences will have quality values. This is used in two ways:
- If the output format cannot be determined (unrecognized extension etc.), no exception is raised, but FASTA or FASTQ format is chosen appropriately.
- When False (no qualities available), an exception is raised when the auto-detected output format is FASTQ.
opener – A function that is used to open the files if they are not already open file-like objects. By default, xopen is used, which can also open compressed file formats.
open_threads – By default, dnaio opens files in the main thread. When open_threads is greater than 0, external processes are opened for compressing and decompressing files. This decreases wall clock time at the cost of a little extra overhead. This parameter does not work when a custom opener is set.
compression_level – By default dnaio uses compression level 1 for writing gzipped files as this is the fastest. A higher level can be set using this parameter. This parameter does not work when a custom opener is set.

The `SequenceRecord` class#

class dnaio.SequenceRecord#

A named sequence with optional quality values. This typically represents a record from a FASTA or FASTQ file. The readers returned by dnaio.open yield objects of this type when mode is set to "r".

name#

The read header

Type:: str

sequence#

The nucleotide (or amino acid) sequence

Type:: str

qualities#

None if no quality values are available (such as when the record comes from a FASTA file). If quality values are available, this is a string that contains the Phred-scaled qualities encoded as ASCII(qual+33) (as in FASTQ).

Type:: str
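
The ASCII(qual+33) encoding can be reversed with plain Python. The helper name phred_scores below is hypothetical and not part of dnaio; it only illustrates the encoding described above.

```python
def phred_scores(qualities: str) -> list[int]:
    """Decode a FASTQ quality string into integer Phred scores.

    Each character encodes one score as chr(score + 33).
    """
    return [ord(char) - 33 for char in qualities]


# "!" has ASCII code 33, "5" has 53, and "I" has 73.
scores = phred_scores("!5I")
print(scores)
```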

Raises:: ValueError – One of the provided attributes is not ASCII or the lengths of sequence and qualities differ

__init__(name: str, sequence: str, qualities: str | None = None)#

__getitem__()#

Slice this SequenceRecord. If the qualities attribute is not None, it is sliced accordingly. The read name is copied unchanged.

Returns:: A new SequenceRecord object representing the sliced sequence.

__len__()#

Returns:: The number of characters in the sequence

comment#: The header part after the first whitespace. This is usually used to store metadata. It may be empty, in which case the attribute is None.

fastq_bytes(two_headers=False)#

Format this record in FASTQ format

Parameters:: two_headers (bool) – If True, repeat the header (after the @) on the third line (after the +)
Returns:: A bytes object with the formatted record. This can be written directly to a file.

id#: The header part before any whitespace. This is the unique identifier for the sequence.

is_mate(other)#

Check whether this instance and another are part of the same read pair

Checking is done by comparing IDs. The ID is the part of the name before the first whitespace. Any 1, 2 or 3 at the end of the IDs is excluded from the check as forward reads may have a 1 appended to their ID and reverse reads a 2 etc.

Parameters:: other (SequenceRecord) – The object to compare to
Returns:: Whether this and other are part of the same read pair.
Return type:: bool
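
The comparison rule can be sketched in pure Python. This is an illustration of the rule as described above, not dnaio's actual implementation; the function names are hypothetical, and splitting on any whitespace is an assumption.

```python
def normalized_id(name: str) -> str:
    """Return the ID (text before the first whitespace) without a trailing 1, 2 or 3."""
    parts = name.split(maxsplit=1)
    read_id = parts[0] if parts else ""
    if read_id and read_id[-1] in "123":
        read_id = read_id[:-1]
    return read_id


def names_are_mates(name1: str, name2: str) -> bool:
    """Compare two read names by their normalized IDs."""
    return normalized_id(name1) == normalized_id(name2)


same_pair = names_are_mates("read7/1 lane=3", "read7/2 lane=3")
different_pair = names_are_mates("read7/1", "read8/2")
print(same_pair, different_pair)
```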

qualities_as_bytes()#

Return the qualities as a bytes object.

This is a faster version of record.qualities.encode('ascii').

reverse_complement()#

Return a reverse-complemented version of this record.

The name remains unchanged.
The sequence is reverse complemented.
If quality values exist, their order is reversed.

Reader and writer interfaces#

class dnaio.SingleEndReader#

abstract __iter__()#

Iterate over an input containing sequence records

Yields:: SequenceRecord objects
Raises:: FileFormatError – if there was a parse error
Return type:: Iterator[SequenceRecord]

class dnaio.PairedEndReader#

abstract __iter__()#

Iterate over an input containing paired-end records

Yields:: Pairs of SequenceRecord objects
Raises:: FileFormatError – if there was a parse error or if reads are improperly paired, that is, if there are more reads in one file than the other or if the record IDs do not match (according to SequenceRecord.is_mate).
Return type:: Iterator[Tuple[SequenceRecord, SequenceRecord]]

class dnaio.SingleEndWriter#

abstract write(record)#

Write a SequenceRecord to the output.

Parameters:: record (SequenceRecord) –
Return type:: None

class dnaio.PairedEndWriter#

abstract write(record1, record2)#

Write a pair of SequenceRecord objects to the paired-end output.

This method does not verify that both records have matching IDs because this was already done at parsing time. If it is possible that the record IDs no longer match, check that record1.is_mate(record2) returns True before calling this method.

Parameters:

record1 (SequenceRecord) –
record2 (SequenceRecord) –

Return type:

None

class dnaio.MultipleFileWriter#

abstract write(*records)#

Write N SequenceRecords to the output. N must be equal to the number of files the MultipleFileWriter was initialized with.

This method does not check whether the records are properly paired.

Parameters:: records (SequenceRecord) –
Return type:: None

abstract write_iterable(list_of_records)#

Iterate over the list (or other iterable container) and write all N-tuples of SequenceRecord to disk. N must be equal to the number of files the MultipleFileWriter was initialized with.

This method does not check whether the records are properly paired. This method may provide a speed boost over calling write for each tuple of SequenceRecords individually.

Parameters:: list_of_records (Iterable[Tuple[SequenceRecord, ...]]) –

Reader and writer classes#

The dnaio.open function returns an instance of one of the following classes. They can also be used directly if needed.

class dnaio.FastaReader(file, *, keep_linebreaks=False, sequence_class=<class 'dnaio._core.SequenceRecord'>, opener=<function xopen>, _close_file=None)#

Bases: BinaryFileReader, SingleEndReader

Reader for FASTA files

While this class can be instantiated directly, the recommended way is to use dnaio.open with appropriate arguments.

Parameters:

file (PathLike | str | BinaryIO) –
keep_linebreaks (bool) –
_close_file (bool | None) –

class dnaio.FastaWriter(file, *, line_length=None, opener=<function xopen>, _close_file=None)#

Bases: FileWriter, SingleEndWriter

Write FASTA-formatted sequences to a file

While this class can be instantiated directly, the recommended way is to use dnaio.open with appropriate arguments unless you need to set the line_length argument.

Parameters:

line_length (int | None) – Wrap sequence lines after this many characters (None disables wrapping)
file (PathLike | str | BinaryIO) –
_close_file (bool | None) –

class dnaio.FastqReader(file, *, sequence_class=<class 'dnaio._core.SequenceRecord'>, buffer_size=131072, opener=<function xopen>, _close_file=None)#

Bases: BinaryFileReader, SingleEndReader

Reader for FASTQ files. Does not support multi-line FASTQ files.

While this class can be instantiated directly, the recommended way is to use dnaio.open with appropriate arguments.

Parameters:

file (PathLike | str | BinaryIO) –
buffer_size (int) –
_close_file (bool | None) –

class dnaio.FastqWriter(file, *, two_headers=False, opener=<function xopen>, _close_file=None)#

Bases: FileWriter, SingleEndWriter

Write records in FASTQ format

While this class can be instantiated directly, the recommended way is to use dnaio.open with appropriate arguments unless you need to set two_headers to True.

Parameters:

two_headers (bool) – If True, the header is repeated on the third line of each record after the “+”.
file (PathLike | str | BinaryIO) –
_close_file (bool | None) –

class dnaio.BamReader(file, *, sequence_class=<class 'dnaio._core.SequenceRecord'>, buffer_size=131072, opener=<function xopen>, _close_file=None)#

Bases: BinaryFileReader, SingleEndReader

Reader for BAM files.

All records in the input BAM must be unmapped single-end reads (with a flag value of 4).

While this class can be instantiated directly, the recommended way is to use dnaio.open with appropriate arguments.

Parameters:

file (PathLike | str | BinaryIO) –
buffer_size (int) –
_close_file (bool | None) –

class dnaio.TwoFilePairedEndReader(file1, file2, *, mode='r', fileformat=None, opener=<function xopen>)#

Bases: PairedEndReader

Read paired-end reads from two files (not interleaved)

While this class can be instantiated directly, the recommended way is to use dnaio.open with appropriate arguments.

Parameters:

file1 (str | PathLike | BinaryIO) –
file2 (str | PathLike | BinaryIO) –
fileformat (str | None) –

class dnaio.TwoFilePairedEndWriter(file1, file2, *, fileformat='fastq', qualities=None, opener=<function xopen>, append=False)#

Bases: PairedEndWriter

Write paired-end reads to two files (not interleaved)

While this class can be instantiated directly, the recommended way is to use dnaio.open with appropriate arguments.

Parameters:

file1 (str | PathLike | BinaryIO) –
file2 (str | PathLike | BinaryIO) –
fileformat (str | None) –
qualities (bool | None) –
append (bool) –

class dnaio.InterleavedPairedEndReader(file, *, mode='r', fileformat=None, opener=<function xopen>)#

Bases: PairedEndReader

Read paired-end reads from an interleaved FASTQ file

While this class can be instantiated directly, the recommended way is to use dnaio.open with appropriate arguments.

Parameters:

file (str | PathLike | BinaryIO) –
fileformat (str | None) –

class dnaio.InterleavedPairedEndWriter(file, *, fileformat='fastq', qualities=None, opener=<function xopen>, append=False)#

Bases: PairedEndWriter

Write paired-end reads to an interleaved FASTA or FASTQ file

While this class can be instantiated directly, the recommended way is to use dnaio.open with appropriate arguments.

Parameters:

file (str | PathLike | BinaryIO) –
fileformat (str | None) –
qualities (bool | None) –
append (bool) –

class dnaio.MultipleFileReader(*files, fileformat=None, opener=<function xopen>)#

Read multiple FASTA/FASTQ files simultaneously. Useful when additional FASTQ files with extra information are supplied (UMIs, indices etc.).

While this class can be instantiated directly, the recommended way is to use dnaio.open with appropriate arguments.

Parameters:

files (str | PathLike | BinaryIO) –
fileformat (str | None) –

__iter__()#

Iterate over multiple inputs containing records

Yields:: N-tuples of SequenceRecord objects where N is equal to the number of files.
Raises:: FileFormatError – if there was a parse error or if reads are improperly paired, that is, if there are more reads in one file than the others or if the record IDs do not match (according to records_are_mates).
Return type:: Iterator[Tuple[SequenceRecord, …]]

class dnaio.MultipleFastaWriter(*files, opener=<function xopen>, append=False)#

Bases: MultipleFileWriter

Write multiple FASTA files simultaneously.

While this class can be instantiated directly, the recommended way is to use dnaio.open with appropriate arguments.

Parameters:

files (str | PathLike | BinaryIO) –
append (bool) –

class dnaio.MultipleFastqWriter(*files, opener=<function xopen>, append=False)#

Bases: MultipleFileWriter

Write multiple FASTQ files simultaneously.

While this class can be instantiated directly, the recommended way is to use dnaio.open with appropriate arguments.

Parameters:

files (str | PathLike | BinaryIO) –
append (bool) –

Chunked reading of sequence records#

The following functions can be used to very quickly split up the input file(s) into similarly-sized chunks without actually parsing the records. The chunks can then be distributed to worker threads or subprocesses and be parsed and processed there.

dnaio.read_chunks(f, buffer_size=4194304)#

Read chunks of complete FASTA or FASTQ records from a file. If the format is detected to be FASTQ, all chunks except possibly the last contain an even number of records such that interleaved paired-end reads remain in sync. The yielded memoryview objects are only valid for one iteration because the internal buffer is re-used in the next iteration.

Parameters:

f (RawIOBase) – File with FASTA or FASTQ reads; must have been opened in binary mode
buffer_size (int) – Largest allowed chunk size

Yields:

memoryview representing the chunk. This becomes invalid on the next iteration.

Raises:

ValueError – A FASTQ record was encountered that is larger than buffer_size.
UnknownFileFormat – The file format could not be detected (the first byte must be “@”, “>” or “#”)

Return type:

Iterator[memoryview]

dnaio.read_paired_chunks(f, f2, buffer_size=4194304)#

Read chunks of paired-end FASTA or FASTQ records from two files. A pair of chunks (memoryview objects) is yielded on each iteration, and both chunks are guaranteed to have the same number of sequences. That is, the paired-end reads will stay in sync.

The memoryviews are only valid for one iteration because the internal buffer is re-used in the next iteration.

This is similar to read_chunks, but for paired-end data.

Parameters:

f (RawIOBase) – File with R1 reads; must have been opened in binary mode
f2 (RawIOBase) – File with R2 reads; must have been opened in binary mode
buffer_size (int) – Largest allowed chunk size

Yields:

Pairs of memoryview objects.

Raises:

ValueError – A FASTA or FASTQ record was encountered that is larger than buffer_size.

Return type:

Iterator[Tuple[memoryview, memoryview]]

Functions#

dnaio.records_are_mates(*args)#

Check if the provided SequenceRecord objects are all mates of each other by comparing their record IDs. Accepts two or more SequenceRecord objects.

This is the same as SequenceRecord.is_mate in the case of only two records, but allows for cases where information is split into three records or more (such as UMI, R1, R2 or index, R1, R2).

If there are only two records to check, prefer SequenceRecord.is_mate.

Example usage:

for records in zip(*all_my_fastq_readers):
    if not records_are_mates(*records):
        raise MateError(f"IDs do not match for {records}")

Parameters:: *args – two or more SequenceRecord objects
Return type:: bool

Returns: True or False

Exceptions#

exception dnaio.UnknownFileFormat#: The file format could not be automatically detected

exception dnaio.FileFormatError(msg, line)#

The file is not formatted correctly

Parameters:

msg (str) –
line (int | None) –

line#: If available, the number of the line at which the error occurred or None if not. The first line has index 0.

exception dnaio.FastaFormatError(msg, line)#

Bases: FileFormatError

The FASTA file is not formatted correctly

Parameters:

msg (str) –
line (int | None) –

exception dnaio.FastqFormatError(msg, line)#

Bases: FileFormatError

The FASTQ file is not formatted correctly

Parameters:

msg (str) –
line (int | None) –
