Tutorial¶
This should get you started with using dnaio
.
The only essential concepts to know about are
the dnaio.open
function and the SequenceRecord
object.
Reading¶
The main interface for reading and writing sequence files is the dnaio.open
function.
For example, this program reads in a FASTQ file and computes the total number of nucleotides
it contains:
import dnaio
with dnaio.open("reads.fastq.gz") as reader:
bp = 0
for record in reader:
bp += len(record)
print(f"The input file contains {bp/1E6:.1f} Mbp")
As can be seen from the .gz
file extension,
the input file is gzip-compressed.
dnaio.open
detects and handles this automatically by opening the file with
xopen.
Here, the call to dnaio.open
returns a FastqReader
object.
Iterating over it in the for
loop results in SequenceRecord
objects.
Calling len()
on a SequenceRecord
returns the number of
nucleotides in the record.
A SequenceRecord
has the attributes name
, sequence
and qualities
. All of these are str
objects.
The qualities
attribute is None
when reading FASTA files.
The following program uses the name
attribute
to check whether any sequence names are duplicated in a FASTA file:
import dnaio
seen = set()
with dnaio.open("sequences.fasta") as reader:
for record in reader:
if record.name in seen:
print(record.name, "is duplicated")
seen.add(record.name)
Writing¶
To open a sequence file for writing,
pass the mode="w"
argument to dnaio.open
:
import dnaio
with dnaio.open("onerecord.fastq.gz", mode="w") as writer:
writer.write(dnaio.SequenceRecord("name", "ACGT", "#B!#"))
Here, a FastqWriter
object is returned by dnaio.open
,
which has a write()
method that accepts a SequenceRecord
.
A possibly more common use case is to read an input file, modify the reads and write them to a new output file. The following example program shows how that can be done. It truncates all reads in the input file to a length of 30 nt and writes them to another file:
import dnaio
with dnaio.open("in.fastq.gz") as reader, dnaio.open("out.fastq.gz", mode="w") as writer:
for record in reader:
record = record[:30]
writer.write(record)
This also shows that SequenceRecord
objects support slicing:
record[:30]
returns a new SequenceRecord
object with the sequence and qualities
trimmed to the first 30 characters, leaving the name unchanged.
Paired-end data¶
Paired-end data is supported in two forms: Two separate files or interleaved.
To read from separate files, provide two input file names to the dnaio.open
function:
import dnaio
with dnaio.open("reads.1.fastq.gz", "reads.2.fastq.gz") as reader:
bp = 0
for r1, r2 in reader:
bp += len(r1) + len(r2)
print(f"The paired-end input contains {bp/1E6:.1f} Mbp")
Here, dnaio.open
returns a TwoFilePairedEndReader
.
It also supports iteration, but instead of a plain SequenceRecord
,
it returns a tuple of two SequenceRecord
instances.
To read from interleaved paired-end data,
pass interleaved=True
to dnaio.open
instead of a second file name:
...
with dnaio.open("reads.interleaved.fastq.gz", interleaved=True) as reader:
...
The PairedEndReader
classes check whether the input files are properly paired,
that is, whether they have the same number of reads in both inputs and whether the
read names match.
For this reason, always use a single call to dnaio.open
to open paired-end files
(that is, avoid opening them as two single-end files.)
To demonstrate how to write paired-end data, we show a program that reads from a single-end FASTQ file and converts the records to simulated paired-end reads by writing the first 30 nt to R1 and the last 30 nt to R2:
import dnaio
with dnaio.open("in.fastq.gz") as reader, \
dnaio.open("out.1.fastq.gz", "out.2.fastq.gz", mode="w") as writer:
for record in reader:
r1 = record[:30]
r2 = record[-30:]
writer.write(r1, r2)
The writer
in this case is a TwoFilePairedEndWriter
and its write()
method
expects two SequenceRecord
arguments.