Software SQPSK Modem Project - Phil Karn, KA9Q


Lately I've been working on a high performance coded 1200 bps amateur packet radio modem that can be implemented purely in software on any reasonably capable modern general purpose personal computer, e.g., a Pentium or fast 486. Through the use of Staggered Quadrature Phase Shift Keying (SQPSK) modulation, "a posteriori" demodulation and strong error correction coding, this modem performs well at Eb/N0 [1] (signal-to-noise per bit) ratios as low as 3-4 dB. This is about 6-7 dB below that required by a conventional 1200 bps BPSK PACSAT modem, but it operates in the same bandwidth.

The only hardware required, besides the PC and SSB transceiver, is a conventional sound card like the Soundblaster 16. The host CPU does all DSP. The sound card is used only for A/D and D/A conversion; no special DSP chip is required.

Although the modem code was developed for and runs best on a Pentium, it is probably fast enough to run well on a 486. (How fast a 486? I don't know yet.) The code should also run well on the PowerPC CPU family, as they also have fast floating point units.

I now have the entire modem working in a test environment. It works quite well on synthetic signals, i.e. test data files with artificially generated Gaussian noise. Once I clean up the code, I'll be ready to integrate it into KA9Q NOS for some actual on-air experiments.

Basic Design Considerations

Packet vs Stream Modems

Most amateur packet radio modems were originally intended for continuous data streams. The packet format is AX.25 in synchronous HDLC framing that was originally designed for reliable wired networks, not multiple access radio channels. The HDLC frame format has no special features to aid the modem other than the 0-bit stuffing to guarantee a certain transition density for clock recovery. In fact, certain HDLC "features" are a downright nuisance; for example, the continuous idle flag stream between PACSAT downlink frames causes serious false lock problems in BPSK Costas loop demodulators.

This modem is specifically designed to send and receive packets up to some maximum length on a multiple access packet channel. Each packet starts with a preamble to facilitate rapid and reliable acquisition in the presence of frequency uncertainty, and fairly complex software is needed to process the preamble. On the other hand, some packet modem tasks are actually simpler than in a modem designed for a continuous data stream. For example, symbol timing is established on a "one shot" basis in the preamble and allowed to free run through the rest of the packet, so scrambling and bit stuffing are not required. The sampling clocks in the sound cards are assumed accurate enough to obviate the need for continuous corrections to symbol timing.

Modem Design Choices and Rationale

Signal to Noise Ratios

One strong and easily implemented FEC code is a rate 1/2 constraint length 32 convolutional code that can be sequentially decoded with the Fano algorithm.

Assuming soft decision samples, this code requires an Eb/N0 of about +3 dB. At 1200 bps, this corresponds to a minimum C/N0 [3] of about +33.8 dB-Hz. The packet preamble must therefore be designed to allow reliable acquisition at this level or lower.
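The conversion from Eb/N0 to C/N0 simply adds the data rate expressed in dB-Hz. Here is a minimal sketch of that arithmetic; the function name is mine and is only for illustration.

```python
import math

def cn0_dbhz(ebn0_db, bit_rate):
    """C/N0 in dB-Hz from Eb/N0 in dB and the user data rate in bits/s."""
    return ebn0_db + 10 * math.log10(bit_rate)

print(round(cn0_dbhz(3.0, 1200), 1))  # 33.8
```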

Sampling Rate

The modem operates at 9600 samples per second. This is driven by two requirements. First, the rate must be at least twice as high as the highest frequency that can pass a SSB filter, plus an allowance for the anti-aliasing filter roll-off (and probable group-delay distortion) in the sound card's A/D converter circuit. Second, the sampling rate should be an even integral multiple of the modem symbol rate of 1200 baud. Thus 8x1200 = 9600 samples/sec, a common value for voiceband modems.

Choice of SQPSK

As already stated, the modem data rate is 1200 bps, the same as a standard PACSAT BPSK modem. Since the signal must be no wider than existing 1200 bps BPSK modems to fit through standard SSB filters, I chose QPSK (4-ary PSK) modulation to provide the extra channel data rate needed for the forward error correction overhead. QPSK can carry twice as many bits per second per hertz as BPSK for the same Eb/N0 ratio because the in-phase (I) and quadrature (Q) channels are orthogonal; they're effectively two independent BPSK signals sharing the same RF channel. But the free lunch stops here. Modems using more than 4 signal phases are possible, but because these phases are no longer orthogonal, more power (Eb/N0) is required.

With QPSK alone, I could achieve 2400 bps in the same bandwidth as a 1200 bps BPSK modem. But I wouldn't save any power; I'd still require an Eb/N0 of about 10 dB, and at 2400 bps I'd require double the total transmit power of a 1200 bps BPSK modem. So I instead spend the QPSK "bonus" on FEC overhead. A rate 1/2 convolutional code generates two encoded symbols for every user data bit, so this brings me back to a 1200 bps user data rate. But now I have FEC working for me, which is how I can cut my power requirements substantially without using any more bandwidth.

But even QPSK isn't really a "free lunch". It is more complex than BPSK, and it is particularly sensitive to errors in carrier phase recovery at the receiver. In BPSK, a small carrier phase error ø simply reduces the effective signal amplitude by a factor of cos(ø), which for small phase errors is nearly unity. But phase errors in QPSK also cause crosstalk between the channels that's proportional to sin(ø), and this grows much more rapidly than cos(ø) decreases. Accurate carrier phase recovery turns out to be the single hardest problem in my modem. It probably accounts for most of the implementation loss. As we shall see later, it certainly makes doppler tracking a challenge.
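To put rough numbers on this, the sketch below tabulates the wanted-signal loss, 20log10(cos ø), against the crosstalk level, 20log10(sin ø), for a few phase errors. It assumes equal-amplitude I and Q channels. The function name is hypothetical and this is not code from the modem.

```python
import math

def phase_error_effects(phi_deg):
    """Return (wanted-signal loss, crosstalk level) in dB for a carrier phase error."""
    phi = math.radians(phi_deg)
    loss_db = 20 * math.log10(math.cos(phi))       # wanted signal scaled by cos(phi)
    crosstalk_db = 20 * math.log10(math.sin(phi))  # other channel leaks in at sin(phi)
    return loss_db, crosstalk_db

for deg in (5, 10, 20):
    loss, xtalk = phase_error_effects(deg)
    print(f"{deg:2d} deg: signal {loss:6.2f} dB, crosstalk {xtalk:6.1f} dB")
```

At a 10 degree error the wanted signal drops by only about 0.13 dB, while the crosstalk is already only about 15 dB below the signal.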

The cross-talk effect can be mitigated somewhat in QPSK by "staggering" the I and Q channel data streams. That is, instead of beginning the bits in each channel at the same time, the Q channel bits change in the middle of the I bit time and vice versa. With perfect carrier phase recovery, this has no effect (other than to complicate the modem design!). But assume there's some phase error. Due to the staggering, any given I-channel symbol time sees the latter half of one Q-channel symbol and the first half of another symbol. If these two symbols happen to be the same, there's no difference from the non-staggered case; the symbols cause some crosstalk. But if the two Q-channel symbols are different, then their crosstalk into the I channel tends to cancel! If Q-channel transitions occur in 50% of the I-channel intervals, then the average crosstalk to the I channel is reduced by 3 dB, a significant advantage.
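The 3 dB figure can be checked with a toy model. The sketch below assumes rectangular pulses, 8 samples per symbol, an ideal integrate-and-dump detector and equally likely Q symbols. It averages the crosstalk power over the four possible pairs of adjacent Q symbols. All names are illustrative.

```python
import math
from itertools import product

SAMPLES_PER_SYMBOL = 8   # 9600 samples/s at 1200 baud
HALF = SAMPLES_PER_SYMBOL // 2

def i_channel_crosstalk(q_prev, q_next, phi, staggered=True):
    """Crosstalk summed over one I-channel integrate-and-dump interval.

    q_prev and q_next are +/-1 Q symbols. With staggering the I interval
    sees half of each; without it, the interval sees all of q_prev.
    """
    if staggered:
        q_sum = HALF * q_prev + HALF * q_next
    else:
        q_sum = SAMPLES_PER_SYMBOL * q_prev
    return math.sin(phi) * q_sum

def mean_crosstalk_power(phi, staggered):
    """Average crosstalk power over the four equally likely Q symbol pairs."""
    pairs = list(product((-1, 1), repeat=2))
    total = sum(i_channel_crosstalk(a, b, phi, staggered) ** 2 for a, b in pairs)
    return total / len(pairs)

phi = math.radians(10)
ratio = mean_crosstalk_power(phi, True) / mean_crosstalk_power(phi, False)
print(round(10 * math.log10(ratio), 2))  # -3.01
```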

SQPSK has another advantage over conventional QPSK in that phase transitions of 180 degrees cannot occur; all transitions are +/- 90 degrees. When a SQPSK signal is bandpass filtered, the amplitude variations that would otherwise occur on 180 degree transitions are greatly reduced.

The price for SQPSK is mainly paid in increased complexity, such as the need to integrate and dump over two different intervals at the receiver, and in the need to consider the "dangling" half symbol at the start and end of each packet.

Packet Preamble Design

Carrier Leader

Because of the desire to handle as much frequency uncertainty as possible, the preamble begins with a burst of unmodulated carrier as a carrier phase and frequency reference. This burst must be long enough to be reliably detected at the output of a bandpass filter at or below the design minimum of C/N0 = +33.8 dB-Hz.

The post-detection S/N ratio of an integrate-and-dump bandpass filter is inversely proportional to the filter's effective bandwidth, or equivalently is directly proportional to detector integration time. In other words, the longer the carrier burst the more reliably it can be detected at low signal levels.

A 37.5 Hz bandpass filter, corresponding to an integration time of 26.7 ms (32 symbol times or 256 samples), gives a post-detection S/N ratio 10log10(37.5) = 15.74 dB below its input C/N0 value (which in turn corresponds to the S/N ratio for coherent integration over one second). A C/N0 of 33.8 dB-Hz therefore corresponds to a post-detection S/N of about 18 dB, high enough to be reliably detected.
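The same calculation in code form, as a small sketch. The effective bandwidth is taken as the reciprocal of the integration time, and the function name is only for illustration.

```python
import math

SAMPLE_RATE = 9600  # samples per second

def post_detection_snr_db(cn0_dbhz, n_samples):
    """S/N after coherently integrating n_samples, given C/N0 in dB-Hz."""
    bandwidth_hz = SAMPLE_RATE / n_samples   # 256 samples -> 37.5 Hz
    return cn0_dbhz - 10 * math.log10(bandwidth_hz)

print(round(post_detection_snr_db(33.8, 256), 1))  # 18.1
```

Doubling the integration time to 512 samples would halve the bandwidth and add another 3 dB.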

A bank of 37.5 Hz bandpass filters can be implemented with a 256-point fast Fourier transform (FFT). I have written a radix-4 floating point FFT in C with an optional assembler "assist" for the Pentium that can perform a 256-point decimation-in-frequency complex FFT in about 315 microseconds. This is plenty fast for this application.
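To illustrate the filter-bank idea, here is a deliberately slow sketch that uses a plain DFT in place of the radix-4 FFT and a noise-free complex test tone. The tone frequency, the complex input format and all names are assumptions made for the example.

```python
import cmath
import math

SAMPLE_RATE = 9600
N = 256                      # one analysis window
BIN_HZ = SAMPLE_RATE / N     # 37.5 Hz per filter

def dft(x):
    """Plain O(N^2) DFT; a real modem would use an FFT, this is just for clarity."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

def strongest_bin(samples):
    """Return (bin index, power) of the strongest filter in the bank."""
    power = [abs(c) ** 2 for c in dft(samples)]
    k = max(range(len(power)), key=power.__getitem__)
    return k, power[k]

# Complex test carrier at 1500 Hz, which falls exactly on bin 40.
tone = [cmath.exp(2j * math.pi * 1500 * t / SAMPLE_RATE) for t in range(N)]
k, p = strongest_bin(tone)
print(k, k * BIN_HZ)  # 40 1500.0
```

The index of the strongest bin gives a coarse carrier frequency estimate to within one 37.5 Hz filter width.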

The carrier burst length is actually twice this long (64 symbols or 512 samples), for two reasons. First, because the packet can start at any arbitrary time, this guarantees that at least one 256-point receiver FFT sampling window will land wholly in the preamble regardless of its timing with respect to the start of the packet. Second, after the sync vector has been detected, the entire preamble is used to further refine the demodulator's estimate of carrier phase, frequency and amplitude in preparation for data demodulation. The longer carrier burst aids this process.
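The first reason is easy to check by brute force. This sketch assumes the receiver's FFT windows are taken back to back on a fixed 256-sample grid, and tries every possible alignment of the burst against that grid.

```python
WINDOW = 256   # samples per receiver FFT window
BURST = 512    # samples in the carrier burst

def window_inside_burst(burst_start):
    """Start of the first FFT window lying wholly inside the burst, or None.

    Windows are assumed to start at multiples of WINDOW.
    """
    first = -(-burst_start // WINDOW) * WINDOW   # round up to a window boundary
    if first + WINDOW <= burst_start + BURST:
        return first
    return None

# Every possible alignment of the burst against the window grid works.
print(all(window_inside_burst(s) is not None for s in range(WINDOW)))  # True
```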

Synchronization Vector

The carrier burst is followed by a 32-bit synchronization vector sent in BPSK. The sync vector is a timing reference that marks the beginning of the packet header and establishes symbol timing.

The carrier burst establishes a phase reference for the sync vector, with the phase of the carrier burst defined as binary "1" and its inverse being "0". The sync vector is hex 25555555. This value was chosen by brute force computer search for its autocorrelation properties, but further study may produce a better value.

Thanks to the regular 010101 pattern it contains, this vector has relatively high sidelobes as compared with, say, Barker or PN sequences. But this same property narrows the main lobe, thus reducing the chance that the correlator in the demodulator, perturbed by noise, might lock one or two samples off the true peak. Since the sync vector establishes a "one-shot" bit timing reference for the entire packet, it is important that it be detected precisely on time.
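The main lobe and sidelobes can be inspected with a few lines of code. This sketch maps the sync word to +/-1 chips, most significant bit first (an assumed bit order, which does not affect the autocorrelation magnitudes), and computes the aperiodic autocorrelation.

```python
SYNC = 0x25555555
NBITS = 32

def sync_chips(word=SYNC, nbits=NBITS):
    """Map bits to BPSK chips, MSB first (assumed order): 1 -> +1, 0 -> -1."""
    return [1 if (word >> (nbits - 1 - i)) & 1 else -1 for i in range(nbits)]

def autocorrelation(chips):
    """Aperiodic autocorrelation for lags 0 .. len(chips)-1."""
    n = len(chips)
    return [sum(chips[i] * chips[i + lag] for i in range(n - lag))
            for lag in range(n)]

ac = autocorrelation(sync_chips())
print("peak:", ac[0])
print("largest sidelobe magnitude:", max(abs(v) for v in ac[1:]))
```

The strongly negative value at a lag of one bit is what makes the main lobe so narrow.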

A computer simulation showed this vector to be highly reliable at the design C/N0 ratio. It is possible that a shorter, carefully chosen sync vector might perform as well, but the 32-bit overhead is not that significant.

Packet Header

Following the sync vector is the packet header. At the moment, the packet header is 128 bits long. It includes 4 bytes each for source and destination addresses [4], a one byte data field length, three spare bytes and a 4-byte FEC encoder tail that cannot be used for data. The header format is bound to change, particularly by adding fields to implement my MACA channel access protocol. After convolutional coding (see next section), the coded symbols are interleaved and SQPSK modulated. The symbols alternate between the I and Q channels, with the even numbered symbols (starting with 0) going to the I channel and the odd numbered symbols going to the Q channel. Because of the I/Q staggering, the first four samples (half symbol interval) carry nothing in the Q channel, while four extra samples are added at the end to finish the last Q channel symbol.
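The symbol-to-channel mapping and the half-symbol stagger can be sketched as follows. It assumes rectangular pulses, 8 samples per symbol, an even number of symbols, and zeros in the idle half symbols. The names are hypothetical and pulse shaping is ignored.

```python
SAMPLES_PER_SYMBOL = 8
HALF = SAMPLES_PER_SYMBOL // 2

def staggered_iq(symbols):
    """Split +/-1 coded symbols into staggered I and Q sample streams.

    Even-numbered symbols go to I, odd-numbered symbols to Q, and Q is
    delayed by half a symbol. Idle half symbols are filled with 0.
    """
    i_syms, q_syms = symbols[0::2], symbols[1::2]
    i_samples = [s for s in i_syms for _ in range(SAMPLES_PER_SYMBOL)] + [0] * HALF
    q_samples = [0] * HALF + [s for s in q_syms for _ in range(SAMPLES_PER_SYMBOL)]
    return i_samples, q_samples

header_symbols = [1, -1] * 128          # 256 placeholder symbols
i_s, q_s = staggered_iq(header_symbols)
print(len(i_s), len(q_s))               # 1028 1028
```

The 256 header symbols occupy 128 symbol times of 8 samples each, plus the 4 extra samples for the dangling half symbol.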

Data Field

Immediately following the packet header is the data field. The packet header specifies the length of the data field in bytes so the demodulator knows where the encoded field ends. If the packet is not addressed to the current station and the station is not monitoring the channel, the demodulator may simply purge the encoded data symbols, without demodulating and decoding them, and resume searching for a new packet.


There is no CRC in either the packet header or the data field. One of the properties of the sequential decoding of long constraint length nonsystematic convolutional codes is that undetected errors are highly unlikely; it is almost certain that the decoder will "time out" and fail to finish the packet rather than deliver a packet with errors. Such a decoding failure is taken as the equivalent of a CRC error.

Convolutional Encoding and Interleaving

The header and data fields are both convolutionally encoded with a rate 1/2, constraint length (K) 32 convolutional code. The code chosen is by Layland and Lushbaugh, with generator polynomials 0xf2d05351 and 0xe4613c47 (hex).
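A rate 1/2 encoder of this kind is just a shift register and two parity computations. The sketch below uses the two generator polynomials given above. The shift direction, the order of the two output symbols and the placeholder header bits are my assumptions for illustration, and it is not necessarily bit-compatible with the modem.

```python
POLY1 = 0xf2d05351
POLY2 = 0xe4613c47

def parity(x):
    """1 if x has an odd number of set bits, else 0."""
    return bin(x).count("1") & 1

def encode(bits):
    """Rate 1/2, K=32 convolutional encoder: two symbols out per bit in.

    The newest bit is shifted in at the LSB; that orientation and the
    symbol order are assumptions, not the modem's documented choice.
    """
    state = 0
    out = []
    for b in bits:
        state = ((state << 1) | (b & 1)) & 0xFFFFFFFF
        out.append(parity(state & POLY1))
        out.append(parity(state & POLY2))
    return out

header_bits = [1, 0, 1, 1] * 24 + [0] * 32   # 96 placeholder bits + 32-bit zero tail
symbols = encode(header_bits)
print(len(symbols))  # 256
```

The all-zero tail flushes the shift register so the decoder finishes in a known state, which is why those four bytes cannot carry data.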

Each field is then interleaved using bit-reversed indexing. That is, the size of the field is rounded up to the next power of two. A bit-reversed counter, starting from zero, gives the position within the frame of a particular FEC symbol. Counter values beyond the end of the field are simply skipped. For example, if the field is 256 symbols long (as is the case for the 128-bit header once it has been convolutionally encoded), then the first encoded symbol goes into location 0 of the interleaved field. The second goes to location 128, and the third to 64. These values correspond to an 8-bit counter (counting 0, 1, 2) with the bits left-right reversed.

Here's a simpler example. If the encoded field were 5 symbols long, the interleaving order would be 0, 4, 2, 1, 3. That's because 5 rounded up to the next power of 2 is 8, or 2^3. An "ordinary" 3-bit binary counter would produce the binary values 000, 001, 010, 011, 100, 101, 110 and 111. Bit reversed, these would be 000, 100, 010, 110, 001, 101, 011, 111. Or in decimal, 0, 4, 2, 6, 1, 5, 3, 7. But the values 5, 6 and 7 are out of range, so we skip those. The final interleaving order is therefore 0, 4, 2, 1, 3.
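The procedure is short enough to write out. This is a minimal sketch with function names of my own choosing, and it reproduces the 5-symbol example.

```python
def interleave_order(n):
    """Positions assigned to encoder symbols 0, 1, 2, ... in a field of n symbols."""
    bits = max(n - 1, 0).bit_length()       # counter width: n rounded up to 2**bits
    order = []
    for count in range(1 << bits):
        # Reverse the counter's bits left to right.
        pos = int(format(count, f"0{bits}b")[::-1], 2) if bits else 0
        if pos < n:                          # skip out-of-range positions
            order.append(pos)
    return order

def interleave(symbols):
    """Place each encoder symbol at its bit-reversed position."""
    out = [None] * len(symbols)
    for sym, pos in zip(symbols, interleave_order(len(symbols))):
        out[pos] = sym
    return out

print(interleave_order(5))        # [0, 4, 2, 1, 3]
print(interleave(list("abcde")))  # ['a', 'd', 'c', 'e', 'b']
```

Because every in-range position is produced exactly once, the receiver can undo the interleaving by computing the same order and reading the symbols back out.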

The purpose of interleaving is to put some time between subsequent symbols from the convolutional encoder. This protects against brief fades that take out several consecutive symbols. Convolutional decoders, especially sequential decoders, have trouble handling such burst errors. The interleaver has the effect of breaking these bursts into scattered single-bit errors that can be easily handled.

Some Implementation Issues

The modem does most of its DSP in floating point for two reasons. First, floating point avoids many of the scaling and overflow issues intrinsic to fixed point DSP. This is especially important on a general purpose CPU without the "saturation arithmetic" to handle the occasional overflow gracefully. [2]

Even more important, floating point is actually faster than integer arithmetic on the Intel Pentium. Code written to take advantage of the Pentium's floating point pipeline can rival the speed of many modern integer "DSP" chips.

More to Follow!

Mainly a discussion of the demod algorithms used to estimate carrier preamble noise, amplitude, frequency and phase (an iterated least-squares fit based on the exact same principles as those I used in 1983 to determine AO-10's orbit after the kick motor firing). Also the 4-phase decision-directed Costas loop, which works on the same principles except for a different "error" function that handles the 4-way phase ambiguity of QPSK modulation.

One brief remark about the estimation routines -- having an accurate estimate of signal amplitude and noise is absolutely invaluable. Because the loop filter "gain" in a Costas loop depends on the signal amplitude, having an accurate estimate of signal amplitude is of enormous help in making the algorithm converge rapidly on the correct phase, without under- or overdamping. And having an accurate estimate of noise power is invaluable in correctly scaling the demodulator output for the soft-decision Fano sequential decoder. Fano decoders are inherently far more sensitive to errors in S/N estimates than Viterbi decoders, so much so that in the past they've often operated with hard decision samples (where the signal amplitude has been stripped off) even though that results in a 2 dB hit in theoretical performance.


[1] Eb/N0 is the ratio of the energy per user data bit in joules (watt-seconds) to the noise spectral density in watts per hertz. Since "watts per hertz" has the same units as energy, the resulting ratio is dimensionless. It's usually expressed in decibels. The Eb/N0 ratio required by a modem is a fundamental figure of merit; the lower the Eb/N0 the less power is required to support a given user data rate on a given channel. Shannon's famous channel capacity theorem says that there is a minimum theoretical Eb/N0 requirement that any real modem must meet that is a function of the occupied bandwidth. To a certain extent, bandwidth can be traded for power. But even with infinite bandwidth, there is a fundamental lower limit on Eb/N0: -1.6 dB.

Uncoded BPSK and QPSK both require an Eb/N0 of about 9.6 dB to achieve a bit error rate of about 1 in 100,000.
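Both figures in this note can be reproduced with standard textbook formulas. The Shannon bound at a spectral efficiency of eta bits/s/Hz is Eb/N0 >= (2^eta - 1)/eta, which tends to ln 2 as eta goes to zero. The bit error rate of uncoded coherent BPSK or QPSK in white Gaussian noise is 0.5 erfc(sqrt(Eb/N0)). The sketch below only evaluates those formulas, and the function names are mine.

```python
import math

def shannon_min_ebn0_db(spectral_efficiency):
    """Minimum Eb/N0 in dB at a given spectral efficiency in bits/s/Hz."""
    eta = spectral_efficiency
    return 10 * math.log10((2 ** eta - 1) / eta)

def bpsk_ber(ebn0_db):
    """Bit error rate of uncoded coherent BPSK (or QPSK) in white Gaussian noise."""
    ebn0 = 10 ** (ebn0_db / 10)
    return 0.5 * math.erfc(math.sqrt(ebn0))

print(round(10 * math.log10(math.log(2)), 2))  # -1.59, the infinite-bandwidth limit
print(round(shannon_min_ebn0_db(1.0), 2))      # 0.0
print(f"{bpsk_ber(9.6):.1e}")
```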

[2] Intel recently announced a "MMX" extension to their CPU architecture specifically designed to enhance its integer DSP capabilities, including saturation arithmetic. But the goal here is to use widely available CPUs that have already come down the price curve.

[3] C/N0 is the ratio of total signal power to the noise spectral density. It is equal to Eb/N0 (expressed as a ratio) times the data rate. Since C/N0 is not dimensionless, it is usually expressed in dB-Hz, i.e., it is the signal-to-noise ratio that would be achieved if the entire signal were concentrated into an unmodulated carrier and received in a 1 Hz bandwidth.

[4] To save space, I envision encoding amateur callsigns much more compactly than in AX.25. Amateur callsigns can be written with only 37 distinct characters: the 26 Latin letters, the ten decimal digits and a blank for padding. A 4-byte field can therefore represent log37(2^32) = 6.14 characters, which is long enough to handle any amateur callsign in the world with room left over for special addresses like "QST".
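One way such a packing could work is to treat a callsign as a six-digit base-37 number. In this sketch the alphabet ordering, the use of a space as the 37th symbol for padding, and the function names are all hypothetical; the actual encoding has not been fixed.

```python
ALPHABET = " 0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"   # 37 symbols; ordering is hypothetical

def pack_callsign(call):
    """Pack up to six characters into a 32-bit integer using base 37."""
    if len(call) > 6:
        raise ValueError("at most six characters fit in 32 bits")
    value = 0
    for ch in call.upper().ljust(6):     # pad short callsigns with spaces
        value = value * 37 + ALPHABET.index(ch)
    return value

def unpack_callsign(value):
    """Inverse of pack_callsign; trailing padding is removed."""
    chars = []
    for _ in range(6):
        value, digit = divmod(value, 37)
        chars.append(ALPHABET[digit])
    return "".join(reversed(chars)).rstrip()

packed = pack_callsign("KA9Q")
print(packed < 2 ** 32, unpack_callsign(packed))  # True KA9Q
```

Since 37^6 = 2,565,726,409 is less than 2^32, every six-character string fits, and the unused values above 37^6 are the room left over for special addresses.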

Last modified: 7 May 1996