# System Design Flow and Fixed-point Arithmetic Lecture 3

#### Dr. Shoab A. Khan

### Contents

- Floating and Fixed Point Arithmetic
- System Design Flow
  - Requirements and Specifications (R&S)
  - Algorithmic Development in Matlab and Coding Guidelines
- 2's Complement Arithmetic
- Floating Point Format
- Qn.m format for Fixed Point Arithmetic
- Addition, Multiplication and Scaling in Qn.m
- LTI systems and implementation in Qn.m format

## Floating & Fixed Point Arithmetic

- Two Types of arithmetic
  - **Floating Point Arithmetic** 
    - After each arithmetic operation numbers are normalized
    - Used where precision and dynamic range are important
    - Most algorithms are developed in FP
      - Ease of coding
    - More Cost (Area, Speed, Power)
  - Fixed Point Arithmetic
    - Place of decimal is fixed
    - Simpler HW, low power, less silicon
    - Converting FP simulation to Fixed-point simulation is time consuming
    - Multiplication doubles the number of bits
       NxN multiplier produces 2N bits
    - The code is less readable, need to worry about overflow and scaling issues

### System-level Design Flow and Fixed-point Arithmetic



## System Design Flow

- The requirements and specifications of the application are captured
- The algorithms are then developed in double precision floating point format
  - Matlab or C/C++
- A signal processing system general consists of hybrid target technologies
  - DSPs, FPGAs, ASICs
- For mapping application developed in double precision is partitioned into
  - hardware & software
- Most of signal processing applications are mapped on Fixed-point Digital Signal Processors or HW in ASICs or FPGAs
- The HW and SW components of the application are CONVERTED into Fixed Point format for this mapping

## 1st Step: Requirement & Specification

- Gathering R&S is the first part of the system design
- System components and algorithms are then selected that meets the requirements
- Example R&S of a UHF radio are listed in the table

| Characteristics   | Specification                       |
|-------------------|-------------------------------------|
| Output power      | 2W                                  |
| Spurious emission | < 60 dB                             |
| Harmonic          | > 55 dB                             |
| suppression       |                                     |
| Frequency         | 2 ppm or better                     |
| stability         |                                     |
| Reliability       | > 10,000 hours MTBF minimum         |
|                   | < 30 minutes MTTR                   |
| Handheld          | 12V DC nickel metal hydride, nickel |
|                   | cadmium or lithium-ion              |
|                   | (rechargeable) battery pack         |

## R&S of a UHF Radio (Cont)

| Characteristic    | Specification                                        |
|-------------------|------------------------------------------------------|
| Frequency range   | 420 MHz to 512 MHz                                   |
| Data rate         | Up to 512 kbps multi-channel non-line of sight       |
| Channel           | Multi-path with 15 µs delay spread and 220 km/h      |
|                   | relative speed between transmitter and receiver      |
| Modulation        | OFDM supporting BPSK, QPSK and QAM                   |
| FEC               | Turbo codes, convolution, Reed–Solomon               |
| Frequency hopping | > 600 hops/s, frequency hopping on full hopping band |
| Waveforms         | Radio works as SDR and should be capable of          |
|                   | accepting additional waveforms                       |

### Next Step: Algorithm Development and Mapping

- The R&S related to digital design are forwarded to algorithm developers and system designers
- Algorithms are coded in behavioral modeling tools like Matlab
- The Matlab code is then translated into a high level language, for example, C/C++
- System is designed based on R&S
  - System usually consists of hybrid technologies consisting of ASICs, DSPs, GPP, and FPGAs
- Partitioning of the application into HW/SW parts is performed
- The SW is then developed for the SW part and architectures are designed and implemented for the HW parts
- Integration and testing is performed throughout the design cycle

### Guidelines for Coding Algorithms in Matlab

- Signal processing applications are mostly developed in Matlab
- As the Matlab code is to be mapped in HW and SW so adhering to coding guidelines is critical
  - The code must be designed to work for processing of data in chunks
  - The code should be structured in distinct components
    - Well defined interfaces in terms of input and output arguments and internal data storages
  - All variables and constants should be defined in data structures
    - User defined configurations in one structure
    - System design constants in another structure
    - Internal states for each block in another structure
  - Initialization in the start of simulation

## Processing in Chunks

|   | % BPSK = 1, QPSK = 2, 8PSK = 3, 16QAM = 4                                    |
|---|------------------------------------------------------------------------------|
|   | <pre>% All-user defined parameters are set in structure USER_PARAMS</pre>    |
|   | USER_PARAMS.MOD_SCH = 2; %select QPSK for current simulation                 |
|   | USER_PARAMS.CHUNK_SZ = 256; %set buffer size                                 |
| - | USER_PARAMS.NO_CHUNKS = 100;% set no of chunks for simulation                |
|   | % generate raw data for simulation                                           |
|   | <pre>raw_data = randint(1, USER_PARAMS.NO_CHUNKS*USER_PARAMS.CHUNK_SZ)</pre> |
|   | % Initialize user defined, system defined parameters and states              |
|   | <pre>PARAMS = MOD_Params_Init(USER_PARAMS);</pre>                            |
|   | <pre>STATES = MOD_States_Init(PARAMS);</pre>                                 |
|   | <pre>mod_out = [];</pre>                                                     |
|   | % Code should be structured to process data on chunk-by-chunk basis          |
|   | <pre>for iter = 0:USER_PARAMS.NO_CHUNKS-1</pre>                              |
|   | in_data = raw_data                                                           |
|   | (iter*USER_PARAMS.CHUNK_SZ+1:USER_PARAMS.CHUNK_SZ*(iter+1));                 |
|   | [out_sig,STATES]= Modulator(in_data,PARAMS,STATES);                          |
|   | <pre>mod_out = [mod_out out_sig];</pre>                                      |
|   | end                                                                          |

Digital Design of Signal Processing Systems, John Wiley & Sons by Dr. Shoab A. Khan

### Parameters & Initialization

```
% Initializing the user defined parameters and system design parameters in PARAMS
function PARAMS = MOD Params Init(USER PARAMS)
% Structure for transmitter parameters
PARAMS.MOD SCH = USER PARAMS.MOD SCH;
PARAMS.SPS = 4; % Sample per symbol
% Create a root raised cosine pulse-shaping filter
PARAMS.Nyquist filter = rcosfir(.5, 5, PARAMS.SPS, 1);
% Bits per symbol, in this case bits per symbols is same as mod scheme
PARAMS.BPS = USER PARAMS.MOD SCH;
% Lookup tables for BPSK, QPSK, 8-PSK and 16-QAM using gray coding
BPSK Table = [(-1 + 0*j) (1 + 0*j)];
QPSK Table = [(-.707 - .707*j) (-.707 + .707*j) (.707 - .707*j) (.707 + .707*j)];
PSK8 Table = [(1 + 0j) (.7071 + .7071i) (-.7071 + .7071i) (0 + i)...
             (-1 + 0i) (-.7071 - .7071i) (.7071 - .7071i) (0 - 1i)];
OAM Table = [(-3 + -3*j) (-3 + -1*j) (-3 + 3*j) (-3 + 1*j) (-1 + -3*j)...
            (-1 + -1*j) (-1 + 3*j) (-1 + 1*j) (3 + -3*j) (3 + -1*j)...
            (3 + 3*j) (3 + 1*j) (1 + -3*j) (1 + -1*j) (1 + 3*j) (1 + 1*j)];
% Constellation selection according to bits per symbol
if (PARAMS.BPS == 1)
    PARAMS.const Table = BPSK Table;
elseif(PARAMS.BPS == 2)
    PARAMS.const Table = QPSK Table;
elseif(PARAMS.BPS == 3)
    PARAMS.const Table = PSK8 Table;
elseif(PARAMS.BPS == 4)
    PARAMS.const Table = QAM Table;
else
    error('ERROR!!! This constellation size not supported')
end
```

## States & Initializations

- function STATES = MOD\_States\_Init(PARAMS)
- % Pulse shaping filter delayline
- STATES.filter\_delayline =

zeros(1,length(PARAMS.Nyquist\_filter)-1);

## Chunk by chunk processing

```
function [out_data, STATES] = Modulator(in_data, PARAMS, STATES);
% Bits to symbols conversion
sym = reshape(in data, PARAMS.BPS, length(in data)/PARAMS.BPS)';
% Binary to decimal conversion
sym decimal = bi2de(sym);
% Bit to symbol mapping
const sym = PARAMS.const Table(sym decimal+1);
% Zero padding for up-sampling
up sym = upsample(const sym,PARAMS.SPS);
% Zero padded signal passed through Nyquist filter
[out data, STATES.filter delayline] =
filter(PARAMS.Nyquist_filter,1,up_sym, STATES.filter_delayline);
```

## Fixed-point v/s Floating-point Hardware

- Algorithms are developed in floating point format using tools like Matlab
- Floating point processors and HW are expensive
- Fixed-point processors and HW are used in embedded systems
- After algorithms are designed and tested then they are converted into fixed-point implementation
- The algorithms are ported on Fixed-point processor or application specific hardware

### **Digital Signal Processors**



- In a digital design fixed or floating point numbers are represented in binary format
- Types of Representation
  - one's complement
  - sign magnitude
  - canonic sign digit (CSD)
  - two's complement
- In digital system design for fixed point implementation the canonic sign digit (CSD), and two's complement are normally used

### 2's Complement Arithmetic



## 2's Complement Arithmetic

- MSB has negative weight,
  - Positive number  $a_{N-1} = 0$
  - Negative number  $a_{N-1} = 1$

## Example



## **Equivalent Representation**

Many design tools do not display numbers as 2's complement signed numbers A signed number is represented as an equivalent unsigned number Equivalent unsigned value of an N-bit negative number is

2<sup>N</sup> - |a|

Example for -5 = 1011

N=4 a=-5 2<sup>4</sup> - |-5| = 16 - 5= +11

In binary it is equivalent rep is 1011

#### Four-bit representation of two's complement and equivalent unsigned numbers

| Decimal | Two' | Two's complement representation |                |                |                | Equivalent unsigned |
|---------|------|---------------------------------|----------------|----------------|----------------|---------------------|
| number  |      |                                 |                |                |                | number              |
|         |      | - <b>2</b> 3                    | 2 <sup>2</sup> | 2 <sup>1</sup> | 2 <sup>0</sup> |                     |
| 0       |      | 0                               | 0              | 0              | 0              | 0                   |
| +1      |      | 0                               | 0              | 0              | 1              | 1                   |
| +2      |      | 0                               | 0              | 1              | 0              | 2                   |
| +3      |      | 0                               | 0              | 1              | 1              | 3                   |
| +4      |      | 0                               | 1              | 0              | 0              | 4                   |
| +5      |      | 0                               | 1              | 0              | 1              | 5                   |
| +6      |      | 0                               | 1              | 1              | 0              | 6                   |
| +7      |      | 0                               | 1              | 1              | 1              | 7                   |
| -8      |      | 1                               | 0              | 0              | 0              | 8                   |
| -7      |      | 1                               | 0              | 0              | 1              | 9                   |
| -6      |      | 1                               | 0              | 1              | 0              | 10                  |
| -5      |      | 1                               | 0              | 1              | 1              | 11                  |
| -4      |      | 1                               | 1              | 0              | 0              | 12                  |
| -3      |      | 1                               | 1              | 0              | 1              | 13                  |
| -2      |      | 1                               | 1              | 1              | 0              | 14                  |
| -1      |      | 1                               | 1              | 1              | 1              | 15                  |

### Computing Two's Complement of a Signed Number

- Refers to the negative of a number
- Computes by inverting all bits and adding 1 to the least significant bit (LSB) of the number
- This is equivalent to inverting all the bits while moving from LSB to MSB and leaving the least significant 1 as it is
- From the hardware perspective, adding 1 requires an adder in the HW and is an expensive preposition

Scaling to Avoid Overflows

Overflow adds an error that equals to the complete dynamic range of the number

- To avoid overflow, numbers are scaled down before mathematical operations
- Sign Extension

Numbers may also need to be sign extended

- An N bit number is extended to an M bit number M > N, by replicating M-N sign bits to the most significant bit positions
  - Positive number: M-N extended bits are filled with 0s
    - The number unsigned value remains the same
  - Negative number: M-N extended bits are filled with 1s,
    - Signed value remains the same
    - Equivalent unsigned value is changed

4'b1000 2's complement sign number is sign extend to 8'b1111\_1000

## Dropping Redundant Sign bits

- When a number has redundant sign bits, these redundant bits can be dropped
- This dropping of bits does not affect the value of the number

#### Example

8'b1111\_1000 = -8

Is same as

4'b1000 = -8

## Floating Point Format

- Floating point arithmetic is appropriate for high precision applications
- Applications that deals with number with wider dynamic range
- A floating point number is represented as

$$x = (-1)^s \times 1 \times m \times 2^{e-b}$$

- s represents sign of the number
- m is a fraction number >1 and < 2</p>
- e is a biased exponent, always positive
  - The bias **b** is subtracted to get the actual exponent

## IEEE floating point format for single precision 32bit floating point number



### **Example Floating Point Representation**

### 32-bit IEEE Floating point number is 0\_10000010\_11010000\_00000000\_0000000

| Sign bit                   | Exponent                        |   | Mantissa                                   |
|----------------------------|---------------------------------|---|--------------------------------------------|
| 0                          | 10000010                        |   | 11010000_0000000_0000000                   |
| ( <b>—</b> 1) <sup>0</sup> | × <b>2</b> <sup>(130-127)</sup> | × | (1 . 1 1 0 1) <sub>2</sub>                 |
| 1                          | × <b>2</b> <sup>(3)</sup>       | × | (1 + 0.5 + .25 + 0 + 0.0625) <sub>10</sub> |
| (+1)                       | × <b>2</b> <sup>(3)</sup>       | × | (1.8125) <sub>10</sub>                     |

## Floating-point Arithmetic Addition

- **S0:** Append the implied 1 of the mantissa
- **S1:** Shift the mantissa from S0 with smaller exponent es to the right by el es,

where el is the larger of the two exponents

**S2:** For negative operand take two's complement and then add the two mantissas.

If the result is negative, again takes two's complement of the result for storing in IEEE format

**S3:** Normalize the sum back to IEEE format by adjusting the mantissa

and appropriately changing the value of the exponent e

**S4:** Round or truncate the resultant mantissa to fit in IEEE format

## Example: Floating-point Addition

Add two floating point numbers in 10-bit precision 4 bits for exponent and 5 bits for the mantissa and 1 bit for the sign, bias value is 7. 0 1010 00101 0 1001 00101 Taking the bias +7 off from the exponents and appending the implied 1, the numbers are  $1.00101_{h} \times 2^{3}$  and  $1.00101_{h} \times 2^{2}$ S0: Align two exponents by shifting the mantissa of the number with smaller exponent accordingly  $1.00101_{h} \times 2^{3}$ => 1.001010<sub>b</sub> $\times$  2<sup>3</sup> 1.00101<sub>b</sub> × 2<sup>2</sup> =>  $0.100101_{b} \times 2^{3}$ **S1:** Add the mantissas  $1.001010_{\rm b} + 0.100101_{\rm b} = 1.101111_{\rm b}$ **S2:** Drop the guard bit  $1.10111_{\text{b}} \times 2^3$ **S3:** The final answer is  $1.10111_{h} \times 2^{3}$  which in 10-bit format of the operands is 0 1010 10111

## Floating-point Multiplication

- **S0:** Add the two exponents e1 and e2 and subtract the bias once
- S1: Multiply the mantissas as unsigned numbers to get the product, and XOR the two sign bits to get the sign of the product
- **S2:** Normalize the product if required
- **S3:** Round or truncate the mantissa

# Fixed-point format

Algorithms in double precision Floatingpoint format are converted into Fixed-Point format for mapping on low cost DSPs and Application Specific HW designs

## Qn.m Format for Fixed-point Arithmetic

- Qn.m format is a fixed positional number system for representing floating-point numbers
- A Qn.m format N-bit binary number assumes n bits to the left and m bits to the right of the binary point



## Qn.m Positive Number

- The MSB is the sign bit
- For a positive fixed-point number, MSB is 0

 $b = 0b_{n-2}\dots b_1b_0.b_{-1}b_{-2}\dots b_{-m}$ 

 Equivalent floating point value of the positive number is

$$b = b_{n-2} 2^{n-2} + \dots + b_1 2^1 + b_0 + b_{-1} 2^{-1} + b_{-2} 2^{-2} + \dots + b_{-m} 2^{-m}$$

 For Negative numbers, MSB has negative weight and equivalent value is

 $b = -b_{n-1}2^{n-1} + b_{n-2}2^{n-2} + \dots + b_12^1 + b_0 + b_{-1}2^{-1} + b_{-2}2^{-2} + \dots + b_{-m}2^{-m}$ 

### Conversion to Qn.m

1. Define the total number. of bits to represent a Qn.m number

```
— — (assume 10 bits)
```

2. Fix location of decimal based on the value of the number



## Example



Equivalent fixed-point integer value is

#### 0x1D0

- 2 bits for the integer and remaining 8 bits keeps the fraction part
- A 10 bit Q2.8 signed number covers -2 to +1. 9922.
- Increasing the fractional bits increases the precision.
#### In Q<sub>n.m</sub> format,

#### n entirely depends upon the range of integer

m defines the precision of the fractional part



# Fixed-point numbers in COTS DSPs

- Commercially available off the shelf processors usually have16 bits to represent a number
- In C, a programmer can define 8, 16 or 32 bit numbers as char, short and int/long respectively
- In case a variable requires different precision than what can be defined in C, the number is defined in higher precision
- For example an 18-bit number should be defined as a 32-bits integer
- High precision arithmetic is performed using low precision arithmetic operations
  - 32-bit addition requires two 16-bit addition
  - 32-bit multiplication requires four 16-bit multiplications

# Floating Point to Fixed Point Conversion

- Serialize the floating-point code to separate all atomic computations and assignments to variables
- Insert range directives after each serial floating-point computation
- Runs the serialized implementation with range directives for all set of possible inputs
- Convert all floating point varilables to fixed point format using the maximum and minimum values for each variable
- The integer part for var<sub>i</sub> is defined as

$$n_{i=\log_2}(\max|\max_{val_i}, \min_{val_i})|) + 1$$

- The fractional part m requires detail analysis of Signal to Quantization Noise analysis
  - Active area of research
  - Model based, exhaustive search based or optimization based techniques



## Range Determination for Qn.m Format

#### Run simulations for all set of inputs



# Floating-point to Fixed-point Conversion

- Shift left *m* fractional bits to the integer part and truncate or round the results
  - In Matlab, the conversion is done as

```
num_fixed = fix(num_float * 2<sup>m</sup>) or
```

```
num_fixed = round(num_float * 2<sup>m</sup>)
```

 Saturate if number is greater than the maximum positive number or less than the minimum negative number.

```
if (num_fixed > 2^{N-1} - 1)

num_fixed = 2^{N-1} - 1

else if (num_fixed < -2^{N-1})

num_fixed = -2^{N-1}
```

## Matlab Support for Fixed-Point Arithmetic

# Matlab Supports Fixed Point Arithmetic

- All the attributes of fixed point variable can be set for fixed point arithmetic
- These attributes are shown here

```
PT =
    3.1563
              DataType: Fixed
               Scaling: Binary Point
                Signed: true
            WordLength: 8
        FractionLength: 5
             RoundMode: round
          OverflowMode: saturate
           ProductMode: Full Precision
 MaxProductWordLength: 128
               SumMode: Full Precision
      MaxSumWordLength: 128
         CastBeforeSum: true
```

#### Conversion using C Program

```
num_fixed_long = (long)(num_float * 2<sup>15</sup>)
if (num_fixed_long > 0x7fff)
    num_fixed_long = 0x7fff
elseif (num_fixed_long < 0xffff8000)
    num_fixed_long = 0xffff8000</pre>
```

```
num_fixed_Q15 = (short)(num_fixed_long & 0xffff)
```

Using this logic, the following lists a few floating-point numbers (num\_float) and their representation in Q1.15 fixed-point format (num\_fixed\_Q15):  $0.5 \rightarrow 0x4000$  $-0.5 \rightarrow 0xC000$  $0.9997 \rightarrow 0x7FFF$  $0.213 \rightarrow 0x1B44$  $-1.0 \rightarrow 0x8000$ 

# SystemC supports Fixed-point

- SystemC provides a very convenient tool to covert a floating point implementation to fixed point format
- Uses floating point implementation with redefined floating point variables

#### SystemC supports Fixed-point

```
int sc main (int argc , char *argv[])
    {
    sc fixed <16,10, SC TRN, SC SAT> fx value;
    double fp value;
    int i, size;
         ofstream fout("fx file.txt");
         ifstream fin("fp file.txt");
         if (fin.is open())
                  fin >> size;
         else
                  cout << "Error opening input file!\n";</pre>
         cout << "size = " << size << endl;</pre>
         for (i=0; i<size; i++)</pre>
         {
                  if (!fin.eof())
                   {
                            fin >> fp value;
                            fx value = fp value;
                            cout << "double = " << fp value"\t fixpt =</pre>
"<<fx value<<endl;
                            fout << fx value<< endl;</pre>
                   }
```

## Arithmetic: Addition in Q Format

Addition of two fixed-point numbers a and b of Qn1.m1 and Qn2.m2 formats, respectively, results in a Qn.m format number, where n is the larger of n1 and n2 and m is the larger of m1 and m2.

#### Example

| implied decimal   |   |   |   |     |   |   |   |   |                                     |
|-------------------|---|---|---|-----|---|---|---|---|-------------------------------------|
| $Qn_1.m_1$        | 1 | 1 | 1 | 1 • | 1 | 0 |   |   | = Q4.2 = -2+1+0.5 = -0.5            |
| $Qn_2.m_2$        | 0 | 1 | 1 | 1   | 0 | 1 | 1 | 0 | = Q4.4 = 1+2+4+025+0.125 = 7.375    |
| Qn <sub>.</sub> m | 0 | 1 | 1 | 0   | 1 | 1 | 1 | 0 | = Q4.4 = 2+4+0.5+0.25+0.125 = 6.875 |

# Multiplication in Q Format

 Multiplications of two numbers in Qn1.m1 and Qn2.m2 formats generates product in

Q(n1 + n2).(m1 + m2) format

- For signed x signed multiplication the MSB is a redundant sign bit
  - First Drop this bit by shifting the result to the left
  - Then Drop less precision bits to bring the number back to less number of bits

#### Multiplication in *Q*-Format

#### $Q_{n1.m1} X Q_{n2.m2} = Q (n1+n2) . (m1+m2)$

Four types of Fractional Multiplication:

Unsigned Unsigned

Unsigned Signed

Signed Unsigned

Signed Signed

Signed x Signed multiplication, results in a redundant sign bit

# Unsigned by Unsigned

 The partial products are added without any sign extension logic

# Signed by Unsigned

- Sign extension of each partial product is necessary in signed-unsigned multiplication.
- The partial products are first sign-extended and then added

# Unsigned by Signed

- The unsigned multiplicand is changed to a signed positive number
  - its 2's complement is computed if the signed number is negative
- The last product is a signed number

1 0 0 1 = 10.01 in Q2.2 = 2.25 (unsigned) <u>1 1 0 1</u> = 11.01 in Q2.2 = -0.75 (signed) <u>1 0 0 1</u> <u>0 0 0 0 X</u> <u>1 0 1 1 X X X</u> <u>1 0 1 1 1 X X X</u> <u>2's compliment of the positive multiplicand 01001</u> <u>1 1 1 0 0 1 0 1 = 1110.0101 in Q4.4 i.e.-1.6875</u>

# Signed by Signed

- Sign extend all partial products
- Takes 2's complement of the last partial product if multiplier is a negative number.
- The MSB of the product is a redundant sign bit
  - Removed the bit by shifting the product to left, the product is in

#### Example: Signed x Signed

$$1 1. 0 1 = -0.75 \text{ in } Q2.2 \text{ format}$$

$$1. 1 0 1 = -0.375 \text{ in } Q1.3 \text{ format}$$

$$1 1 1 1 1 1 0 1$$

$$0 0 0 0 0 0 0 X$$

$$1 1 1 1 0 1 X X$$

$$0 0 0 1 1 X X X$$

0 0 0 0 1 0 0 1 = shifting left by one 00.010010 in Q2.6 format is 0.2815

#### Corner Case:

Signed-Signed Fractional Multiplication

 -1x-1 = -1 in Q fractional format is a corner case, it should be checked and result should be saturated as max positive number

|   |   |   | 1 | 0 | 0 | = Q1.2 = -1                                                 |
|---|---|---|---|---|---|-------------------------------------------------------------|
|   |   |   | 1 | 0 | 0 | = Q1.2 = -1                                                 |
| 0 | 0 | 0 | 0 | 0 | 0 | -                                                           |
| 0 | 0 | 0 | 0 | 0 | Х |                                                             |
| 0 | 1 | 0 | 0 | Х | Х | 2's Compliment                                              |
| 0 | 1 | 0 | 0 | 0 | 0 | Dropping redundant sign bit<br>results 1000000 = -1 in Q1.5 |

#### **Fixed Point Multiplication**

```
Word32 L mult(Word16 var1,Word16 var2)
  {
  Word32 L var out;
  L var out = (Word32)var1 * (Word32)var2;
   if (L var out != (Word32)0x4000000L) // 0x8000 x 0x8000 =
  0x40000000
     {
       L var out *= 2; //remove the redundant bit
     }
  else
     ł
       Overflow = 1;
       L var out = 0x7fffffff; //if overflow then clamp to max +ve value
                   }
   return(L var out);
  }
```

#### Fractional and Integer Arithmetic and truncation

| 4-bit Integer Arithmetic          | Q1.3 Fractional Arithmetic                                 |
|-----------------------------------|------------------------------------------------------------|
| $1001 = -7_{10}$                  | $1.001 = -0.875_{10}$                                      |
| $0111 = +7_{10}$                  | $0.111 = 0.875_{10}$                                       |
| 11111001                          | 11.111001                                                  |
| 1111001X                          | 11.11001X                                                  |
| 111001XX                          | 11.1001XX                                                  |
| $11001111 = -49_{10}$             | $11.001111 = -0.765625_{10}$                               |
| 11001111 ⇔ 1111<br>⇔ -1(overflow) | 11.001111 	⇒ Q1.3 format<br>⇒ 1.001 = -0.875 <sub>10</sub> |

 4x4 bit multiplication can easily trimmed to get in 4 bit precision for fractional multiplication but for integer it may cause overflow

# Unified Multiplier

- The multiplier supports integer and fractional multiplication
- All types of operands
  - SxS, SxU, UxS, UxU
- Checks the corner case of fractional multiplication



# Bit Growth in Fixed-point Arithmetic

- Bit growth is one of the most critical issues in fixed-point arithmetic.
- Multiplication of an N-bit number with an M-bit number results in an (N+M)-bit number
  - In recursive system each iteration increases the width

$$y[n] = ay[n-1]+x[n]$$

# Bit Growth in Q-format arithmetic

- Multiplication of two N1 and N2 bit numbers in Qn1.m1 and Qn2.m2 results in N1+N2 bit number
  - Q(n1+n2-1).(m1+m2+1) for SxS and Q(n1+n2).(m1+m2) for the rest
- Where as addition results is

```
Qn.m = Q max(n1,n2).max(m1.m2)
```

For LTI system Schwarz's inequality is used for estimating the bit growth



#### Truncation

- In multiplication of two Q format numbers as the number of bits in the product increases
- We sacrifice precision by throwing some low precision bits of the product
- Qn1.m1 is truncated to Qn1.m2 where m2 < m1</p>

Let the product is

```
8'b01110110 (Q<sub>4.4</sub>)
```

```
4 + 2 + 1 + 0.25 + 0.125 = 7.375
```

Truncate it to Q<sub>4.2</sub> results

```
6'b011101 = 7.25
```

#### Rounding with Truncation

- Sometimes we truncate with rounding Example
  - 3.726 can be rounded off to 3.73
- We do similar things in Digital Design that is "First Round off then Truncate"
- Before rounding add 1 to the right side of the truncation point and then truncate

#### Equivalent Q formats

 In many cases Q format of a number is to be changed

Convert  $Q_{n1.m1}$  to  $Q_{n2.m2}$ 

- If n2>n1, we simply append sign bits=n2-n1 to the MSB location of n1
- If m2>m1 we simply append zeros to the LSB locations of the Fractional part of m1

# Example

Let a=11.101 (Q2.3)

is supposed to be added to a number b of  $Q_{7.7}$  format. So extending a will result in:

1111111.1010000 ( Q<sub>7.7</sub> )





#### **Overflow and Saturation**





#### 2's Complement Intermediate Overflow Property

In an iterative calculation using 2's complement arithmetic if it is guaranteed that the final result will be within precision bound of assigned fixed-point format then any number of intermediate overflows will not affect the final answer.

| 1.75<br>+1.25 | Q2.2<br>Q2.2 | 0111 2.50<br>0101 3.50 | ) intermediate overflow     |
|---------------|--------------|------------------------|-----------------------------|
| 3.00          | Q2.2         | 1100 -1.0              | 0                           |
| 3.00          | Q2.2         | 1100 -1.0<br>1011 -1.2 | 0<br>5 correct final answer |
| 1.25          | Q2.2<br>Q2.2 | 0111 1.7               |                             |



#### Example use of Intermediate Overflow property: CIC Filter

• The transfer function of a CIC filter is:

$$H(z) = 2^{-D} \left[ \frac{1 - z^{-L}}{1 - z^{-L}} \right]^{M}$$

For signed Qn.m input, the output fits in Qn.m format provided each input sample is scaled by D, where:

$$2^{-D} \le \left(\frac{1}{L}\right)^M$$

# An Mth-order CIC filter for decimating a signal by a factor of L (here, M=3)


# Bit Growth in Q-format arithmetic

 For LTI system Schwarz's inequality is used for estimating the bit growth

$$|y n| \leq \sqrt{\sum_{n=-\infty}^{\infty} h^2} n = \sqrt{\sum_{n=-\infty}^{\infty} x^2} n$$

## FIR Filter Example

 For Q1.m format inputs the outputs will mostly fit in Q1.15 if the design can guarantee that the sum of all the coefficients of an FIR filter is 1; that is:

$$\int_{n=0}^{L-1} h \quad n = 1$$

• All filters design in Matlab observe this property

>> sum(fir1(20,.1))

ans = 1.0000

# **Block Floating Point**

- Block floating-point format improves the dynamic range of numbers
- Useful in design where block of data goes though different stages of computation
- A block of data is represented with a common exponent
- All the values in the block are in fixed-point format.
- Each stage computes the number of redundant sign bits of all the values in the block

## **Block Floating Point**



## Example: Block Floating Point in FFT



# Filter Structures: DF-I and DF-II

Many ways to implement an IIR filter with

$$H(z) = \frac{Y(z)}{X(z)} = \frac{\sum_{k=0}^{N} b_{k^{z^{-k}}}}{1 + \sum_{k=1}^{M} a_{k^{z^{-k}}}}$$

$$y \not n = \sum_{k=0}^{N} b_k x \not n - k = \sum_{k=1}^{M} a_k y \not n - k$$

- All are analytical equivalent
- Direct Form-I and Direct form II or two implementations

## Filter Structures: DF-I and DF-II



# TDF-II

- Transposed Direct Form-II is another implementation that reduces the number of delays
- DF-I, DF-II, TDF-I and TDF-II all suffers from coefficient quantization
  - A filter designed using double precision may get unstable after quantization



# **Direct Fixed-point Implementation**

- A double precision filter gets unstable once quantized to even 24-bit precision
- Each coefficient quantization effects all poles
- Some of the poles may move outside the unit circle



## Second Order Cascaded Sections

- Conversion to Second Order Sections before quantization is highly desired
- Quantization of coefficient effects only the conjugate pole pair



## Parallel Implementation

 A parallel form also provides same benefits relating to stability



# Second Order Sections

- The 8<sup>th</sup> order IIR filter is designed using 2<sup>nd</sup> order section
- The direct form results in instability at 24-bit precision
- The 2<sup>nd</sup> order is stable even at 8-bit precision



## Matlab FDA Tool Box

- Matlab FDA tool box can be used to design and analyze floating and fixed point fitler
- The tool can used to port the coefficients in Matlab workspace, C/C++ header file or Verilog file
- The tool also generates fully parallel implementation in Verilog





#### Folded Structures

- Linear phase FIR filter observes symmetry or anti-symmetry
- All LP filters can be folded for reducing the number of multipliers



#### Contd...



# Summary

- Capturing Requirements & Specifications is the first step in system design
- A digital system usually consists of hybrid technologies and uses GPP, DSPs, FPGAs and ASICs
- The algorithms are developed in double precision floating point format
- Floating point HW and DSPs are expensive, for DSP applications Fixed point arithmetic is preferred
- The floating point code is converted to fixed point format and all variables are defined as Qn.m format numbers
- The fixed point arithmetic results in bit growth, the results from arithmetic computations are rounded and truncated
- In IIR filter implementation 2<sup>nd</sup> order and Parallel forms minimizes coefficient quantization
- Implementation of LP FIR should use symmetry and anti-symmetry of coefficients
- Block floating point can improve the precision for block processing of data