Skip to main content

Exceptions of floating point normalization

Floating point normalization has a great usage for computing anything very near to accuracy. A floating point number is consists of:

  1. Mantissa or significand.
  2. Exponent.

Say, I've a number 123.75. Its a floating point number. It has integer significand, 12375 and exponent -2.

So arithmatic representation is 12375 x 10-2.

How to normalize a floating point number?

- By shifting the mantissa to left until a 1 appears in most significant bits(HO). Hence, the normalized representation will be 1.2375 x 10+2. Most of the time for normalized number this bit is hidden as it happens to be 1. This is hidden bit.

Now the question when we can't normalize a floating point number?

- There are two such situations:

  1. We can't normalize zero(0). The floating point representation of Zero doesn't contain any 1 bit. However, IEEE representation for +0 and -0 has different significance.
  2. We also can't normalize a floating point number whose most significant bits in mantissa are zero as well as biased exponents are also zero.
Reference: - Floating Point

Comments

good read , however you could have explained a little more about the exceptional situation you have mentioned.
we are waiting for your new post !

Popular posts from this blog

XOR (Exclusive OR) for branchless coding

The following example shows the array reversing using the  XOR operator . No need to take any additional variable to reverse the array.   int main(int argc, _TCHAR* argv[]) { char str[] = "I AM STUDENT"; int length = strlen(str); for(int i = 0; i < ((length/2)); i++) { str[i] ^= str[length - (1+i)]; str[length - (1+i)] ^= str[i]; str[i] ^= str[length - (1+i)]; } cout << str << endl; return 0; } The above example is one of the uses of XOR but XOR comes in handy when we can do branchless coding  methods like butterfly switch etc. Sometimes this is very effective in speeding up the execution.  Let's see one of the uses of XOR in branchless coding. I am taking a simple example of Y = | X |.  Yes, I am generating abs of a supplied number. So, my function signature/definition in C++ looks like below: int absoluteBranch( int x) {     if (x < 0 ) {         return ...

Reversing char array without splitting the array to tokens

 I was reading about strdup, a C++ function and suddenly an idea came to my mind if this can be leveraged to aid in reversing a character array without splitting the array into words and reconstructing it again by placing spaces and removing trailing spaces. Again, I wanted an array to be passed as a function argument and an array size to be passed implicitly with the array to the function. Assumed, a well-formed char array has been passed into the function. No malformed array checking is done inside the function. So, the function signature and definition are like below: Below is the call from the client code to reverse the array without splitting tokens and reconstructing it. Finally, copy the reversed array to the destination.  For GNU C++, we should use strdup instead _strdup . On run, we get the following output: Demo code

Power of Two

  I n this post will be discussing how to calculate if a number is a power of two or not. As an example, 8 is a power of two but the number 10 is not. There are many ways we can solve this. First , we will take an approach which is simple and iterative. In this case, we will calculate the power of two one by one and check with the supplied number. The below code illustrates it. bool isPowerofTwo(unsigned num) { auto y = 1; while (0 != y) { if (num == y) return true; if (num < y) return false; y <<= 1; } return false; } Second , assuming, the number is a 32-bit number, this is also an iterative solution. In this scenario, iterating all bits and counting the set bits. Any number which is a power of 2 will have only one bit set and the rest will be zeros. As an example, 8 in binary representation is 1000. Using this observation, we can implement an iterative solution. bool isPowerofTwo(unsigned num) { auto one_count = 0; for (auto index = 0; index < ...