Monday, June 11, 2012

10 Tips to Improve the Performance of Shell Scripts



   If writing good shell scripts is a skill, writing high-performance shell scripts is an art. Shell scripts are any Unix admin's asset: they are major time savers for repetitive activities across Unix flavors. When a shell script is written, the developer's main goal, by default, is to make the script run and, of course, give the right result. As long as these scripts are intended for user-account activities that do not involve huge data, it does not matter how the result is arrived at. However, the same cannot be said of shell scripts meant to be executed on production servers, where huge files hold millions of customer records.
 
     A shell script can give the right result, but if that result does not arrive in a reasonable time, it is an issue, and yes, it can become a nightmare for the support staff. Every second is important in a production environment; shaving even 10 seconds off a script can be a real improvement depending on the environment. In this article, we will discuss some important things a shell script developer should keep in mind in order to write better-performing shell scripts.

1. Better file handling: Knowingly or unknowingly, a lot of files are created or deleted in a shell script. When a large number of files is involved, how they are handled becomes very important. Even a simple echo statement that redirects output to a file has to open the file, write the data into it and close the file. Let us look at an example:
#!/usr/bin/sh

cnt=1
while [ $cnt -le 100 ]
do
        echo $cnt >> file
        let cnt=cnt+1
done
     This script is simple to comprehend. Inside a while loop, the counter is written, or rather appended, to the file at every increment, and the loop runs 100 times. Each time, the output file is opened, the data is written and the file is closed, and yes, this happens 100 times. In practical cases, where the loop runs over millions of records, the elapsed time adds up enormously.

   Now, look at the example below which is an improved version of the above:
#!/usr/bin/sh

cnt=1
while [ $cnt -le 100 ]
do
        echo $cnt
        let cnt=cnt+1
done > file
    Instead of redirecting to the file on every iteration, the redirection is applied once, to the while loop as a whole. The file is now opened once when the loop starts, every echo writes to that single open file descriptor, and the file is closed once when the loop finishes. Compare this with the earlier version and imagine the performance improvement.
Tip: Whenever you have a print statement in a loop, try checking whether they can be put in a better place.
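If you want to measure the gap yourself, a rough benchmark along these lines can be run (a sketch: the loop count of 5000 and the /tmp file names are arbitrary choices, and the timings will vary by system):

```shell
#!/bin/bash
# Rough benchmark: redirecting inside the loop vs. once after the loop.

rm -f /tmp/perf_append.txt /tmp/perf_once.txt

# Version 1: the output file is opened and closed on every iteration.
time {
    cnt=1
    while [ $cnt -le 5000 ]
    do
        echo $cnt >> /tmp/perf_append.txt
        let cnt=cnt+1
    done
}

# Version 2: the output file is opened once for the whole loop.
time {
    cnt=1
    while [ $cnt -le 5000 ]
    do
        echo $cnt
        let cnt=cnt+1
    done > /tmp/perf_once.txt
}
```

Both versions produce identical files; version 1 just performs 5000 extra open/close cycles to get there.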

2. Use ONLY the necessary commands inside a loop: In other words, do not use commands inside a loop that are not needed there. Sounds like common sense, right? The point is: do not put commands inside a loop that could very well sit outside it. Let us consider the example below:
$ cat test.sh
#!/usr/bin/sh

cnt=1
for i in `cat file`
do
        DT=`date '+%Y%m%d'`
        FILE=${i}_${DT}
        echo $cnt > $FILE
        let cnt=cnt+1
done
   What the script does is: a loop is run over the contents of file. For every entry in the file, a new filename is prepared, which is simply the entry concatenated with the date, and a value is written to this new file. It looks simple and without any issues. No issues? Look at it again.

  Why is the date command inside the loop? The date command fetches the year, month and day, which are going to remain the same throughout the run (unless the script runs across midnight, which is hardly a genuine requirement here). The date command could have served the purpose equally well outside the loop. If the input file the loop runs over contains thousands of records, the date command runs thousands of times when we actually wanted it only once: a huge performance issue. The improvement here is simply to move the line containing the date command to before the for loop.
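Here is the improved version with the date command hoisted out of the loop (a sketch; the printf line creating a sample input file is added only so it runs standalone):

```shell
#!/bin/bash
# Sample input file, added so the sketch is self-contained.
printf 'Solaris\nLinux\nAIX\n' > file

# The date is fetched once, before the loop, instead of once per record.
DT=`date '+%Y%m%d'`

cnt=1
for i in `cat file`
do
        FILE=${i}_${DT}
        echo $cnt > $FILE
        let cnt=cnt+1
done
```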

  Tip : Always make it a point to have only the relevant commands at the right places.

3. One file to multiple files? Walk over the entire script: Whenever we have a requirement to write a script that will run on lots of files, the developer's mindset is to write the script and make it work for one file first. Once it works with one file, it is made to work for multiple files, either by adding a loop or by passing command-line arguments. Take the earlier example: the developer could well have written the lines inside the loop with the variable "i" hardcoded and, once that worked, simply enclosed the code in a for loop. Because of this approach, the date command stayed inside the loop. Many performance issues happen for exactly this reason.
  Tip : Always make it a point to walk over the entire script the moment a loop is put around an existing set of code. You will be surprised to see how many lines of code are irrelevant inside the loop.

4. Best option vs any option: The beauty of Unix and its flavors is that they give multiple options to achieve almost anything. Regular readers of this blog will have come across umpteen articles in which we explained the different ways a particular output can be achieved.
    Let us consider a file. The requirement is to parse the file and read the two values on each line into two different variables:
$ cat file
Solaris 25
Linux 21
AIX 40
    Approach 1: Using the while loop:
$cat test.sh
#!/bin/bash

while read line
do
        OS=`echo $line | awk '{print $1}'`
        VALUE=`echo $line | awk '{print $2}'`
        echo "OS : $OS"
        echo "VALUE: $VALUE"
done < file
     The while loop reads every line into a variable, and then awk extracts each field and stores it separately into the two variables, OS and VALUE. Frankly speaking, we wasted two awk commands here; awk was not needed at all. Check out the next approach.

 Approach 2: Using the same while loop, but without awk:
$cat test.sh
#!/bin/bash

while read OS VALUE
do
        echo "OS : $OS"
        echo "VALUE: $VALUE"
done < file
    As seen in one of our articles, the while loop has all the properties needed to read a text or CSV file efficiently, as shown above, and hence awk is not needed at all. Both the above methods give the right result. Which is better? Very simple: the second one, because we achieved the result with far fewer commands.

   Tip : Whenever, you achieve a result with a series of commands piped to each other, try looking for different options to see whether it can be improved.

Note: This does not always mean that the version with more commands will take longer than the one with fewer commands. If the single-command option is written poorly, it can mess things up just as badly.

5. Wherever possible, prefer internal commands: In one of our articles, we saw the difference between internal and external commands. Internal commands are built into the shell, which executes them without creating a new process, whereas every external command requires a new process. Because of this, internal commands are always much faster than external commands.
 Example 1 to find the length of a string:
$ x="welcome"
$ expr $x : '.*'
7
$ echo ${#x}
7
Two different commands are used: one using expr, which is an external command, the other using echo, which is an internal command. The echo version will perform much better than the expr version.
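The same contrast shows up in arithmetic: expr is an external command that forks a process for every calculation, whereas the shell's own arithmetic expansion is internal (a small sketch):

```shell
#!/bin/bash
# External command: a new process is created for every expr call.
sum=`expr 10 + 20`
echo $sum       # prints 30

# Internal arithmetic expansion: no process is created.
sum=$((10 + 20))
echo $sum       # prints 30
```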
 Example 2 to read a file line by line in Shell:  
 Option 1:
$ cat file | while read line
> do
>  echo $line
> done
  Option 2:
$ while read line
> do
>   echo $line
> done < file
  In the first option, we use the cat command, an external command, and pipe its output to the while loop. In the second option, everything is internal: the file is read through input redirection on the loop's file descriptor, with no extra process created.
   Tip : Always prefer internal commands if possible.

6. Avoid Useless use of any command: There is a popular term known as UUC which stands for Useless use of cat. This means using cat command when actually it is not needed at all. For example:
$ cat file | grep Linux
  This command could very well have been 'grep Linux file'. This is called UUC. Actually, if you look a little carefully, we use many other commands in equally useless ways, and we will come across many such instances. Let us look at another example of the same kind:
$ grep Linux file | awk '{print $2}'
could very well have been:
$ awk '/Linux/{print $2}' file
   Never use commands that are not needed in the first place. Having said that, spotting them gets easier as you get exposed to the many other commands available in the Unix flavor you are working on.
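The same habit applies beyond cat. Two more instances of the pattern (a sketch; the sample file /tmp/uuc_demo.txt is created only for the demonstration):

```shell
#!/bin/bash
# Sample file for the demonstration.
printf 'Linux 21\nSolaris 25\nLinux 30\n' > /tmp/uuc_demo.txt

# Useless use of cat: an extra process and a pipe for nothing.
cat /tmp/uuc_demo.txt | wc -l
wc -l < /tmp/uuc_demo.txt        # same count, one process

# Useless use of wc: grep can count matches itself.
grep Linux /tmp/uuc_demo.txt | wc -l
grep -c Linux /tmp/uuc_demo.txt  # same count, one process
```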

  Tip : Whenever you try with any command at the command prompt, be it for any small activity, always look for many different ways in which it can be achieved. This way you will get exposed to different options to achieve with which you can choose the right and the better option.

7. Achieve more with less: Say, you want to have 3 variables one containing the year, next containing the current month and the third containing the day. So, the commands for it could be:
$ year=`date '+%Y'`
$ mon=`date '+%m'`
$ dt=`date '+%d'`
 Well, we used 3 date commands which, actually, could very well have been a single date command.
$ DATE=`date '+%Y%m%d'`
$ year=${DATE:0:4}
$ mon=${DATE:4:2}
$ day=${DATE:6:2}
  See the difference!!! All we did is get the year, month and day in one date command, and then use the shell's substring expansion to extract them into different variables. Now we have one external date command and 3 internal substring operations. Imagine the performance improvement this brings when it runs many times in a script.
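Note that ${DATE:0:4} is bash/ksh93 substring expansion. On shells without it, the same one-date-call idea still works by letting the internal set builtin split the output into positional parameters (a sketch):

```shell
#!/bin/sh
# One external date call; set -- splits its output internally
# into the positional parameters $1, $2 and $3.
set -- `date '+%Y %m %d'`
year=$1
mon=$2
day=$3
echo "$year $mon $day"
```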

Tip: While using a command more than once, check for the possibility of getting the result with least number of commands.

8. Do not always use ls for file listing: Keep in mind, ls is not the only way to list files. Just do an "echo *" at the prompt and see what happens! Say we want to process many files in a loop, for example all the .txt files:
Option 1:
#!/usr/bin/sh

for file in `ls *.txt`
do
     echo $file
done
Option 2:
#!/usr/bin/sh

for file in  *.txt
do
     echo $file
done
   As you just saw, ls was not needed at all; the shell's own filename expansion does the listing. The shell has many such properties of its own, which we can discuss in some other thread. But when it comes to ls, use it only if it is really needed.
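A bonus of the glob version: it copes with file names containing spaces, which the ls form mangles, as long as the variable is quoted (a sketch; the /tmp/glob_demo directory and file names are made up for the demonstration):

```shell
#!/bin/bash
# Set up a directory containing an awkward file name.
mkdir -p /tmp/glob_demo
cd /tmp/glob_demo
touch plain.txt "with space.txt"

# The glob expands to whole file names; quoting keeps them intact.
for file in *.txt
do
        echo "$file"
done
```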

  Tip: Use ls only if needed.

9. sed/awk parse the entire file by default: Say you want to print the first 2 lines of a file using sed:
$ sed -n '1,2p' file
     This command starts reading the file. It prints the first line, prints the second line, and from the 3rd line onwards it keeps reading but prints nothing. The point here is: sed still reads the entire file. Assume a huge file with millions of records in it: to print 2 lines, the command reads the whole file, which is bad for performance.
   So, the solution for this is below:
$ sed  '2q' file
     This command prints lines as it reads them and, immediately after printing the 2nd line, quits and comes out, so the rest of the file is never read. The same idea applies to awk as well. For the same requirement, the better-performing awk version is:
$ awk 'NR==3{exit;}1' file
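A quick way to convince yourself that both sed forms print the same thing (a sketch using a small generated file):

```shell
#!/bin/bash
# A 5-line sample file.
seq 1 5 > /tmp/sed_demo.txt

sed -n '1,2p' /tmp/sed_demo.txt   # reads all 5 lines, prints 1 and 2
sed '2q' /tmp/sed_demo.txt        # prints 1 and 2, then stops reading
```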
Tip: Always make sure you process only what is needed.

10. Using the right conditions at the right places in AND or OR: In shell scripting too, we have logical AND and OR conditions. In an AND involving 2 conditions, if the first condition is not true, the second is not even evaluated, since the result is going to be false anyway. So always put the condition with the higher chance of failing first in an AND; that way, the second condition is skipped most of the time. An OR condition is just the reverse: always put the condition with the highest chance of succeeding first.
Tip: Choose the appropriate positions for placing conditions.
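A sketch of both cases (the file and directory names /tmp/app.log and /tmp/reports are hypothetical):

```shell
#!/bin/bash
# AND: the cheap existence test goes first. If the file is missing,
# the potentially expensive grep over a huge log never runs.
if [ -f /tmp/app.log ] && grep -q ERROR /tmp/app.log
then
        echo "errors found"
fi

# OR: the likely-to-succeed test goes first. mkdir only runs
# when the directory does not already exist.
[ -d /tmp/reports ] || mkdir /tmp/reports
```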
