Wednesday, November 7, 2012

How to retrieve or extract the tag value from XML in Linux?

How to fetch the tag value for a given tag from a simple XML file?
Let us consider a simple XML file, cust.xml, with customer details as below:
<?xml version="1.0" encoding="ISO-8859-1"?>
The requirement is to retrieve the value of the "CustName" tag from the XML file. Let us see how to get this done in different ways:

1. Using the sed command:
$ sed -n '/CustName/{s/.*<CustName>//;s/<\/CustName.*//;p;}' cust.xml
Using the sed substitution command, everything up to and including the opening tag is deleted. A second substitution removes everything from the closing tag to the end of the line. After this, only the tag value is left in the pattern space, and the p command prints it.
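To try the command end to end, here is a minimal sketch. The file contents are a hypothetical sample: the element names other than CustName, and all of the values, are assumptions made only for illustration. The sample is written to a temporary file so that nothing in the current directory is overwritten.

```shell
# Hypothetical sample data (only the CustName tag comes from the article).
sample=$(mktemp)
cat > "$sample" <<'EOF'
<?xml version="1.0" encoding="ISO-8859-1"?>
<CustDetails>
  <CustName>Unix</CustName>
  <CustomerId>999</CustomerId>
  <Age>28</Age>
</CustDetails>
EOF

# Strip everything up to the opening tag, then everything from the closing tag.
name=$(sed -n '/CustName/{s/.*<CustName>//;s/<\/CustName.*//;p;}' "$sample")
echo "$name"    # prints: Unix
```

The leading indentation before the opening tag is removed as well, because the first substitution deletes from the start of the line.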

2. Another sed way:
$ sed -n '/CustName/{s/.*<CustName>\(.*\)<\/CustName>.*/\1/;p}' cust.xml
The entire line is matched by the regular expression, but only the value part is grouped. Hence, by using the backreference (\1) in the replacement part, only the value is obtained.

Using a variable:
$ x="CustName"
$ sed -n "/$x/{s/.*<$x>\(.*\)<\/$x>.*/\1/;p}" cust.xml
The only difference is the use of double quotes. Since a shell variable is used, double quotes are needed for the variable to be expanded.
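The variable form lends itself to a small reusable function. The sketch below assumes a hypothetical helper named get_tag and a hypothetical sample file. It also assumes the tag name contains no characters that are special to sed, such as / or regular-expression metacharacters.

```shell
# get_tag (hypothetical helper): print the value of tag $1 found in file $2.
# Double quotes let the shell expand $tag before sed sees the script.
get_tag() {
    tag=$1
    file=$2
    sed -n "/<$tag>/{s/.*<$tag>\(.*\)<\/$tag>.*/\1/;p;}" "$file"
}

# Hypothetical sample data, written to a temporary file.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<CustDetails>
  <CustName>Unix</CustName>
  <Age>28</Age>
</CustDetails>
EOF

get_tag CustName "$sample"    # prints: Unix
get_tag Age "$sample"         # prints: 28
```

Matching on "<$tag>" instead of the bare tag name narrows the filter to lines that actually contain the opening tag.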

3. Using awk:
$ awk -F "[><]" '/CustName/{print $3}' cust.xml
Using multiple delimiters (< and >) in awk, the field $3 contains the value in our example. By filtering the lines for CustName, only the tag value is retrieved.
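To see why the value lands in $3, it can help to print every field awk produces for one line. This is only an illustrative sketch with a hypothetical input line. If the opening tag carried attributes, or if several elements shared one line, the value would not necessarily be in $3.

```shell
# Split one hypothetical line on < and > and show each resulting field.
fields=$(printf '%s\n' '<CustName>Unix</CustName>' |
    awk -F '[><]' '{ for (i = 1; i <= NF; i++) printf "$%d=[%s]\n", i, $i }')
echo "$fields"
# prints:
# $1=[]
# $2=[CustName]
# $3=[Unix]
# $4=[/CustName]
# $5=[]
```

The first field is whatever precedes the opening "<" (empty here, or the indentation in an indented file), the second is the tag name, and the third is the value.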

4. Using Perl:
$ perl -ne 'if (/CustName/){ s/.*?>//; s/<.*//;print;}' cust.xml
This Perl solution is somewhat similar to the sed one. Using Perl substitution, the text up to the first ">" is removed, and then everything from the "<" to the end of the line is removed. With this, only the tag value is left.

5. Using the GNU grep command:
$ grep -oP '(?<=<CustName>).*(?=</CustName)' cust.xml
The -o option prints only the matched text instead of the entire line. The -P option enables Perl-like regular expressions, which lets us use Perl-style lookahead and lookbehind pattern matching in grep. This regular expression means: print the text which is preceded (?<=) by "<CustName>" and followed (?=) by "</CustName".
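One thing to watch with this pattern is that .* is greedy. The sketch below uses a hypothetical line holding two CustName elements to compare the greedy form with the non-greedy .*? form. It assumes a GNU grep built with -P support.

```shell
# Hypothetical input: two elements on the same line.
line='<CustName>Unix</CustName><CustName>Linux</CustName>'

# Greedy: .* runs to the last closing tag, so both elements are merged.
greedy=$(printf '%s\n' "$line" | grep -oP '(?<=<CustName>).*(?=</CustName>)')

# Non-greedy: .*? stops at the first closing tag, giving one match per element.
lazy=$(printf '%s\n' "$line" | grep -oP '(?<=<CustName>).*?(?=</CustName>)')

echo "$greedy"    # prints: Unix</CustName><CustName>Linux
echo "$lazy"      # prints Unix and Linux on separate lines
```

When each element sits on its own line, as in a simple file like cust.xml, both forms give the same result.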

Note: The above methods are useful only when the XML is simple. For complex XML, the XML parser modules available in Perl are far more suitable; we will discuss them in a future article.
