Introduction: Basic understanding of XPath and its related concepts

Introduction to XPath:

XPath stands for XML Path Language. It is a query language for traversing an XML document and selecting the required nodes using XPath expressions and XPath functions, which I will discuss in the next chapters. XPath is a World Wide Web Consortium (W3C) Recommendation. This tutorial follows the XPath 2.0 data model; newer versions of the specification (XPath 3.0 and 3.1) have since been published.

The XPath specification is designed to be referenced normatively from other specifications that define a host language for it; it is not intended to be implemented outside a host language. Its implementability has been tested in the context of its normative inclusion in the host languages defined by the XQuery and XSLT specifications.

[Figure: XPath, XQuery and XSLT]

The following XML document is used as the example throughout this chapter:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<dev:Corporation xmlns:dev="http://developprojects.com">
	<dev:Organization>
		<dev:website category="technical">
			<dev:name>developprojects.com</dev:name>
			<dev:topic>XPath Tutorial</dev:topic>
			<dev:author>Viswa Tej Swarup Reddy</dev:author>
			<dev:price>FREE</dev:price>
		</dev:website>
	</dev:Organization>
</dev:Corporation>
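
To give a first feel for selecting nodes, here is a minimal sketch in Python. It assumes the standard library module xml.etree.ElementTree, which implements only a small subset of XPath 1.0 rather than full XPath 2.0. The variable names (XML, NS, topics, site_names) are illustrative and not part of XPath itself.

```python
import xml.etree.ElementTree as ET

# The example document from this chapter, embedded as bytes.
XML = b"""<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<dev:Corporation xmlns:dev="http://developprojects.com">
  <dev:Organization>
    <dev:website category="technical">
      <dev:name>developprojects.com</dev:name>
      <dev:topic>XPath Tutorial</dev:topic>
      <dev:author>Viswa Tej Swarup Reddy</dev:author>
      <dev:price>FREE</dev:price>
    </dev:website>
  </dev:Organization>
</dev:Corporation>"""

# Bind the prefix used in the path expressions to the namespace URI.
NS = {"dev": "http://developprojects.com"}

root = ET.fromstring(XML)  # root is the dev:Corporation element

# Select every dev:topic element anywhere below the root.
topics = [t.text for t in root.findall(".//dev:topic", NS)]

# Select dev:website elements whose category attribute equals "technical",
# then read the text of each one's dev:name child.
technical_sites = root.findall(".//dev:website[@category='technical']", NS)
site_names = [s.find("dev:name", NS).text for s in technical_sites]

print(topics)
print(site_names)
```

Because every element in the example carries the dev: prefix, the path expressions must use a prefix bound to the same namespace URI; an unprefixed path such as .//topic would select nothing.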

 

Common XPath terminology:

The XPath terms you should know before proceeding are:

  1. Nodes
  2. Atomic Values
  3. Items

Nodes:
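Every XML document is modeled as a tree of nodes. The kinds of node you will meet most often are

  1. The document (root) node
  2. Element nodes
  3. Attribute nodes
  4. Text nodes

In the above XML example (the dev: prefix is omitted here for brevity),
<Corporation> is the root element, the single element child of the document node.
<Organization> is an element node. Its start and end tags are markup that delimits the element; tags are not nodes themselves.
<topic>XPath Tutorial</topic> is an element node.
category="technical" is an attribute node.
XPath Tutorial is a text node.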

These nodes can in turn be described by their relationships to one another.

The most important relationships are

  1. Ancestor: A node's ancestors are its parent, its parent's parent, and so on up to the root of the tree. In the above example the ancestor elements of <author> are <website>, <Organization> and <Corporation>.
  2. Descendant: A node's descendants are its children, its children's children, and so on down to the lowest level. In the above example the descendant elements of <Organization> are <website>, <name>, <topic>, <author> and <price>.
  3. Parent: The parent is the node immediately above the selected element or attribute. In the above example <website> is the parent of <name>, <topic>, <author> and <price>.
  4. Child: Children are the nodes immediately below the selected node. In the above example, the child elements of <website> are <name>, <topic>, <author> and <price>.
  5. Siblings: Nodes that share a common parent are called siblings. In the above example, <name>, <topic>, <author> and <price> are siblings because they share the parent <website>.
  6. Text: A text node holds the character content of an element. In the above example, developprojects.com, XPath Tutorial, Viswa Tej Swarup Reddy and FREE are all text nodes.
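
The relationships listed above can be checked with a short Python sketch. It assumes the standard library xml.etree.ElementTree, whose elements keep no link to their parent, so the sketch builds its own child-to-parent map. The helper local() and the variable names are hypothetical, chosen only for this illustration.

```python
import xml.etree.ElementTree as ET

XML = b"""<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<dev:Corporation xmlns:dev="http://developprojects.com">
  <dev:Organization>
    <dev:website category="technical">
      <dev:name>developprojects.com</dev:name>
      <dev:topic>XPath Tutorial</dev:topic>
      <dev:author>Viswa Tej Swarup Reddy</dev:author>
      <dev:price>FREE</dev:price>
    </dev:website>
  </dev:Organization>
</dev:Corporation>"""

NS_URI = "http://developprojects.com"


def local(elem):
    """Strip the '{namespace-uri}' part that ElementTree puts before each tag."""
    return elem.tag.split("}", 1)[-1]


root = ET.fromstring(XML)

# ElementTree stores no parent link, so build a child -> parent map once.
parent_of = {child: parent for parent in root.iter() for child in parent}

author = root.find(".//{%s}author" % NS_URI)
website = parent_of[author]  # parent: the node immediately above

# Ancestors: follow parent links until the root element is reached.
ancestors = []
node = author
while node in parent_of:
    node = parent_of[node]
    ancestors.append(local(node))

# Descendants: every element below Organization, at any depth.
organization = root.find("{%s}Organization" % NS_URI)
descendants = [local(e) for e in organization.iter() if e is not organization]

# Children: the elements immediately below website.
children = [local(e) for e in website]

# Siblings of author: the other children of the same parent.
siblings = [local(e) for e in website if e is not author]

print(ancestors)    # ['website', 'Organization', 'Corporation']
print(descendants)  # ['website', 'name', 'topic', 'author', 'price']
print(children)     # ['name', 'topic', 'author', 'price']
print(siblings)     # ['name', 'topic', 'price']
```

A full XPath processor exposes these relationships directly as axes (ancestor::, descendant::, parent::, child::, following-sibling:: and preceding-sibling::), so no parent map is needed there.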

At the risk of some repetition, nodes can also be classified by kind. XPath defines seven kinds of node:

  1. Document Node or Root Node
  2. Elements
  3. Attributes
  4. Comments
  5. Namespace
  6. Text
  7. Processing Instruction

 

  1. Document Node or Root Node: The document node is the root of the tree and represents the entire document. It is not an element itself: its children are the topmost element (the document element) plus any comments and processing instructions that appear outside that element. In the above example <Corporation> is the document element, and all the other elements lie inside it.
  2. Elements: Everything from an opening tag to its matching closing tag is an element. All elements are contained, directly or indirectly, within the document node.
  3. Attributes: Attributes describe an element and appear in its opening tag. Example: category="technical" states that the website element belongs to the technical category. When two elements have the same name, an attribute can be used to tell them apart.
  4. Comments: Comments are text written in an XML document to describe things to other readers. They are enclosed between <!-- and -->.
  5. Namespace: Different systems may use the same tag name for different purposes, so name conflicts are always possible. To avoid them, a prefix is placed before the tag name and bound to a namespace URI with an xmlns attribute. The syntax is <prefix:element xmlns:prefix="URI">.
  6. Processing Instruction: A processing instruction passes information to the application that reads the document, which may be a browser or any other program. It is written as <?target data?>. The XML declaration on the first line of our example looks like a processing instruction, but it is technically not one and does not appear as a node in the XPath tree. It tells the parser the XML version, the character encoding and whether the document is standalone. In our example, encoding="UTF-8" says that the document is encoded in UTF-8, a Unicode encoding built from 8-bit units, and standalone="yes" says that the document is self-contained. That in turn means one of three things:
    1. There is no DOCTYPE declaration in it.
    2. The DOCTYPE declaration is inline only.
    3. The DOCTYPE declaration is external or combined, but the external part contains no data that changes the infoset representation of the document.
  7. Text: Text nodes hold the character data inside elements. XPath Tutorial is a text node. By contrast, technical is the value of the category attribute, not a text node.
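
The node kinds described above can be observed with a small Python sketch. It assumes the standard library xml.dom.minidom, which follows the DOM rather than the XPath data model; the two are close enough for this purpose, although the DOM has no namespace nodes. The xml-stylesheet processing instruction, its style.xsl target file and the comment are hypothetical additions to a shortened version of the example, included only so that every kind appears at least once.

```python
from xml.dom import Node, minidom

# Shortened example with a hypothetical processing instruction and comment.
XML = b"""<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<?xml-stylesheet type="text/xsl" href="style.xsl"?>
<!-- A comment for human readers -->
<dev:Corporation xmlns:dev="http://developprojects.com">
  <dev:website category="technical">
    <dev:topic>XPath Tutorial</dev:topic>
  </dev:website>
</dev:Corporation>"""

NS_URI = "http://developprojects.com"

# Readable labels for the DOM node-type constants.
KIND = {
    Node.DOCUMENT_NODE: "document",
    Node.ELEMENT_NODE: "element",
    Node.ATTRIBUTE_NODE: "attribute",
    Node.TEXT_NODE: "text",
    Node.COMMENT_NODE: "comment",
    Node.PROCESSING_INSTRUCTION_NODE: "processing-instruction",
}

doc = minidom.parseString(XML)

# The document node is the root of the tree, above the topmost element.
root_kind = KIND[doc.nodeType]

# Children of the document node. The XML declaration is not among them.
top_level = [KIND[n.nodeType] for n in doc.childNodes]

website = doc.getElementsByTagNameNS(NS_URI, "website")[0]
category = website.getAttributeNode("category")
topic = doc.getElementsByTagNameNS(NS_URI, "topic")[0]
topic_text = topic.firstChild  # the text node inside <dev:topic>

print(root_kind)                                    # document
print(top_level)                                    # ['processing-instruction', 'comment', 'element']
print(KIND[category.nodeType], category.value)      # attribute technical
print(KIND[topic_text.nodeType], topic_text.data)   # text XPath Tutorial
```

The list of top-level children contains the processing instruction, the comment and the document element, but nothing for the <?xml ...?> line, which the parser consumes as the XML declaration.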

Properties of Nodes:
The following are the various types of Properties that a node can have:

  1. Name: Elements, attributes, namespaces and processing instructions can have a name. The name of an element or attribute may combine a prefix and a local name, and the prefix is bound to a namespace URI. Standard functions extract these parts: name() returns the qualified name (prefix and local name), local-name() returns the local name and namespace-uri() returns the namespace URI. Example: for <dev:Corporation>, dev is the prefix, Corporation is the local name and http://developprojects.com is the namespace URI.
  2. String value: Every node has a string value. The string() function returns it; for an element, it is the concatenation of all the text inside that element.
  3. Base URI
  4. Attributes
  5. Namespace
  6. Parent
  7. Children
  8. Type Annotation
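
The name and string-value properties can be approximated in Python as follows. This is a sketch that assumes the standard library xml.etree.ElementTree, which stores each element name as '{namespace-uri}local-name' and discards the prefix. The helpers split_name() and string_value() are hypothetical stand-ins for the XPath functions namespace-uri(), local-name() and string(); they are not part of any library.

```python
import xml.etree.ElementTree as ET

# Shortened version of the chapter's example document.
XML = b"""<dev:Corporation xmlns:dev="http://developprojects.com">
  <dev:website category="technical">
    <dev:topic>XPath Tutorial</dev:topic>
    <dev:price>FREE</dev:price>
  </dev:website>
</dev:Corporation>"""


def split_name(elem):
    """Return (namespace URI, local name) from ElementTree's '{uri}local' tag."""
    if elem.tag.startswith("{"):
        uri, local = elem.tag[1:].split("}", 1)
        return uri, local
    return "", elem.tag  # element is in no namespace


def string_value(elem):
    """Concatenate all text inside the element, as string() does for elements."""
    return "".join(elem.itertext())


root = ET.fromstring(XML)
uri, local = split_name(root)

website = root[0]
topic = website[0]

print(uri)                            # http://developprojects.com
print(local)                          # Corporation
print(string_value(topic))            # XPath Tutorial
print(string_value(website).split())  # ['XPath', 'Tutorial', 'FREE']
```

The string value of <website> also contains the whitespace used for indentation between its children, which is why the last line splits it into words before printing.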

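Atomic values: Atomic values are single values of an atomic type such as xs:string, xs:integer or xs:boolean. They are not nodes, so they have no parent or child nodes. They resemble the content of text nodes: the strings developprojects.com, XPath Tutorial, Viswa Tej Swarup Reddy and FREE, taken on their own, are all atomic values.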

Items: In XPath 2.0 the result of every expression is a sequence of zero or more items. An item is either a node or an atomic value. Unlike an atomic value, an item that is a node may have a parent and children.

 

References:

http://www.w3.org/TR/2014/WD-xpath-31-20140424/

http://en.wikipedia.org/wiki/XPath_2.0

http://www.codingforums.com/xml/39839-xml-standalone-when-why.html

http://www.w3schools.com/XPath/xpath_nodes.asp

http://www.w3.org/TR/REC-xml/#vc-check-rmd