Description
Given a line consisting of several TD elements, where
  • each TD element begins with a start tag <td ... > and
  • ends with an end tag </td>,
we want to extract the contents of the third TD element whose `class' attribute is set to `decimal'.
Raw Input
(the following data is all on ONE LINE; it is wrapped here only for display)
<tr class=c0><td class=left><img class=padleft src=America.gif title=USD>USD</td>
<td class=decimal>31.80000</td><td class=decimal>32.34200</td><td class=decimal>32.10000</td>
<td class=decimal>32.20000</td><td class=link colspan=1><a href=url1>Inquiry-1</a></td>
<td class=link colspan=1><a target='_blank' href=url_2>Inquiry-2</td>
Desired Output
32.10000
Script and Comments
Script1
[ 1] s/<td class=decimal>/&\n/3
[ 2] s/^[^\n]*\n//
[ 3] s/<\/td>.*$//
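A minimal sketch of running the three commands above against the sample line. It assumes GNU sed, where \n in the replacement of step [1] and [^\n] in step [2] both mean a newline character; other sed implementations may behave differently. The variable names line and result are only illustrative.

```shell
# Sample input from "Raw Input", joined into a single line.
line=$(cat <<'EOF'
<tr class=c0><td class=left><img class=padleft src=America.gif title=USD>USD</td><td class=decimal>31.80000</td><td class=decimal>32.34200</td><td class=decimal>32.10000</td><td class=decimal>32.20000</td><td class=link colspan=1><a href=url1>Inquiry-1</a></td><td class=link colspan=1><a target='_blank' href=url_2>Inquiry-2</td>
EOF
)

# Assumption: GNU sed. Each -e is one step of Script1, applied in order.
result=$(printf '%s\n' "$line" | sed \
    -e 's/<td class=decimal>/&\n/3' \
    -e 's/^[^\n]*\n//' \
    -e 's/<\/td>.*$//')

echo "$result"
```

With the sample line, this should print the value shown under Desired Output.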
Comments
  1. Step [1] inserts a newline character between the start tag and the contents of the desired element. The trailing 3 flag restricts the substitution to the third match on the line.
  2. Step [2] deletes everything from the beginning of the line up to and including the newline character.
  3. Step [3] deletes everything from the end tag of the desired element to the end of the line. After Step [2], the first </td> on the line is that end tag.
  4. One might think that Steps [2] and [3] can be combined into
    s/^[^\n]*\n\(.*\)<\/td>.*$/\1/. But because * is greedy, the group \(.*\) matches as long a string as possible: it extends to the last </td> on the line, so in this case it captures
    32.10000</td><td ...Inquiry-2.
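The greedy behaviour can be checked with the sketch below, again assuming GNU sed. It also tries one possible workaround that is not part of Script1: replacing .* inside the group with [^<]*, which cannot run past the next tag. That workaround assumes the cell contents never contain a < character. The variable names greedy and fixed are illustrative.

```shell
# Sample input from "Raw Input", joined into a single line.
line=$(cat <<'EOF'
<tr class=c0><td class=left><img class=padleft src=America.gif title=USD>USD</td><td class=decimal>31.80000</td><td class=decimal>32.34200</td><td class=decimal>32.10000</td><td class=decimal>32.20000</td><td class=link colspan=1><a href=url1>Inquiry-1</a></td><td class=link colspan=1><a target='_blank' href=url_2>Inquiry-2</td>
EOF
)

# Combined form: \(.*\) is greedy and stops only at the LAST </td>.
greedy=$(printf '%s\n' "$line" | sed \
    -e 's/<td class=decimal>/&\n/3' \
    -e 's/^[^\n]*\n\(.*\)<\/td>.*$/\1/')

# Hypothetical workaround: [^<]* stops at the first "<" after the newline,
# so the group holds only the cell contents (assumes no "<" inside the cell).
fixed=$(printf '%s\n' "$line" | sed \
    -e 's/<td class=decimal>/&\n/3' \
    -e 's/^[^\n]*\n\([^<]*\)<\/td>.*$/\1/')

echo "greedy: $greedy"
echo "fixed:  $fixed"
```

The three-step version in Script1 avoids the problem without this assumption, because Step [3] matches the leftmost </td>.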