Description
Given a file whose lines have the form <a href=...>KEY</a>,
keep only the first line of each run of consecutive lines that share the same key.
Raw Input
<a href="ind0001.html#i616">Blob</a>
<a href="ind0002.html#i5">Blob</a>
<a href="ind0004.html#i3546">Doe</a>
<a href="ind0003.html#i3556">Doe</a>
<a href="ind0001.html#k100">Newton</a>
<a href="ind0007.html#j331">Martin</a>
<a href="ind0009.html#j2479">Martin</a>
<a href="ind0008.html#l779">Martin</a>
Desired Output
<a href="ind0001.html#i616">Blob</a>
<a href="ind0004.html#i3546">Doe</a>
<a href="ind0001.html#k100">Newton</a>
<a href="ind0007.html#j331">Martin</a>
Script and Comments
Script1
[ 1] :loop
[ 2] $!N
[ 3] />([^<]*)<.*\n.*>\1<.*/!{
[ 4] P
[ 5] D
[ 6] }
[ 7] s/\n.*//
[ 8] b loop
Comments
  1. Either run GNU sed with the `-r' option (extended regular expressions), or escape the parentheses in Step [3] as \( and \).
  2. The Pattern Space is abbreviated to `PS'.
  3. The script uses the following approach:
    • The line in PS is the first line of some key.
    • A loop is required to bypass the remaining lines of the same key.
    • The first line of some key will be printed only when the first line of another key is read or the end of the file is reached.
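  4. Steps [1] thru [8] constitute the loop mentioned above:
    • Step [2] appends the next line to PS, unless the current line is the last one.
    • If the appended line is the first line of another key (the regular expression in Step [3] does not match), Step [4] prints the first line of the current key, then Step [5] deletes it and starts a new cycle with the remaining line. The same happens at the end of the file, where nothing is appended.
    • Otherwise, Step [7] deletes the line just read, then Step [8] makes sed jump back to Step [1].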
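To try Script1 directly, the step numbers have to be dropped first. The following is a minimal sketch, assuming GNU sed and a POSIX shell; the variable names (input, script1_ere, script1_bre, out_ere, out_bre) are made up for the illustration. It runs the script once with `-r' and once with escaped parentheses, the two alternatives named in comment 1.

```shell
#!/bin/sh
# Sample data copied from the raw input in the Description.
input='<a href="ind0001.html#i616">Blob</a>
<a href="ind0002.html#i5">Blob</a>
<a href="ind0004.html#i3546">Doe</a>
<a href="ind0003.html#i3556">Doe</a>
<a href="ind0001.html#k100">Newton</a>
<a href="ind0007.html#j331">Martin</a>
<a href="ind0009.html#j2479">Martin</a>
<a href="ind0008.html#l779">Martin</a>'

# Script1 with the step numbers removed (extended regexp, needs -r).
script1_ere=':loop
$!N
/>([^<]*)<.*\n.*>\1<.*/!{
P
D
}
s/\n.*//
b loop'

# Same script for basic regexps: only the parentheses are escaped.
script1_bre=':loop
$!N
/>\([^<]*\)<.*\n.*>\1<.*/!{
P
D
}
s/\n.*//
b loop'

out_ere=$(printf '%s\n' "$input" | sed -r "$script1_ere")
out_bre=$(printf '%s\n' "$input" | sed "$script1_bre")

# Show the result of the -r run.
printf '%s\n' "$out_ere"
```

Both runs should give the four lines of the desired output. In a file-based setup the same script text would go into a file and be passed with `-f'.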
Script2
[ 1] :loop
[ 2] $!N
[ 3] />([^<]*)<.*\n.*>\1<.*/s/\n.*//
[ 4] t loop
[ 5] P
[ 6] D
Comments
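  1. A neater version of Script1. Step [3] removes the appended line when it carries the same key as the first line in PS, and Step [4] jumps back to Step [1] only if that substitution succeeded.
  2. Otherwise, Steps [5] and [6] print the first line of the key and start a new cycle, as in Script1.
  3. As with Script1, use the `-r' option of GNU sed or escape the parentheses in Step [3].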
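Script2 can be tried the same way. This is a minimal sketch, assuming GNU sed and a POSIX shell; the names input, script2, out and out_mixed are hypothetical. The second run uses a made-up three-line input to show that only consecutive lines of the same key are collapsed.

```shell
#!/bin/sh
# Sample data copied from the raw input in the Description.
input='<a href="ind0001.html#i616">Blob</a>
<a href="ind0002.html#i5">Blob</a>
<a href="ind0004.html#i3546">Doe</a>
<a href="ind0003.html#i3556">Doe</a>
<a href="ind0001.html#k100">Newton</a>
<a href="ind0007.html#j331">Martin</a>
<a href="ind0009.html#j2479">Martin</a>
<a href="ind0008.html#l779">Martin</a>'

# Script2 with the step numbers removed.
# "t loop" branches only if the s command on the previous line succeeded.
script2=':loop
$!N
/>([^<]*)<.*\n.*>\1<.*/s/\n.*//
t loop
P
D'

out=$(printf '%s\n' "$input" | sed -r "$script2")
printf '%s\n' "$out"

# Made-up input: the two Blob lines are not adjacent, so both are kept.
mixed='<a href="x.html#1">Blob</a>
<a href="x.html#2">Doe</a>
<a href="x.html#3">Blob</a>'
out_mixed=$(printf '%s\n' "$mixed" | sed -r "$script2")
```

The first run should reproduce the desired output; the second should return its input unchanged.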