Description
  • A mail message consists of header fields (`headers' for short), optionally followed by a message body.
  • Each header may span one or more lines; the second and subsequent lines of a header must be indented with spaces or tabs.
  • The message body is separated from the preceding headers by an empty line.
We want to extract every header, including its continuation lines, that matches a given regular expression (^(received|subject): in this example).
Raw Input
From sed-users@yahoogroups.com  Sun May  9 14:52:11 2004
Return-Path: 
Received: from n11.grp.scd.yahoo.com (n11.grp.scd.yahoo.com [66.218.66.66])
	by main.rtfiber.com.tw (8.11.6/8.11.6) with SMTP id i170Lq809415
	for ; Sat, 7 Feb 2004 08:21:52 +0800
Received: (qmail 74534 invoked from network); 7 Feb 2004 00:21:52 -0000
To: sed-users@yahoogroups.com
Received: from unknown (HELO n17.grp.scd.yahoo.com) (66.218.66.72)
  by mta2.grp.scd.yahoo.com with SMTP; 7 Feb 2004 00:21:51 -0000
Subject: Hello!
From: "john_vdv" 

Welcome to the world of Regular Expressions!
Desired Output
Received: from n11.grp.scd.yahoo.com (n11.grp.scd.yahoo.com [66.218.66.66])
	by main.rtfiber.com.tw (8.11.6/8.11.6) with SMTP id i170Lq809415
	for ; Sat, 7 Feb 2004 08:21:52 +0800
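Received: (qmail 74534 invoked from network); 7 Feb 2004 00:21:52 -0000
Received: from unknown (HELO n17.grp.scd.yahoo.com) (66.218.66.72)
  by mta2.grp.scd.yahoo.com with SMTP; 7 Feb 2004 00:21:51 -0000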
Subject: Hello!
Script and Comments
Script1
[ 1] :loop
[ 2] N
[ 3] /\n[ \t]+[^\n]*$/b loop
[ 4] h
[ 5] s/\n[^\n]*$//
[ 6] /^(received|subject): /Ip
[ 7] x
[ 8] s/^.*\n//
[ 9] /^$/q
[10] b loop
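Script1 uses extended regular expression syntax (`+' and `(...|...)') and produces output only through `p', so one plausible way to run it with GNU sed is with the options -E (or the older spelling -r) and -n. The text does not say how the script is invoked, so these options are an assumption. The file names headers.sed and mail.txt and the short sample message are made up for illustration.

```shell
# Script1 without the step numbers, saved to a file (the name is arbitrary).
cat > headers.sed <<'EOF'
:loop
N
/\n[ \t]+[^\n]*$/b loop
h
s/\n[^\n]*$//
/^(received|subject): /Ip
x
s/^.*\n//
/^$/q
b loop
EOF

# A small hypothetical message: four headers (one of them folded onto a
# second line with a tab), an empty line, then the body.
printf 'To: list@example.org\nReceived: from a.example\n\tby b.example\nSubject: Hello!\nFrom: someone\n\nBody text\n' > mail.txt

# -n suppresses automatic printing; -E enables extended regular expressions.
out=$(sed -n -E -f headers.sed mail.txt)
printf '%s\n' "$out"
```

With this sample, the folded Received header (both lines) and the Subject header are printed; the To and From headers and the body are not.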
Comments
  1. Once the first line of a header has been read into the pattern space (PS), we have to check whether that header continues on further lines. A loop is therefore needed to append the following lines to PS until the first line of the next header is reached. Steps [1] through [3] form this loop:
    • Command `N' in Step [2] appends the next line to PS.
    • If the line read by `N' begins with spaces or tabs, i.e. PS matches \n[ \t]+[^\n]*$, it is part of the current header. In this case, Step [3] makes sed jump back to Step [1] for another iteration of the loop.
    • Otherwise,
      • PS now contains one complete header followed by the first line of the next header. To operate on the recognized header, we have to remove that extra line, but it must be kept somewhere for further processing. Step [4] copies the whole PS (the current header together with that line) to the hold space (HS); Step [7] retrieves it later.
      • Step [5] removes the first line of the next header from PS.
      • Now PS contains only the current header. To match case-insensitively, the GNU extension `I' must be used in Step [6]. If the current header matches the given RE, ^(received|subject): in this example, command `p' prints it.
  2. Then, we have to process the saved line.
    • Step [7] exchanges PS and HS, which brings the saved line, together with the already processed header, back into PS; Step [8] then removes that header.
    • If nothing remains, the empty line separating the header fields from the message body has been reached. Since the message body is of no interest, Step [9] terminates sed.
    • Otherwise, Step [10] makes sed jump to Step [1].
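To watch the pattern space change through Steps [1] to [8], the sketch below inserts the `l' command, which prints PS unambiguously (`\n' for an embedded newline, `\t' for a tab, `$' at the end). It assumes GNU sed with the options -n and -E. The three-line input, a folded header A followed by a header B, is made up for illustration, and the matching and looping steps [6], [9] and [10] are left out.

```shell
# Hypothetical input: header "A" folded onto a second line, then header "B".
trace=$(printf 'A: 1\n\tmore\nB: 2\n' | sed -n -E '
:loop
N
/\n[ \t]+[^\n]*$/b loop
# PS: complete header A plus the first line of header B
l
h
s/\n[^\n]*$//
# PS: header A only (this is what Step [6] would test)
l
x
s/^.*\n//
# PS: the saved first line of header B
l
')
printf '%s\n' "$trace"
```

The three lines printed are `A: 1\n\tmore\nB: 2$', `A: 1\n\tmore$' and `B: 2$', showing in turn what PS holds after the loop, after Step [5], and after Step [8].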