Description
Given a data file in which:
  • Each line is of the form Key = Value.
  • There may be several records with a given key.
  • If several records share a key, for example Key = Value1, Key = Value2, Key = Value3,
    we want to combine them into the single line Key = Value1, Value2, Value3, with the values separated by commas and listed in the order in which they appear in the file.
Raw Input:
  key1=a1
  key3=c1
  key2=b1
  key1=a2
  key2=b2
  key4=d1
  key4=d2
  key3=c2
  key2=b3
  key3=c3
  key1=a3
  key4=d3

Desired Output:
  key1=a1,a2,a3
  key3=c1,c2,c3
  key2=b1,b2,b3
  key4=d1,d2,d3
Script and Comments
Script1
[ 1] G
[ 2] s/^([^=]+=)([^\n]*)((\n[^\n]*)*\1[^\n]*)/\3,\2/
[ 3] t br0
[ 4] s/^([^\n]*)\n(.*)/\2\n\1/
[ 5] :br0
[ 6] s/^\n//
[ 7] $q
[ 8] x
[ 9] d
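As a rough sketch of how the script might be run, the snippet below feeds the sample input through it with the step numbers stripped. It assumes GNU sed (for -r and for \n inside bracket expressions); the shell variable names script, input and result are only illustrative.

```shell
# Script1 without the [n] step labels (assumes GNU sed).
script='G
s/^([^=]+=)([^\n]*)((\n[^\n]*)*\1[^\n]*)/\3,\2/
t br0
s/^([^\n]*)\n(.*)/\2\n\1/
:br0
s/^\n//
$q
x
d'

# The raw input from the description above.
input='key1=a1
key3=c1
key2=b1
key1=a2
key2=b2
key4=d1
key4=d2
key3=c2
key2=b3
key3=c3
key1=a3
key4=d3'

# -r enables the extended regular expressions the script relies on.
result=$(printf '%s\n' "$input" | sed -r "$script")
printf '%s\n' "$result"
```

With the script saved in a file, sed -r -f followed by the script file name and the data file name should behave the same way.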
Comments (the script must be run with sed -r, i.e. with extended regular expressions)
  1. This script uses the hold space (HS) as a buffer that keeps the processed records; adjacent records are separated by a newline character.
  2. After a line has been read into the pattern space (PS), the action taken depends on whether it is the first record with that key:
    • if it is the first record, it is appended to the end of the buffer;
    • otherwise, its value is appended to the buffered record with the same key.
  3. Step [1] appends the contents of the buffer (HS) to PS, separating the current record (line) from the buffer part with a newline character.
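The effect of Step [1] can be observed in isolation with a small experiment like the following. It is only a sketch: the first line is placed in HS with `h' to stand in for the buffer, and `l' prints PS unambiguously (a newline is shown as \n and the end of PS as $). The variable name ps is illustrative.

```shell
# Line 1 plays the role of the buffer: copy it to HS (-n suppresses printing).
# On line 2, G appends a newline plus the contents of HS to PS.
ps=$(printf 'key1=a1\nkey1=a2\n' | sed -n -e '1h' -e '2G' -e '2l')
printf '%s\n' "$ps"
# prints: key1=a2\nkey1=a1$
```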
  4. If the key of the current record already exists in the buffer part, PS looks like
    key_c=v_n \n key_1=v_1 \n key_2=v_2 ... \n key_c=v_c ..., which can be interpreted as
    • the current record, followed by zero or more occurrences of a (newline, record) pair,
    • then a record with the same key.
    In this case, ^([^=]+=)([^\n]*)((\n[^\n]*)*\1[^\n]*) matches, where
    • ([^=]+=) matches the key of the current record, including the `=',
    • ([^\n]*) matches the value of the current record,
    • (\n[^\n]*)* matches zero or more occurrences of a (newline, record) pair,
    • \1[^\n]* matches the buffered record that has the same key.
    Steps involved are:
    • `s' of Step [2] will succeed: the replacement \3,\2 drops the current record from the front of PS and appends its value, after a comma, to the buffered record with the same key,
    • `t' of Step [3] will make sed jump to Step [5],
    • Step [6] removes the newline character generated by Step [1], which is now at the start of PS.
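To see Steps [2] and [6] at work on a hand-built pattern space, one could try something like the sketch below. It assumes GNU sed; the N loop is just a convenient way to get "current record, newline, buffer" into PS for the experiment and is not part of Script1.

```shell
# Build PS = "key1=a2\nkey1=a1\nkey3=c1" by joining all input lines,
# then apply Step [2], show PS, apply Step [6], and show PS again.
ps=$(printf 'key1=a2\nkey1=a1\nkey3=c1\n' |
  sed -r -n -e ':a' -e 'N' -e '$!ba' \
      -e 's/^([^=]+=)([^\n]*)((\n[^\n]*)*\1[^\n]*)/\3,\2/' -e 'l' \
      -e 's/^\n//' -e 'l')
printf '%s\n' "$ps"
# prints:
#   \nkey1=a1,a2\nkey3=c1$
#   key1=a1,a2\nkey3=c1$
```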
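  5. If the current record is the first one with its key:
    • `s' of Step [2] will NOT succeed,
    • the jump of `t' in Step [3] will NOT be taken,
    • Step [4] moves the current record to the end of the buffer part,
    • Step [6] does nothing, except on the very first line of the file: there the buffer is still empty, so Step [4] leaves a leading newline, which Step [6] removes.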
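The two situations in this case can be tried out separately, as in the following sketch (GNU sed assumed; the variable names moved and first are illustrative). The first command applies Step [4] to a PS whose buffer part already holds two records; the second replays Steps [1], [4] and [6] on the very first line, where HS is still empty.

```shell
# Step [4]: the current record (key3=c1) moves behind the buffered records.
moved=$(printf 'key3=c1\nkey1=a1\nkey2=b1\n' |
  sed -r -n -e ':a' -e 'N' -e '$!ba' \
      -e 's/^([^\n]*)\n(.*)/\2\n\1/' -e 'l')
printf '%s\n' "$moved"
# prints: key1=a1\nkey2=b1\nkey3=c1$

# First line of the file: G adds a newline and an empty HS, Step [4] leaves
# that newline in front, and Step [6] strips it.
first=$(printf 'key1=a1\n' |
  sed -r -n -e 'G' -e 's/^([^\n]*)\n(.*)/\2\n\1/' -e 'l' -e 's/^\n//' -e 'l')
printf '%s\n' "$first"
# prints:
#   \nkey1=a1$
#   key1=a1$
```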
  6. In both cases,
    • After Step [6], PS contains the updated buffer.
    • If the current line is the last one of the data file, `q' of Step [7] prints PS, which now holds the combined records, and sed terminates.
    • Otherwise, `x' of Step [8] exchanges PS and HS. After the `x' command, HS contains the updated buffer.
    • Step [9] deletes the contents of PS (the old buffer) without printing them and starts a new cycle.
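One way to watch the buffer grow from cycle to cycle is a tracing variant of the script, sketched below. It is not the original Script1: Step [7] is replaced by `l', so the updated buffer is displayed after every input line instead of only at the end. GNU sed is assumed, and the variable names are illustrative.

```shell
# Tracing variant: `l' replaces `$q' so each cycle shows the updated buffer.
trace_script='G
s/^([^=]+=)([^\n]*)((\n[^\n]*)*\1[^\n]*)/\3,\2/
t br0
s/^([^\n]*)\n(.*)/\2\n\1/
:br0
s/^\n//
l
x
d'

# First four lines of the sample input.
trace=$(printf 'key1=a1\nkey3=c1\nkey2=b1\nkey1=a2\n' | sed -r -n "$trace_script")
printf '%s\n' "$trace"
# prints:
#   key1=a1$
#   key1=a1\nkey3=c1$
#   key1=a1\nkey3=c1\nkey2=b1$
#   key1=a1,a2\nkey3=c1\nkey2=b1$
```

The first three lines show new keys being appended to the end of the buffer; the fourth shows the value a2 being merged into the existing key1 record.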