#	@(#)POSIX	8.1 (Berkeley) 6/6/93
# $FreeBSD$

Comments on the IEEE P1003.2 Draft 12
     Part 2: Shell and Utilities
  Section 4.55: sed - Stream editor

Diomidis Spinellis <dds@doc.ic.ac.uk>
Keith Bostic <bostic@cs.berkeley.edu>

In the following paragraphs, "wrong" usually means "inconsistent with
historic practice", as most of the following comments refer to
undocumented inconsistencies between the historical versions of sed and
the POSIX 1003.2 standard.  All the comments are notes taken while
implementing a POSIX-compatible version of sed, and should not be
interpreted as official opinions or criticism towards the POSIX committee.
All uses of "POSIX" refer to section 4.55, Draft 12 of POSIX 1003.2.

 1.	32V and BSD-derived implementations of sed strip the text
	arguments of the a, c and i commands of their initial blanks,
	e.g.

	#!/bin/sed -f
	a\
		foo\
		\  indent\
		bar

	produces:

	foo
	  indent
	bar

	POSIX does not specify this behavior as the System V versions of
	sed do not do this stripping.  The argument against stripping is
	that it is difficult to write sed scripts that produce text with
	leading blanks if those blanks are stripped.  The argument for
	stripping is that it is difficult to write readable sed scripts
	unless indentation is allowed and ignored, and leading whitespace
	is obtainable by entering a backslash in front of it.  This
	implementation follows the BSD historic practice.

 2.	Historical versions of sed required that the w flag be the last
	flag to an s command as it takes an additional argument.  This
	is obvious, but not specified in POSIX.

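	A minimal sketch of the flag ordering in item 2, assuming a POSIX
	shell and mktemp(1); the variable names and sample input are
	illustrative only.  The g and p flags come first, and everything
	after w is taken as the file name.

```shell
# Scratch file for the w flag; mktemp is assumed to be available.
tmp=$(mktemp)

# g and p precede w, because the rest of the line after w is the file name.
printed=$(printf 'banana\ncherry\n' | sed -n "s/a/A/gpw $tmp")
saved=$(cat "$tmp")
rm -f "$tmp"

printf 'stdout: %s\nfile:   %s\n' "$printed" "$saved"
```
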
 3.	Historical versions of sed required that whitespace follow a w
	flag to an s command.  This is not specified in POSIX.  This
	implementation permits whitespace but does not require it.

 4.	Historical versions of sed permitted any number of whitespace
	characters to follow the w command.  This is not specified in
	POSIX.  This implementation permits whitespace but does not
	require it.

 5.	The rule for the l command differs from historic practice.  Table
	2-15 includes the various ANSI C escape sequences, including \\
	for backslash.  Some historical versions of sed also displayed
	two-digit octal numbers, not the three digits specified by POSIX.
	POSIX is a cleanup, and is followed by this implementation.

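	As an illustration of the POSIX form of the l command in item 5,
	the following sketch feeds sed a line containing a tab, a
	backslash and the byte 001.  The input is made up for the example;
	the output uses the C-style escapes and a three-digit octal number.

```shell
# Input: a, TAB, b, backslash, c, byte 001, newline.
out=$(printf 'a\tb\\c\001\n' | sed -n l)

# l writes the line unambiguously and marks the end of line with $.
printf '%s\n' "$out"
```
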
 6.	The POSIX specification for ! does not state that a single
	command following it must not contain an address specification,
	whereas a command list can contain address specifications.  The
	specification for ! implies that "3!/hello/p" works, and it never
	has, historically.  Note,

		3!{
			/hello/p
		}

	does work.

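	The working form from item 6 can be tried as follows.  This is
	only a sketch with invented input: every line except the third is
	passed to the block, and the block prints those that match /hello/.

```shell
# Line 3 is excluded by 3!; of the rest, only lines matching /hello/ print.
out=$(printf 'hello 1\nhello 2\nhello 3\nbye\n' | sed -n '3!{
/hello/p
}')

printf '%s\n' "$out"
```
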
 7.	POSIX does not specify what happens with consecutive ! commands
	(e.g. /foo/!!!p).  Historic implementations allow any number of
	!'s without changing the behaviour.  (It seems logical that each
	one might reverse the behaviour.)  This implementation follows
	historic practice.

 8.	Historic versions of sed permitted commands to be separated
	by semi-colons, e.g. "sed -ne '1p;2p;3q'" printed the first
	two lines of a file and quit at the third.  This is not
	specified by POSIX.  Note, the ; command separator is not
	allowed for the commands a, c, i, w, r, :, b, t, # and at the
	end of a w flag in the s command.  This implementation follows
	historic practice and implements the ; separator.

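	A short sketch of the ; separator from item 8, using made-up
	input.  Because -n suppresses the default output, the q command on
	line 3 quits without printing that line, so only the two lines
	selected by the p commands appear.

```shell
# Three commands in one -e argument, separated by semi-colons.
out=$(printf 'a\nb\nc\nd\n' | sed -n '1p;2p;3q')

printf '%s\n' "$out"
```
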
 9.	Historic versions of sed terminated the script if EOF was reached
	during the execution of the 'n' command, i.e.:

	sed -e '
	n
	i\
	hello
	' </dev/null

	did not produce any output.  POSIX does not specify this behavior.
	This implementation follows historic practice.

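	The behavior in item 9 is easier to observe with a one-line input,
	where the n command actually runs and then hits EOF.  The sketch
	below assumes a sed that follows the historic practice described
	above; the input is illustrative.  The i command after n is never
	reached, so "hello" is not written.

```shell
# n finds no next line, so the script ends before the i command runs.
out=$(printf 'only\n' | sed -e 'n
i\
hello')

printf '%s\n' "$out"
```
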
10.	Deleted.

11.	Historical implementations do not output the change text of a c
	command in the case of an address range whose first line number
	is greater than the second (e.g. 3,1).  POSIX requires that the
	text be output.  Since the historic behavior doesn't seem to have
	any particular purpose, this implementation follows the POSIX
	behavior.

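	A sketch of the POSIX behavior from item 11, assuming a sed that
	follows POSIX here as this implementation does.  The range 3,1
	matches line 3 only, and the change text replaces that line.  The
	five-line input is invented for the example.

```shell
# The range 3,1 selects just line 3, which c replaces with "text".
out=$(printf '1\n2\n3\n4\n5\n' | sed '3,1c\
text')

printf '%s\n' "$out"
```
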
12.	POSIX does not specify whether address ranges are checked and
	reset if a command is not executed due to a jump.  The following
	program will behave in different ways depending on whether the
	'c' command is triggered at the third line, i.e. will the text
	be output even though line 3 of the input will never logically
	encounter that command.

	2,4b
	1,3c\
		text

	Historic implementations did not output the text in the above
	example.  Therefore it was believed that a range whose second
	address was never matched extended to the end of the input.
	However, the current practice adopted by this implementation,
	as well as by those from GNU and Sun, is as follows:  The text
	from the 'c' command still isn't output because the second address
	isn't actually matched; but the range is reset after all if its
	second address is a line number.  In the above example, only the
	first line of the input will be deleted.

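	The example from item 12 can be run against a five-line input as
	sketched below, assuming a sed with the current practice described
	above.  Line 1 is deleted by c, lines 2 to 4 branch past it, and
	line 5 is printed because the 1,3 range has been reset.

```shell
# Lines 2-4 jump over the c command; its text is never output.
out=$(printf '1\n2\n3\n4\n5\n' | sed -e '2,4b' -e '1,3c\
text')

printf '%s\n' "$out"
```
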
13.	Historical implementations allow an output suppressing #n at the
	beginning of -e arguments as well as in a script file.  POSIX
	does not specify this.  This implementation follows historical
	practice.

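	A sketch of item 13, assuming a sed that honors #n at the start
	of the first -e argument.  The #n line acts like the -n option,
	so only the line selected by p is written.  The input is
	illustrative.

```shell
# "#n" as the first line of the first -e script suppresses default output.
out=$(printf 'a\nb\n' | sed -e '#n
1p')

printf '%s\n' "$out"
```
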
14.	POSIX does not explicitly specify how sed behaves if no script is
	specified.  Since the sed Synopsis permits this form of the command,
	and the language in the Description section states that the input
	is output, it seems reasonable that it behave like the cat(1)
	command.  Historic sed implementations behave differently for "ls |
	sed", where they produce no output, and "ls | sed -e#", where they
	behave like cat.  This implementation behaves like cat in both cases.

15.	The POSIX requirement to open all w files at the beginning makes
	sed behave nonintuitively when the w commands are preceded by
	addresses or are within conditional blocks.  This implementation
	follows historic practice and POSIX, by default, and provides the
	-a option which opens the files only when they are needed.

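	The default behavior in item 15 can be observed with a w command
	whose address never matches.  This sketch assumes mktemp(1) and
	does not use the -a option; the file and variable names are
	illustrative.  The w file is created, empty, even though nothing
	is ever written to it.

```shell
# Scratch directory for the w file; mktemp -d is assumed to be available.
dir=$(mktemp -d)

# The address matches no input line, yet the w file is opened up front.
printf 'a\nb\n' | sed -n "/no-such-text/w $dir/out"

if [ -e "$dir/out" ]; then created=yes; else created=no; fi
size=$(wc -c < "$dir/out" | tr -d ' ')
rm -rf "$dir"

printf 'created=%s size=%s\n' "$created" "$size"
```
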
16.	POSIX does not specify how escape sequences other than \n and \D
	(where D is the delimiter character) are to be treated.  This is
	reasonable; however, it also doesn't state that the backslash is
	to be discarded from the output regardless.  A strict reading of
	POSIX would be that "echo xyz | sed 's/./\a/'" would display
	"\ayz".  As historic sed implementations always discarded the
	backslash, this implementation does as well.

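	A sketch of the backslash handling in item 16.  It uses \q rather
	than \a, an assumption made for portability because some
	implementations give \a a meaning of their own.  With the
	backslash discarded, the first character is replaced by a plain q.

```shell
# \q has no special meaning, so the backslash is dropped from the output.
out=$(echo xyz | sed 's/./\q/')

printf '%s\n' "$out"
```
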
17.	POSIX specifies that an address can be "empty".  This implies
	that constructs like ",d" or "1,d" and ",5d" are allowed.  This
	is not true for historic implementations or this implementation
	of sed.

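	The three constructs from item 17 can be checked with a small
	loop.  This is a sketch, assuming a sed that rejects empty
	addresses as described above; it counts how many of the scripts
	fail with a non-zero exit status.

```shell
rejected=0
for script in ',d' '1,d' ',5d'; do
    # A non-zero exit status means sed refused to compile the script.
    if ! printf 'a\n' | sed "$script" >/dev/null 2>&1; then
        rejected=$((rejected + 1))
    fi
done

printf 'rejected %s of 3 scripts\n' "$rejected"
```
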
18.	The b, t and : commands are documented in POSIX to ignore leading
	white space, but no mention is made of trailing white space.
	Historic implementations of sed assigned different locations to
	the labels "x" and "x ".  This is not useful, and leads to subtle
	programming errors, but it is historic practice and changing it
	could theoretically break working scripts.  This implementation
	follows historic practice.

19.	Although POSIX specifies that reading from files that do not exist
	from within the script must not terminate the script, it does not
	specify what happens if a write command fails.  Historic practice
	is to fail immediately if the file cannot be opened or written.
	Historic practice is followed by this implementation.

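	The contrast in item 19 between a failed read and a failed write
	can be sketched as follows.  The path /nonexistent-dir is assumed
	not to exist, and the variable names are illustrative.  The r
	command is silently ignored, while the w command makes sed exit
	with a non-zero status.

```shell
# Reading a missing file is not an error; the input is still copied.
if read_out=$(printf 'a\n' | sed 'r /nonexistent-dir/in'); then
    read_status=0
else
    read_status=$?
fi

# A w file that cannot be opened makes sed fail immediately.
if write_out=$(printf 'a\n' | sed 'w /nonexistent-dir/out' 2>/dev/null); then
    write_status=0
else
    write_status=$?
fi

printf 'read: %s  write: %s\n' "$read_status" "$write_status"
```
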
20.	Historic practice is that the \n construct can be used for either
	string1 or string2 of the y command.  This is not specified by
	POSIX.  This implementation follows historic practice.

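	Both uses of \n in the y command from item 20 are sketched below,
	with invented input.  The first joins two lines by translating the
	embedded newline to a space; the second splits a line by
	translating a space to a newline.

```shell
# \n in string1: N appends the next line, y turns the newline into a space.
joined=$(printf 'a\nb\n' | sed 'N;y/\n/ /')

# \n in string2: the space becomes a newline.
split=$(printf 'a b\n' | sed 'y/ /\n/')

printf '%s\n%s\n' "$joined" "$split"
```
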
21.	Deleted.

22.	Historic implementations of sed ignore the RE delimiter characters
	within character classes.  This is not specified in POSIX.  This
	implementation follows historic practice.

23.	Historic implementations handle empty RE's in a special way: the
	empty RE is interpreted as if it were the last RE encountered,
	whether in an address or elsewhere.  POSIX does not document this
	behavior.  For example, the command:

		sed -e /abc/s//XXX/

	substitutes XXX for the pattern abc.  The semantics of "the last
	RE" can be defined in two different ways:

	1. The last RE encountered when compiling (lexical/static scope).
	2. The last RE encountered while running (dynamic scope).

	While many historical implementations fail on programs depending
	on scope differences, the SunOS version exhibited dynamic scope
	behaviour.  This implementation does dynamic scoping, as this
	seems the most useful and remains consistent with historical
	practice.
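
	The empty RE example from item 23 can be run as sketched below,
	with made-up input.  The empty RE in the s command reuses /abc/
	from the address, so only the matching line is changed.  This
	simple case does not distinguish static from dynamic scope.

```shell
# s//XXX/ reuses the last RE, here the /abc/ from the address.
out=$(printf 'abc\nxyz\n' | sed -e '/abc/s//XXX/')

printf '%s\n' "$out"
```
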