A.4.11 String Encoding
{AI05-0137-2} Facilities for encoding, decoding, and converting strings in various
character encoding schemes are provided by packages Strings.UTF_Encoding,
Strings.UTF_Encoding.Conversions, Strings.UTF_Encoding.Strings,
Strings.UTF_Encoding.Wide_Strings, and Strings.UTF_Encoding.Wide_Wide_Strings.
Static Semantics
{AI05-0137-2} The encoding library packages have the following declarations:
{AI05-0137-2}
    package Ada.Strings.UTF_Encoding is
       pragma Pure (UTF_Encoding);

       -- Declarations common to the string encoding packages
       type Encoding_Scheme is (UTF_8, UTF_16BE, UTF_16LE);

       subtype UTF_String is String;
       subtype UTF_8_String is String;
       subtype UTF_16_Wide_String is Wide_String;

       Encoding_Error : exception;

       BOM_8 : constant UTF_8_String :=
                 Character'Val(16#EF#) &
                 Character'Val(16#BB#) &
                 Character'Val(16#BF#);
       BOM_16BE : constant UTF_String :=
                 Character'Val(16#FE#) &
                 Character'Val(16#FF#);
       BOM_16LE : constant UTF_String :=
                 Character'Val(16#FF#) &
                 Character'Val(16#FE#);
       BOM_16 : constant UTF_16_Wide_String :=
                 (1 => Wide_Character'Val(16#FEFF#));

       function Encoding (Item    : UTF_String;
                          Default : Encoding_Scheme := UTF_8)
          return Encoding_Scheme;
    end Ada.Strings.UTF_Encoding;
{AI05-0137-2}
    package Ada.Strings.UTF_Encoding.Conversions is
       pragma Pure (Conversions);

       -- Conversions between various encoding schemes
       function Convert (Item          : UTF_String;
                         Input_Scheme  : Encoding_Scheme;
                         Output_Scheme : Encoding_Scheme;
                         Output_BOM    : Boolean := False)
          return UTF_String;
       function Convert (Item          : UTF_String;
                         Input_Scheme  : Encoding_Scheme;
                         Output_BOM    : Boolean := False)
          return UTF_16_Wide_String;
       function Convert (Item          : UTF_8_String;
                         Output_BOM    : Boolean := False)
          return UTF_16_Wide_String;
       function Convert (Item          : UTF_16_Wide_String;
                         Output_Scheme : Encoding_Scheme;
                         Output_BOM    : Boolean := False)
          return UTF_String;
       function Convert (Item          : UTF_16_Wide_String;
                         Output_BOM    : Boolean := False)
          return UTF_8_String;
    end Ada.Strings.UTF_Encoding.Conversions;
{AI05-0137-2}
    package Ada.Strings.UTF_Encoding.Strings is
       pragma Pure (Strings);

       -- Encoding / decoding between String and various encoding schemes
       function Encode (Item          : String;
                        Output_Scheme : Encoding_Scheme;
                        Output_BOM    : Boolean := False)
          return UTF_String;
       function Encode (Item       : String;
                        Output_BOM : Boolean := False)
          return UTF_8_String;
       function Encode (Item       : String;
                        Output_BOM : Boolean := False)
          return UTF_16_Wide_String;
       function Decode (Item         : UTF_String;
                        Input_Scheme : Encoding_Scheme)
          return String;
       function Decode (Item : UTF_8_String)
          return String;
       function Decode (Item : UTF_16_Wide_String)
          return String;
    end Ada.Strings.UTF_Encoding.Strings;
{AI05-0137-2}
    package Ada.Strings.UTF_Encoding.Wide_Strings is
       pragma Pure (Wide_Strings);

       -- Encoding / decoding between Wide_String and various encoding schemes
       function Encode (Item          : Wide_String;
                        Output_Scheme : Encoding_Scheme;
                        Output_BOM    : Boolean := False)
          return UTF_String;
       function Encode (Item       : Wide_String;
                        Output_BOM : Boolean := False)
          return UTF_8_String;
       function Encode (Item       : Wide_String;
                        Output_BOM : Boolean := False)
          return UTF_16_Wide_String;
       function Decode (Item         : UTF_String;
                        Input_Scheme : Encoding_Scheme)
          return Wide_String;
       function Decode (Item : UTF_8_String)
          return Wide_String;
       function Decode (Item : UTF_16_Wide_String)
          return Wide_String;
    end Ada.Strings.UTF_Encoding.Wide_Strings;
{AI05-0137-2}
    package Ada.Strings.UTF_Encoding.Wide_Wide_Strings is
       pragma Pure (Wide_Wide_Strings);

       -- Encoding / decoding between Wide_Wide_String and various encoding schemes
       function Encode (Item          : Wide_Wide_String;
                        Output_Scheme : Encoding_Scheme;
                        Output_BOM    : Boolean := False)
          return UTF_String;
       function Encode (Item       : Wide_Wide_String;
                        Output_BOM : Boolean := False)
          return UTF_8_String;
       function Encode (Item       : Wide_Wide_String;
                        Output_BOM : Boolean := False)
          return UTF_16_Wide_String;
       function Decode (Item         : UTF_String;
                        Input_Scheme : Encoding_Scheme)
          return Wide_Wide_String;
       function Decode (Item : UTF_8_String)
          return Wide_Wide_String;
       function Decode (Item : UTF_16_Wide_String)
          return Wide_Wide_String;
    end Ada.Strings.UTF_Encoding.Wide_Wide_Strings;
{AI05-0137-2} {AI05-0262-1} The type Encoding_Scheme defines encoding schemes. UTF_8 corresponds
to the UTF-8 encoding scheme defined by Annex D of ISO/IEC 10646. UTF_16BE
corresponds to the UTF-16 encoding scheme defined by Annex C of ISO/IEC
10646 in 8-bit, big-endian order; and UTF_16LE corresponds to the UTF-16
encoding scheme in 8-bit, little-endian order.
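The difference between the two UTF-16 byte orders can be illustrated with a small sketch. It is not part of the standard text; the procedure name Show_Byte_Order and the package renaming UTF are assumptions made for the illustration, and the assertions are checked only if the Assertion_Policy pragma is honored.

```ada
with Ada.Strings.UTF_Encoding.Strings;

procedure Show_Byte_Order is
   pragma Assertion_Policy (Check);

   --  UTF is a local shorthand introduced for this sketch.
   package UTF renames Ada.Strings.UTF_Encoding;

   --  'A' is code point 16#41#. In UTF-16 it is a single 16-bit code
   --  unit, which occupies two 8-bit Characters in a UTF_String.
   BE : constant UTF.UTF_String :=
     UTF.Strings.Encode ("A", Output_Scheme => UTF.UTF_16BE);
   LE : constant UTF.UTF_String :=
     UTF.Strings.Encode ("A", Output_Scheme => UTF.UTF_16LE);
begin
   --  Big-endian: most significant byte first.
   pragma Assert (BE = Character'Val (16#00#) & Character'Val (16#41#));
   --  Little-endian: least significant byte first.
   pragma Assert (LE = Character'Val (16#41#) & Character'Val (16#00#));
   null;
end Show_Byte_Order;
```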
{AI05-0137-2} The subtype UTF_String is used to represent a String of 8-bit values
containing a sequence of values encoded in one of three ways (UTF-8,
UTF-16BE, or UTF-16LE). The subtype UTF_8_String is used to represent
a String of 8-bit values containing a sequence of values encoded in UTF-8.
The subtype UTF_16_Wide_String is used to represent a Wide_String of
16-bit values containing a sequence of values encoded in UTF-16.
{AI05-0137-2} {AI05-0262-1} The BOM_8, BOM_16BE, BOM_16LE, and BOM_16 constants correspond to values
used at the start of a string to indicate the encoding.
{AI05-0262-1} {AI05-0269-1} Each of the Encode functions takes a String, Wide_String, or Wide_Wide_String
Item parameter that is assumed to be an array of unencoded characters.
Each of the Convert functions takes a UTF_String, UTF_8_String, or UTF_16_Wide_String
Item parameter that is assumed to contain characters whose position values
correspond to a valid encoding sequence according to the encoding scheme
required by the function or specified by its Input_Scheme parameter.
{AI05-0137-2} {AI05-0262-1} {AI05-0269-1} Each of the Convert and Encode functions returns a UTF_String, UTF_8_String,
or UTF_16_Wide_String value whose characters have position values that correspond
to the encoding of the Item parameter according to the encoding scheme
required by the function or specified by its Output_Scheme parameter.
For UTF_8, no overlong encoding is returned. A BOM is included at the
start of the returned string if the Output_BOM parameter is set to True.
The lower bound of the returned string is 1.
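As an illustration of these rules, the following sketch encodes a one-character String into UTF-8 with and without a BOM. The procedure name Show_Encode, the renaming UTF, and the choice of input are assumptions made for the example rather than anything required by the standard.

```ada
with Ada.Strings.UTF_Encoding.Strings;

procedure Show_Encode is
   pragma Assertion_Policy (Check);

   --  UTF is a local shorthand introduced for this sketch.
   package UTF renames Ada.Strings.UTF_Encoding;

   --  One Latin-1 character (e-acute, position 16#E9#), deliberately
   --  stored at index 5 to show that the result has lower bound 1.
   Source : constant String (5 .. 5) := (5 => Character'Val (16#E9#));

   Plain  : constant UTF.UTF_8_String := UTF.Strings.Encode (Source);
   Marked : constant UTF.UTF_8_String :=
     UTF.Strings.Encode (Source, Output_BOM => True);
begin
   --  U+00E9 needs two bytes in UTF-8: 16#C3# 16#A9#.
   pragma Assert (Plain = Character'Val (16#C3#) & Character'Val (16#A9#));
   pragma Assert (Plain'First = 1);
   --  With Output_BOM => True the three BOM_8 bytes come first.
   pragma Assert (Marked = UTF.BOM_8 & Plain);
   null;
end Show_Encode;
```

The expected type of each constant (a String subtype here) is what selects the UTF-8 overload of Encode over the one returning UTF_16_Wide_String.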
{AI05-0137-2} {AI05-0262-1} Each of the Decode functions takes a UTF_String, UTF_8_String, or UTF_16_Wide_String
Item parameter which is assumed to contain characters whose position
values correspond to a valid encoding sequence according to the encoding
scheme required by the function or specified by its Input_Scheme parameter,
and returns the corresponding String, Wide_String, or Wide_Wide_String
value. The lower bound of the returned string is 1.
{AI05-0137-2} {AI05-0262-1} For each of the Convert and Decode functions, an initial BOM in the input
that matches the expected encoding scheme is ignored, and a different
initial BOM causes Encoding_Error to be propagated.
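A short sketch of this BOM rule, under the assumption that the input is the UTF-8 BOM followed by three ASCII characters. The names Show_BOM_Handling, UTF, With_BOM, and Rejected are hypothetical and exist only for the illustration.

```ada
with Ada.Strings.UTF_Encoding.Strings;

procedure Show_BOM_Handling is
   pragma Assertion_Policy (Check);

   --  UTF is a local shorthand introduced for this sketch.
   package UTF renames Ada.Strings.UTF_Encoding;

   With_BOM : constant UTF.UTF_8_String := UTF.BOM_8 & "abc";
   Rejected : Boolean := False;
begin
   --  The BOM matches the expected scheme (UTF-8), so it is ignored.
   pragma Assert (UTF.Strings.Decode (With_BOM) = "abc");

   --  The same bytes presented as UTF-16BE do not start with BOM_16BE,
   --  so Decode is expected to propagate Encoding_Error.
   begin
      declare
         Decoded : constant String :=
           UTF.Strings.Decode (With_BOM, Input_Scheme => UTF.UTF_16BE);
      begin
         Rejected := Decoded'Length < 0;  --  stays False if no exception
      end;
   exception
      when UTF.Encoding_Error =>
         Rejected := True;
   end;

   pragma Assert (Rejected);
end Show_BOM_Handling;
```

The call to Decode sits in a nested declare block so that an exception raised while elaborating the constant is caught by the handler of the enclosing block.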
{AI05-0137-2} The exception Encoding_Error is also propagated in the following situations:
  * By a Decode function when a UTF encoded string contains an invalid
    encoding sequence.
  * By a Decode function when the expected encoding is UTF-16BE or UTF-16LE
    and the input string has an odd length.
  * {AI05-0262-1} By a Decode function yielding a String when the decoding of
    a sequence results in a code point whose value exceeds 16#FF#.
  * By a Decode function yielding a Wide_String when the decoding of a
    sequence results in a code point whose value exceeds 16#FFFF#.
  * {AI05-0262-1} By an Encode function taking a Wide_String as input when an
    invalid character appears in the input. In particular, the characters
    whose position is in the range 16#D800# .. 16#DFFF# are invalid because
    they conflict with UTF-16 surrogate encodings, and the characters whose
    position is 16#FFFE# or 16#FFFF# are also invalid because they conflict
    with BOM codes.
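Two of the situations listed above can be sketched as follows: a truncated UTF-8 sequence, and a code point that is valid but too large for String. The helper Decodes_To_String and all other names are hypothetical, introduced only for this example.

```ada
with Ada.Strings.UTF_Encoding.Strings;
with Ada.Strings.UTF_Encoding.Wide_Strings;

procedure Show_Encoding_Error is
   pragma Assertion_Policy (Check);

   --  UTF is a local shorthand introduced for this sketch.
   package UTF renames Ada.Strings.UTF_Encoding;

   --  U+20AC (euro sign) in UTF-8 is the three bytes E2 82 AC.
   Euro_UTF_8 : constant UTF.UTF_8_String :=
     Character'Val (16#E2#) & Character'Val (16#82#) & Character'Val (16#AC#);
   Euro       : constant Wide_String := (1 => Wide_Character'Val (16#20AC#));

   --  A lead byte with its continuation byte missing.
   Truncated  : constant UTF.UTF_8_String := (1 => Character'Val (16#C3#));

   --  Hypothetical helper: True if Item decodes to a String without
   --  Encoding_Error being propagated.
   function Decodes_To_String (Item : UTF.UTF_8_String) return Boolean is
   begin
      declare
         Decoded : constant String := UTF.Strings.Decode (Item);
      begin
         return Decoded'Length >= 0;
      end;
   exception
      when UTF.Encoding_Error =>
         return False;
   end Decodes_To_String;
begin
   --  Invalid encoding sequence.
   pragma Assert (not Decodes_To_String (Truncated));
   --  16#20AC# exceeds 16#FF#, so it cannot be returned in a String ...
   pragma Assert (not Decodes_To_String (Euro_UTF_8));
   --  ... but it fits in a Wide_String.
   pragma Assert (UTF.Wide_Strings.Decode (Euro_UTF_8) = Euro);
   null;
end Show_Encoding_Error;
```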
{AI05-0137-2}
    function Encoding (Item    : UTF_String;
                       Default : Encoding_Scheme := UTF_8)
       return Encoding_Scheme;
{AI05-0137-2} {AI05-0269-1} Inspects a UTF_String value to determine whether it starts with a BOM
for UTF-8, UTF-16BE, or UTF-16LE. If so, returns the scheme corresponding
to the BOM; otherwise, returns the value of Default.
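A minimal sketch of how Encoding behaves with and without a leading BOM. The inputs are made up for the illustration, and the names Show_Encoding and UTF are assumptions of this example.

```ada
with Ada.Strings.UTF_Encoding;

procedure Show_Encoding is
   pragma Assertion_Policy (Check);

   --  UTF is a local shorthand introduced for this sketch.
   package UTF renames Ada.Strings.UTF_Encoding;
   use type UTF.Encoding_Scheme;  --  makes "=" directly visible

   --  'A' in UTF-16LE, preceded by the UTF-16LE BOM.
   LE_Text : constant UTF.UTF_String :=
     UTF.BOM_16LE & Character'Val (16#41#) & Character'Val (16#00#);
begin
   --  A recognized BOM determines the result.
   pragma Assert (UTF.Encoding (UTF.BOM_8 & "text") = UTF.UTF_8);
   pragma Assert (UTF.Encoding (LE_Text) = UTF.UTF_16LE);

   --  Without a BOM, the value of Default is returned.
   pragma Assert (UTF.Encoding ("text") = UTF.UTF_8);
   pragma Assert
     (UTF.Encoding ("text", Default => UTF.UTF_16BE) = UTF.UTF_16BE);
   null;
end Show_Encoding;
```

Note that Encoding only looks for a BOM; it does not examine the rest of Item to guess the scheme.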
{AI05-0137-2}
    function Convert (Item          : UTF_String;
                      Input_Scheme  : Encoding_Scheme;
                      Output_Scheme : Encoding_Scheme;
                      Output_BOM    : Boolean := False)
       return UTF_String;
Returns the value of Item (originally encoded in UTF-8, UTF-16LE, or UTF-16BE
as specified by Input_Scheme) encoded in one of these three schemes as
specified by Output_Scheme.
{AI05-0137-2}
    function Convert (Item          : UTF_String;
                      Input_Scheme  : Encoding_Scheme;
                      Output_BOM    : Boolean := False)
       return UTF_16_Wide_String;
Returns the value of Item (originally encoded in UTF-8, UTF-16LE, or UTF-16BE
as specified by Input_Scheme) encoded in UTF-16.
{AI05-0137-2}
    function Convert (Item          : UTF_8_String;
                      Output_BOM    : Boolean := False)
       return UTF_16_Wide_String;
Returns the value of Item (originally encoded in UTF-8) encoded in UTF-16.
{AI05-0137-2}
    function Convert (Item          : UTF_16_Wide_String;
                      Output_Scheme : Encoding_Scheme;
                      Output_BOM    : Boolean := False)
       return UTF_String;
Returns the value of Item (originally encoded in UTF-16) encoded in UTF-8,
UTF-16LE, or UTF-16BE as specified by Output_Scheme.
{AI05-0137-2}
    function Convert (Item          : UTF_16_Wide_String;
                      Output_BOM    : Boolean := False)
       return UTF_8_String;
Returns the value of Item (originally encoded in UTF-16) encoded in UTF-8.
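The Convert overloads can be exercised with a small round trip, sketched here under the assumption that the input is the UTF-8 encoding of U+00E9. The names Show_Convert, UTF, Source, Wide, BE, and Back are introduced for the illustration only.

```ada
with Ada.Strings.UTF_Encoding.Conversions;

procedure Show_Convert is
   pragma Assertion_Policy (Check);

   --  UTF is a local shorthand introduced for this sketch.
   package UTF renames Ada.Strings.UTF_Encoding;

   --  U+00E9 encoded in UTF-8 is the two bytes C3 A9.
   Source : constant UTF.UTF_8_String :=
     Character'Val (16#C3#) & Character'Val (16#A9#);

   --  UTF-8 to UTF-16 held in 16-bit Wide_Characters.
   Wide : constant UTF.UTF_16_Wide_String :=
     UTF.Conversions.Convert (Source);

   --  UTF-8 to UTF-16BE held in 8-bit Characters, with a BOM.
   BE : constant UTF.UTF_String :=
     UTF.Conversions.Convert
       (Source,
        Input_Scheme  => UTF.UTF_8,
        Output_Scheme => UTF.UTF_16BE,
        Output_BOM    => True);

   --  UTF-16 back to UTF-8.
   Back : constant UTF.UTF_8_String := UTF.Conversions.Convert (Wide);
begin
   pragma Assert (Wide'Length = 1);
   pragma Assert (Wide (Wide'First) = Wide_Character'Val (16#E9#));
   pragma Assert
     (BE = UTF.BOM_16BE & Character'Val (16#00#) & Character'Val (16#E9#));
   pragma Assert (Back = Source);
   null;
end Show_Convert;
```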
{AI05-0137-2}
    function Encode (Item          : String;
                     Output_Scheme : Encoding_Scheme;
                     Output_BOM    : Boolean := False)
       return UTF_String;
{AI05-0262-1} Returns the value of Item encoded in UTF-8, UTF-16LE, or UTF-16BE as
specified by Output_Scheme.
{AI05-0137-2}
    function Encode (Item       : String;
                     Output_BOM : Boolean := False)
       return UTF_8_String;
Returns the value of Item encoded in UTF-8.
{AI05-0137-2}
    function Encode (Item       : String;
                     Output_BOM : Boolean := False)
       return UTF_16_Wide_String;
Returns the value of Item encoded in UTF-16.
{AI05-0137-2}
    function Decode (Item         : UTF_String;
                     Input_Scheme : Encoding_Scheme)
       return String;
Returns the result of decoding Item, which is encoded in UTF-8, UTF-16LE, or
UTF-16BE as specified by Input_Scheme.
{AI05-0137-2}
    function Decode (Item : UTF_8_String)
       return String;
Returns the result of decoding Item, which is encoded in UTF-8.
{AI05-0137-2}
    function Decode (Item : UTF_16_Wide_String)
       return String;
Returns the result of decoding Item, which is encoded in UTF-16.
{AI05-0137-2}
    function Encode (Item          : Wide_String;
                     Output_Scheme : Encoding_Scheme;
                     Output_BOM    : Boolean := False)
       return UTF_String;
{AI05-0262-1} Returns the value of Item encoded in UTF-8, UTF-16LE, or UTF-16BE as
specified by Output_Scheme.
{AI05-0137-2}
    function Encode (Item       : Wide_String;
                     Output_BOM : Boolean := False)
       return UTF_8_String;
Returns the value of Item encoded in UTF-8.
{AI05-0137-2}
    function Encode (Item       : Wide_String;
                     Output_BOM : Boolean := False)
       return UTF_16_Wide_String;
Returns the value of Item encoded in UTF-16.
{AI05-0137-2}
    function Decode (Item         : UTF_String;
                     Input_Scheme : Encoding_Scheme)
       return Wide_String;
Returns the result of decoding Item, which is encoded in UTF-8, UTF-16LE, or
UTF-16BE as specified by Input_Scheme.
{AI05-0137-2}
    function Decode (Item : UTF_8_String)
       return Wide_String;
Returns the result of decoding Item, which is encoded in UTF-8.
{AI05-0137-2}
    function Decode (Item : UTF_16_Wide_String)
       return Wide_String;
Returns the result of decoding Item, which is encoded in UTF-16.
{AI05-0137-2}
    function Encode (Item          : Wide_Wide_String;
                     Output_Scheme : Encoding_Scheme;
                     Output_BOM    : Boolean := False)
       return UTF_String;
{AI05-0262-1} Returns the value of Item encoded in UTF-8, UTF-16LE, or UTF-16BE as
specified by Output_Scheme.
{AI05-0137-2}
    function Encode (Item       : Wide_Wide_String;
                     Output_BOM : Boolean := False)
       return UTF_8_String;
Returns the value of Item encoded in UTF-8.
{AI05-0137-2}
    function Encode (Item       : Wide_Wide_String;
                     Output_BOM : Boolean := False)
       return UTF_16_Wide_String;
Returns the value of Item encoded in UTF-16.
{AI05-0137-2}
    function Decode (Item         : UTF_String;
                     Input_Scheme : Encoding_Scheme)
       return Wide_Wide_String;
Returns the result of decoding Item, which is encoded in UTF-8, UTF-16LE, or
UTF-16BE as specified by Input_Scheme.
{AI05-0137-2}
    function Decode (Item : UTF_8_String)
       return Wide_Wide_String;
Returns the result of decoding Item, which is encoded in UTF-8.
{AI05-0137-2}
    function Decode (Item : UTF_16_Wide_String)
       return Wide_Wide_String;
Returns the result of decoding Item, which is encoded in UTF-16.
Implementation Advice
{AI05-0137-2} If an implementation supports other encoding schemes, another similar
child of Ada.Strings should be defined.
Implementation Advice: If an implementation supports other string encoding
schemes, a child of Ada.Strings similar to UTF_Encoding should be defined.
18 {AI05-0137-2} A BOM (Byte-Order Mark, code position 16#FEFF#) can be included in a
file or other entity to indicate the encoding; it is skipped when decoding.
Typically, only the first line of a file or other entity contains a BOM.
When decoding, the Encoding function can be called on the first line
to determine the encoding; this encoding will then be used in subsequent
calls to Decode to convert all of the lines to an internal format.
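The usage described in this note might look like the following sketch. It assumes, for illustration only, that two lines have already been read into UTF_String objects; real code would obtain them from a file. All names other than the library entities are hypothetical.

```ada
with Ada.Strings.UTF_Encoding.Wide_Wide_Strings;

procedure Show_Line_Decoding is
   pragma Assertion_Policy (Check);

   --  UTF is a local shorthand introduced for this sketch.
   package UTF renames Ada.Strings.UTF_Encoding;

   --  Stand-ins for lines read from a file; only the first has a BOM.
   First_Line  : constant UTF.UTF_String := UTF.BOM_8 & "first";
   Second_Line : constant UTF.UTF_String := "second";

   --  Determine the scheme once, from the first line.
   Scheme : constant UTF.Encoding_Scheme := UTF.Encoding (First_Line);

   --  Reuse that scheme for every line; the BOM on the first line
   --  matches the scheme and is therefore skipped by Decode.
   Line_1 : constant Wide_Wide_String :=
     UTF.Wide_Wide_Strings.Decode (First_Line, Scheme);
   Line_2 : constant Wide_Wide_String :=
     UTF.Wide_Wide_Strings.Decode (Second_Line, Scheme);
begin
   pragma Assert (Line_1 = "first");
   pragma Assert (Line_2 = "second");
   null;
end Show_Line_Decoding;
```

Decoding to Wide_Wide_String is chosen here because it can hold any code point, so the range-related cases of Encoding_Error for String and Wide_String results do not arise.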
Extensions to Ada 2005
{AI05-0137-2} The packages Strings.UTF_Encoding, Strings.UTF_Encoding.Conversions,
Strings.UTF_Encoding.Strings, Strings.UTF_Encoding.Wide_Strings, and
Strings.UTF_Encoding.Wide_Wide_Strings are new.
Ada 2005 and 2012 Editions sponsored in part by Ada-Europe