Strings
Toit strings are immutable value objects, containing Unicode strings. The characters are encoded with UTF-8, although most of the time you can ignore this.
String literals
String literals use double quotes, "
. (Single quotes
are reserved for
character literals.)
If a string literal contains double quotes ("
), or backslashes (\
),
then these must be escaped with backslashes:
main: print "String with \"quotes\"." // >> String with "quotes". print "Zig zag string /\\/\\." // >> Zig zag string /\/\.
As an alternative, you can embed double quote characters using triple-quote delimited strings:
Escaped characters
String and character literals can also contain other escaped
characters, most of which could be written literally, but where it
is often more readable to use escapes. For example, you can
put a literal tab character in a literal string, but it will
look like a series of spaces, so you can use \t
instead to
make it clearer.
Toit supports the following escape sequences (see the ASCII control code chart for the list of names):
Control code chart names | Unicode character | Unicode code point (rune) | |
\0 | Null | NUL | 0 |
\a | Alert (Beep, Bell) | BEL | 7 |
\b | Backspace | BS | 8 |
\f | Form feed | FF | 12 |
\n | Line Feed or Newline | LF | 10 |
\r | Carriage Return | CR | 13 |
\t | Horizontal Tab | HT | 9 |
\v | Vertical Tab | VT | 11 |
\$ | $ (See below for the use of $ for string interpolation) | 36 | |
\" | " | 34 | |
\' | ' | 39 | |
\\ | \ | 92 | |
\uXXXX (where X is a hex char) | Unicode rune | ||
\u{X...} (where X... are hex chars) | Unicode rune | ||
\xXX (where X is a hex char) | Unicode rune | ||
\x{X...} (where X... are hex chars) | Unicode rune |
The following shows some examples of the numeric escapes:
// For two-digit code points (runes). // using \x, the curly braces are optional. s1 /string := "S\xf8en s\x{e5} s\u{00e6}r ud." print s1 // >> Søen så sær ud. // For four-digit values you can either use \x with curly // braces or \u. s2 /string := "\u{20ac} \u20ac \x{20ac}" print s2 // Prints "€ € €", which is Unicode 0x20ac. // The \u version can also be used with curly braces. // It is in fact necessary for 5-digit Unicode runes. s3 /string := "\u{1f648}" print s3 // Prints the see-no-evil-monkey emoji. // \x{} is recommended for one or two digits, and \u{} is // recommended for three to five digits. // The same escapes work in character literals. c /int := '\x2a' print c // Prints "42", which is the decimal from of 0x2a. c2 /int := '\'' print c2 // Prints "39", the ASCII code for a single quote.
See below for more advanced string literals.
String representation
Strings are represented internally as
UTF-8
byte sequences. This is visible when requesting the length,
which is given in bytes with the size
getter:
print "Hello!".size // Prints "6", the number of bytes. print "Hellö!".size // Prints "7", the number of bytes.
It is also visible when using []
to index into a string:
print "Hello!"[0] // Prints "72", the ASCII code for 'H'. print "Hellö!"[4] // Prints "246", the Unicode code point for 'ö'.
Since some UTF-8 sequences consist of multiple bytes, there
are some index points that do not correspond to a character.
At these points, the []
indexing operation returns null.
For example, the Euro sign is coded as three bytes:
print "€2".size // Prints "4", the number of bytes in the string. print "€2"[0] // Prints "8364", the Unicode code for '€'. print "€2"[1] // Prints "null", no character starts at byte index 1. print "€2"[2] // Prints "null", no character starts at byte index 2. print "€2"[3] // Prints "50", the ASCII code for '2'.
Since null is falsy we can index only the valid code points quite simply. The following program iterates over the 5-character, 6-byte string, printing 163, 50, 46, 50, 52:
STR ::= "£2.24" for i := 0; i < STR.size; i++: ch := STR[i] if ch: // Only handle non-null. print ch
In this case it would be simpler to use the do --runes method:
Toit does not allow the construction of illegal UTF-8 sequences. For example when converting byte arrays to strings:
// The following examples use the #[...] syntax for // byte array literals. // Create a single-character string containing an 'A': s1/string := #[65].to-string // Create a string with the word "Hello": s2/string := #[72, 101, 108, 108, 111].to-string // Null characters are allowed in strings: s3/string := #[0].to-string // Bytes above 127 are part of multi-byte UTF-8 sequences // and cannot stand alone: s4/string := #[128].to-string // Error: Not legal UTF-8! s5/string := #[0xc0, 0xbb].to-string // Error: Overlong encoding!
Surrogates are also not legal in Toit strings.
The raw UTF-8 bytes in a Toit string can be accessed with
at --raw
.
// Print all 10 UTF-8 bytes in the 8-character string. s6 := "RöckDöts" list := [] // Empty list. s6.size.repeat: | index | // Index ranges from 0 to 10. list.add (s6.at --raw index) print list // >> [82, 195, 182, 99, 107, 68, 195, 182, 116, 115]
Alternatively, the string's UTF-8 bytes can be written to a byte array with the to-byte-array or write-to-byte-array methods.
Slices of strings
Toit supports the [x..y]
notation for slices of strings.
This creates a new string consisting of bytes x to y, where
x is inclusive and y is exclusive:
hello ::= "Hello" print hello[2..4] // Prints "ll", characters 2 to 4 of "Hello". two-euro := "€2" print two-euro[0..2] // Error, this 2-byte slice cuts the '€' character! print two-euro[0..3] // Prints "€", these 3 bytes make up the '€' character.
The first index can be omitted, in which case the slice starts at the beginning of the source string. If the second index is omitted, the slice ends at the end of the source string.
hello-world := "Hello, World!" print hello-world[..5] // >> Hello print hello-world[7..] // >> World!
Multi-line strings
For strings that span multiple lines, newlines can be inserted
with \n
, but it is often clearer to use the triple-quotes
syntax, """..."""
.
The multi-line strings trim indentation-based white-space (see below). Therefore the above program will output the following:
Toit removes the first newline if it is immediately after the """
.
You should almost always put a newline after """
if you are using the
triple quotes for multi-line strings, since it makes the indentation rules
much easier to understand.
In multi-line strings there are some extra escapes that can be used.
\ followed by a newline (ASCII 10) | Removes a newline in a multi-line string. |
\ followed by a carriage return (ASCII 13) | Removes a newline in a multi-line string. If followed by a newline, that one is removed as well. |
\s | Space (see below for a detailed explanation). |
Toit removes the indentation of a multi-line string. It looks for the line with the smallest indentation and removes that amount of indentation for each line.
If a space (or several spaces) are needed in front of each line, use \s
to indicate to the compiler that this space shouldn't be seen as indentation.
main: str1 := """ foo""" str2 := """ \sfoo""" print "[" + str1 + "]" // >> [foo] print "[" + str2 + "]" // >> [ foo]
Note that multi-line strings frequently end with a newline anyway, in which case the last line can be used to indicate how much the indentation is:
main: print """ foo bar """ // <== indented less than the previous lines. As such // it gives the indentation for the other lines. print """ foo bar """
This results in the output:
Long string literals
Triple-quoting can also be used for very long strings that don't necessarily contain newlines. In this case you end each line with a backslash, which will cause the parser to ignore the newline at the end of the line.
You should start the string with a non-escaped newline that will be ignored in accordance with the above rules for triple quoted strings.
main: print """ This is a very long line that goes on \ and on for a very long time. It's easier \ to put any spaces at the end of the line, \ but you can put them at the start,\ making use of the usual\ indentation-removing rules \ for triple-quoted strings."""
This prints
This is a very long line that goes on and on for a very long time. It's easier to put any spaces at the end of the line, but you can put them at the start, making use of the usual indentation-removing rules for triple-quoted strings.
String interpolation
Using the dollar sign ('$') we can interpolate variables and other expressions into strings:
// Print: // Little pig 1 // Little pig 2 // Little pig 3 main: for i := 1; i <= 3; i++: print " Little pig $i"
The variable after the dollar sign is turned into a string
by calling its stringify
method. The variable stops at
the first non-alpha-numeric character, but the expression
can be extended using dot or []
notation:
main: i := 42 print "Level $i, the band." // >> Level 42, the band. j := -42 print "Level $j.abs, the band." // >> Level 42, the band. list := [0, 42, 103] print "Level $list[1], the band." // >> Level 42, the band.
If the variable is followed immediately by alphanumeric
characters or we want a more complicated interpolation
expression we can put parentheses, ()
, around an arbitrary
expression:
main: print "Level $(6 * 7), the band." // >> Level 42, the band. j := -42 print "Level $(-j), the band." // >> Level 42, the band. i := 42 print "Almost $(i)mm." // >> Almost 42mm.
The string interpolation feature also has support for some
printf-style formatting options. Unlike printf
the
formatting directives are directly adjacent to the value
they are formatting, so there is no need to count function
arguments:
import math show PI main: // %f can be used to specify the number of decimals for // floating point numbers: print "PI $(%.2f PI)" // >> 3.14 // %x is for hexadecimal numbers: print "In hex 0x$(%x 42)." // >> In hex 0x2a. // %02x pads a hex number with zeros: print "CR is 0x$(%02x 0xd)." // >> CR is 0x0d. // %4d pads a decimal number with spaces: // Prints: // > 0< // > 1< // > 8< // > 27< // > 64< // > 125< 6.repeat: print ">$(%4d it * it * it)<" // %c inserts a character by its Unicode code point (rune): // Prints: Life, the Universe and *. print "Life, the Universe and $(%c 42)." // %o writes in octal: print "The VAX answer: $(%03o 42)." // >> The VAX answer: 052. // %b writes in binary: print "The year $(%b 2010)." // >> The year 11111011010.
For other bases than 2, 8, 10, and 16, see stringify.
The full set of formatting options includes left-padding, and center-padding. See the documentation on the static string method, format.