3.7. Strings

Strings are null-terminated sequences of bytes representing sequences of characters.

The usual ASCII characters are represented with a single byte. Some characters are represented with multiple bytes. Most Lush functions deal with strings as sequences of bytes without regard to their character interpretation. Exceptions to this rule are indicated when appropriate.

The textual representation of a string is composed of the characters enclosed between double quotes. A string may contain macro-characters, parentheses, semicolons, as well as any other character. A backslash at the end of a line indicates that the string continues on the next line.

The following ``C style'' escape sequences are recognized inside a string:

\\ for a single backslash,
\" for a double quote,
\n , \r , \t , \b , \f for, respectively, a linefeed character (ASCII LF), a carriage return (ASCII CR), a tab character (ASCII TAB), a backspace character (ASCII BS), and a formfeed character (ASCII FF),
\e for an end-of-file character (Stdio's EOF),
\^? for a control character control-? ,
\ooo for a byte whose octal representation is ooo ,
\xhh for a byte whose hexadecimal representation is hh ,
\uhhhh or \Uhhhhhh for the representation of Unicode character hhhh or hhhhhh in the current locale. If no such representation exists, the UTF-8 representation is used.

3.7.0. Basic String Functions

Like most Lush functions, the basic functions operating on strings do not modify their arguments. Instead, they create a new string on the basis of their arguments.

See: (> n1 n2 )

3.7.0.0. (concat s1 ... sn)

[DX]

Concatenates strings s1 to sn .

Example:

? (concat "hello" " my friends")
= "hello my friends"

3.7.0.1. (len s)

[DX]

Returns the number of bytes in string s .

Example:

? (len "abcd")
= 4
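
Because len counts bytes rather than characters, a string containing multibyte characters is longer than it looks. The following Python sketch is only an illustration of that arithmetic, not Lush code; the helper name byte_len is hypothetical and the string is assumed to be stored as UTF-8.

```python
# Illustration only: models the byte count that (len s) reports,
# assuming the string is stored as UTF-8.
def byte_len(s: str) -> int:
    return len(s.encode("utf-8"))

print(byte_len("abcd"))  # 4: one byte per ASCII character
print(byte_len("€23"))   # 5: the euro sign occupies three bytes
```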

3.7.0.2. (mid s n [l])

[DX]

Returns a substring of s composed of l bytes starting at byte position n . The position n is a number between 1 and the byte length of the string minus 1. When argument l is omitted, function mid returns characters until the end of the string s .

Example:

? (mid "alphabet" 3 2)
= "ph"

? (mid "alphabet" 3)
= "phabet"

3.7.0.3. (right s n)

[DX]

Returns a string composed of the n rightmost bytes of s .

Example:

? (right "alphabet" 3)
= "bet"

3.7.0.4. (left s n)

[DX]

Returns a string composed of the n leftmost bytes of s .

Example:

? (left "alphabet" 3)
= "alp"
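
The position arithmetic of mid , right and left can be sketched in Python. This is a hypothetical model inferred from the examples above, not the Lush implementation: strings are treated as byte sequences, positions are 1-based, and the handling of out-of-range arguments is an assumption.

```python
from typing import Optional

# Hypothetical models of mid, left and right (not the Lush sources).
def mid(s: bytes, n: int, l: Optional[int] = None) -> bytes:
    start = n - 1                           # 1-based position to 0-based index
    end = len(s) if l is None else start + l
    return s[start:end]

def left(s: bytes, n: int) -> bytes:
    return s[:n]                            # first n bytes

def right(s: bytes, n: int) -> bytes:
    return s[max(0, len(s) - n):]           # last n bytes

print(mid(b"alphabet", 3, 2))   # b'ph'
print(mid(b"alphabet", 3))      # b'phabet'
print(right(b"alphabet", 3))    # b'bet'
print(left(b"alphabet", 3))     # b'alp'
```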

3.7.0.5. (strins s1 n s2)

[DX]

Inserts string s2 into string s1 after the first n bytes and returns the result. When n is equal to 0, the strins function actually concatenates s2 and s1 .

Example:

? (strins "alphabet" 3 "***")
= "alp***habet"

3.7.0.6. (strdel s1 n l)

[DX]

Removes l bytes from string s1 starting at byte position n , and returns the result.

Example:

? (strdel "alphabet" 3 2)
= "alabet"
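
The two examples above suggest that strins counts n as the number of bytes kept before the insertion, whereas strdel treats n as the 1-based position of the first byte removed. The Python sketch below models that reading; it is an assumption drawn from the examples, not the Lush implementation.

```python
# Hypothetical models of strins and strdel, inferred from the examples.
def strins(s1: bytes, n: int, s2: bytes) -> bytes:
    # keep the first n bytes of s1, then s2, then the rest of s1
    return s1[:n] + s2 + s1[n:]

def strdel(s1: bytes, n: int, l: int) -> bytes:
    # drop l bytes starting at 1-based position n
    return s1[:n - 1] + s1[n - 1 + l:]

print(strins(b"alphabet", 3, b"***"))  # b'alp***habet'
print(strins(b"alphabet", 0, b"***"))  # b'***alphabet'
print(strdel(b"alphabet", 3, 2))       # b'alabet'
```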

3.7.0.7. (index s r [n])

[DX]

Searches for the first occurrence of the string s in the string r , starting at byte position n . index returns the position of the first match. If such an occurrence cannot be found, it returns the empty list.

Example:

? (index "pha" "alpha alphabet alphabetical" 4)
= 9
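
A minimal Python sketch of the search, assuming that both the start position n and the returned position are 1-based and that n defaults to 1 (neither assumption is stated explicitly above). The empty list is modeled by None .

```python
from typing import Optional

# Hypothetical model of (index s r [n]); not the Lush implementation.
def index(s: bytes, r: bytes, n: int = 1) -> Optional[int]:
    pos = r.find(s, n - 1)          # bytes.find uses 0-based offsets
    return None if pos < 0 else pos + 1

text = b"alpha alphabet alphabetical"
print(index(b"pha", text, 4))   # 9
print(index(b"pha", text))      # 3
print(index(b"xyz", text))      # None
```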

3.7.0.8. (upcase s)

[DX]

Returns string s with all characters converted to uppercase according to the current locale.

Example:

? (upcase "alphabet")
= "ALPHABET"

3.7.0.9. (upcase1 s)

[DX]

Returns string s with its first character converted to uppercase according to the current locale.

Example:

? (upcase1 "alphabet")
= "Alphabet"

3.7.0.10. (downcase s)

[DX]

Returns string s with all characters converted to lowercase according to the current locale.

Example:

? (downcase "alPHABet")
= "alphabet"

3.7.0.11. (val s)

[DX]

Returns the numerical value of s considered as a number. Returns the empty list if s does not represent a decimal or hexadecimal number.

Example:

? (val "3.14")
= 3.14

? (val "abcd")
= ()

? (val "0xABCD")
= 43981

3.7.0.12. (str n)

[DX]

Returns the decimal string representation of the number n .

Example:

? (str (2* 3.14))
= "6.28"

3.7.0.13. (strhex n)

[DX]

Returns the hexadecimal string representation of integer number n .

Example:

? (strhex 18)
= "0x12"

3.7.0.14. (strgptr p)

[DX]

Returns the hexadecimal string representation of pointer p preceded by an ampersand.

3.7.0.15. (asc s)

[DX]

Returns the value of the first byte of string s . This function causes an error if s is an empty string.

Example:

? (asc "abcd")
= 97

3.7.0.16. (chr n)

[DX]

Returns a string containing a single byte whose value is n . Integer n must be in range 0 to 255.

Example:

? (chr 48)
= "0"

3.7.0.17. (isprint s)

[DX]

Returns t if string s contains only printable characters according to the current locale.

Example:

? (isprint "alpha bet")
= t

? (isprint "alpha\^Cbet")
= ()

3.7.0.18. (pname l)

[DX]

Returns a string representation for the lisp object l . pname is able to give a string representation for numbers, strings, symbols, lists, etc.

Example:

? (pname (cons 'a '(b c)))
= "(a b c)"

3.7.0.19. (sprintf format ... args ... )

[DX]

Like the C language function sprintf , this function returns a copy of the format string format in which the following escape sequences are replaced by a representation of the corresponding arguments of sprintf :

"%%" is replaced by a single %.
"%l" is replaced by a representation of a lisp object.
"%[-][n]s" is replaced by a string, right justified in a field of length n if n is specified. When the optional minus sign is present, the string is left justified.
"%[-][n]d" is replaced by an integer, right justified in a field of n characters, if n is specified. When the optional minus sign is present, the integer is left justified.
"%[-][n[.m]]c" where c is one of the characters e , f or g , is replaced by a floating point number in a n character field, with m digits after the decimal point. e specifies a format with an exponent, f specifies a format without an exponent, and g uses whichever format is more compact. When the optional minus sign is present, the number is left justified.

Example:

? (sprintf "%5s(%3d) is equal to %6.3f\n" "sqrt" 2 (sqrt 2))
= " sqrt(  2) is equal to  1.414\n"
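
The %s , %d and %e/%f/%g directives follow the conventions of C's printf, which Python's % operator also implements. As a cross-check only (this is Python, not Lush, and "%l" has no Python counterpart), the example above can be reproduced as follows:

```python
import math

# Same format string as the Lush example, evaluated with Python's % operator.
line = "%5s(%3d) is equal to %6.3f\n" % ("sqrt", 2, math.sqrt(2))
print(repr(line))             # ' sqrt(  2) is equal to  1.414\n'

# The optional minus sign left-justifies the value in its field.
print(repr("%-5s|" % "ab"))   # 'ab   |'
```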

3.7.0.20. (strip s)

[DE] (sysenv.lsh)

This function deletes the leftmost and rightmost spaces in string s .

(strip "  This sentence is full   of spaces.   ")

3.7.0.21. (stripl s)

[DE] (sysenv.lsh)

This function deletes the leftmost spaces in string s .

(stripl "  This sentence is full   of spaces.   ")

3.7.0.22. (stripr s)

[DE] (sysenv.lsh)

This function deletes the rightmost spaces in string s .

(stripr "  This sentence is full   of spaces.   ")

3.7.1. Regular Expressions (regex)

A regular expression describes a family of strings built according to the same pattern. A regular expression is represented by a string which ``matches'' (using certain conventions) any string in the family. Lush provides four regular expression primitives ( regex-match , regex-extract , regex-seek , and regex-subst ) and several library functions.

The conventions for describing regular expressions in Lush are quite similar to those used by the egrep unix utility:

An ordinary character matches itself. Some characters, ( ) \ [ ] | . ? * and + , have a special meaning and must be quoted by prepending a backslash \ in order to match literally. The string "\\\\" is actually composed of two backslashes (because backslashes in strings must themselves be escaped), and thus matches a single backslash.
A dot . matches any byte.
A caret ^ matches the beginning of the string.
A dollar sign $ matches the end of the string.
A range specification matches any specified byte. For example, regular expression [YyNn] matches Y y N or n , regular expression [0-9] matches any digit, regular expression [^0-9] matches any byte that is not a digit, regular expression []A-Za-z] matches a closing bracket, or any uppercase or lowercase letter.
The concatenation of two regular expressions matches the concatenation of two strings, each matching the corresponding regular expression. Regular expressions can be grouped with parentheses, and modified by the ? + and * characters.
A regular expression followed by a question mark ? matches 0 or 1 instance of the single regular expression.
A regular expression followed by a plus sign + matches 1 or more instances of the single regular expression.
A regular expression followed by a star * matches 0 or more instances of the single regular expression.
Finally, two regular expressions separated by a bar | match any string matching the first or the second regular expression.

Parentheses can be used to group regular expressions. For instance, the regular expression "(+|-)?[0-9]+(\\.[0-9]*)?" matches a number with an optional sign and an optional fractional part. Furthermore, there is a ``register'' associated with each parenthesized part of a regular expression. The matching routines use these registers to keep track of the characters matched by the corresponding part of the regular expression. This is useful with functions regex-extract and regex-subst .

3.7.1.0. (regex-match r s)

[DX]

Returns t if regular expression r exactly matches the entire string s . Returns the empty list otherwise.

Example:

? (regex-match "(+|-)?[0-9]+(\\.[0-9]*)?" "-56")
= t

3.7.1.1. (regex-extract r s)

[DX]

If regular expression r matches the entire string s , this function returns a list of strings representing the contents of each register, that is to say the characters matched by each section of the regular expression r delimited by parentheses. This is useful for extracting specific segments of a string.

If the regular expression r does not match the string s , function regex-extract returns the empty list. If the regular expression r matches the string but does not contain parentheses, this function returns a list containing the initial string s .

Example:

? (regex-extract "(+|-)?([0-9]+)(\\.[0-9]*)?" "-56.23")
= ("-" "56" ".23")
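
The whole-string matching and register extraction described above can be sketched with Python's re module. This is a rough model, not the Lush matcher: Python requires the plus sign inside the first group to be escaped, the empty list is modeled by None , and the treatment of registers that take no part in the match is an assumption.

```python
import re
from typing import List, Optional

# Python spelling of the Lush pattern "(+|-)?([0-9]+)(\\.[0-9]*)?"
NUMBER = r"(\+|-)?([0-9]+)(\.[0-9]*)?"

def regex_match(r: str, s: str) -> bool:
    # True only when r matches the entire string s
    return re.fullmatch(r, s) is not None

def regex_extract(r: str, s: str) -> Optional[List[Optional[str]]]:
    m = re.fullmatch(r, s)
    if m is None:
        return None                 # no match
    if m.re.groups == 0:
        return [s]                  # no parentheses: the whole string
    return list(m.groups())         # one entry per register

print(regex_match(NUMBER, "-56"))        # True
print(regex_match(NUMBER, "-56x"))       # False
print(regex_extract(NUMBER, "-56.23"))   # ['-', '56', '.23']
```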

3.7.1.2. (regex-seek r s [start])

[DX]

Searches for the first substring in s that matches the regular expression r , starting at position start in s . If the argument start is not provided, string s is searched from the beginning.

If such a substring is found, regex-seek returns a list (begin length) , where begin is the index of the first character of the substring, and length is the length of the substring. The instruction (mid s begin length) may be used to extract this substring.

If no such substring exists, regex-seek returns the empty list.

Example:

? (regex-seek "(+|-)?[0-9]+(\\.[0-9]*)?," "a=56.2, b=57,")
= (3 5)
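
A hedged Python model of the (begin length) result, assuming 1-based positions for both begin and start (the example above is consistent with this, but the convention for start is not stated). The helper name regex_seek is hypothetical, the plus sign is escaped as Python requires, and None stands for the empty list.

```python
import re
from typing import Optional, Tuple

def regex_seek(r: str, s: str, start: int = 1) -> Optional[Tuple[int, int]]:
    m = re.compile(r).search(s, start - 1)   # Python offsets are 0-based
    if m is None:
        return None
    return (m.start() + 1, m.end() - m.start())

pattern = r"(\+|-)?[0-9]+(\.[0-9]*)?,"
s = "a=56.2, b=57,"
found = regex_seek(pattern, s)
print(found)                                  # (3, 5)
begin, length = found
print(s[begin - 1:begin - 1 + length])        # 56.2,  (what mid would extract)
print(regex_seek(pattern, s, 8))              # (11, 3)
```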

3.7.1.3. (regex-subst r s str)

[DX]

Replaces all substrings of str that match regular expression r with string s .

A ``register'' is associated with each piece of the regular expression r enclosed within parentheses. Registers are numbered from %0 to %9 . During each match, the substring of str matching each piece of the regular expression is stored into the corresponding register.

During the replacement process, the sequences %0 to %9 in the replacement string s are replaced by the contents of the corresponding register. (A single % is denoted as %% ).

Example:

? (regex-subst "([a-h])([1-8])" "%1%0" "e2-e4, d7-d5, d2-d4, d5xd4?")
= "2e-4e, 7d-5d, 2d-4d, 5dx4d?"
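
The register substitution can be modeled in Python as follows. This is a sketch under stated assumptions rather than the Lush implementation: register %0 is mapped to Python's first group (as the example above implies), a register that took no part in the match expands to the empty string, and the helper name regex_subst is hypothetical.

```python
import re

def regex_subst(r: str, s: str, text: str) -> str:
    def expand(m: "re.Match") -> str:
        # Expand %0..%9 and %% inside the replacement string s.
        def register(tok: "re.Match") -> str:
            t = tok.group(1)
            if t == "%":
                return "%"
            return m.group(int(t) + 1) or ""   # %0 is Python group 1
        return re.sub(r"%([0-9%])", register, s)
    return re.sub(r, expand, text)

moves = "e2-e4, d7-d5, d2-d4, d5xd4?"
print(regex_subst("([a-h])([1-8])", "%1%0", moves))
# 2e-4e, 7d-5d, 2d-4d, 5dx4d?
```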

3.7.1.4. (regex-rseek r s [n [gr]])

[DE] (sysenv.lsh)

This function recursively seeks occurrences of r in s and returns the list of their locations.

When argument n is provided, it seeks and returns the locations of the first n occurrences, and it returns () on failure.

Optional regex gr defines the garbage allowed before and between occurrences. When n is not provided, this function also checks the garbage after the occurrences. If disallowed garbage is found, the function returns () . By default, any garbage is allowed.

Since even empty garbage is checked, a caret "^" is often added to gr .

3.7.1.5. (regex-split r s [n [gr [neg]]])

[DE] (sysenv.lsh)

This function splits a string s into occurrences of r .

When integer n is provided, this function returns only the first n occurrences.

When regex gr is provided, garbage is checked (see function regex-rseek ).

When neg is provided and non nil, this function returns the garbage instead. When both n and neg are provided and non nil, the n pieces of garbage before and between the first n occurrences are returned.

3.7.1.6. (regex-skip r s [n [gr [neg]]])

[DE] (sysenv.lsh)

This function skips the first n occurrences of regex r in a string s .

When n is equal to 0, it returns s . When n is less than 0, it generates an error. When n is either nil or undefined, it is set to 1.

When neg is either nil or undefined, it returns the right residual of s just following the n th occurrence.

When neg is not nil, it returns the right residual of s beginning with the n th occurrence.

When regex gr is provided, garbage is checked (see function regex-rseek ).

3.7.1.7. (regex-count r s)

[DE] (sysenv.lsh)

This function recursively seeks the occurrences of regex r in string s and returns the number of occurrences found.

3.7.1.8. (regex-tail r s [n [gr [neg]]])

[DE] (sysenv.lsh)

This function recursively seeks the occurrences of regex r in string s .

When neg is either nil or undefined, it returns the right residual of s beginning before the n th last occurrence.

When neg is non nil, it returns the right residual of s beginning after the n th last occurrence (and thus beginning before the n th garbage).

When n is either nil or undefined, it is set to 1.

When regex gr is provided, garbage is checked (see function regex-rseek ).

3.7.1.9. (regex-member rl s)

[DE] (sysenv.lsh)

This function returns the first member of list rl which is a matching regex for string s .

3.7.2. International Strings

Lush contains partial support for multibyte strings using an encoding specified by the locale. This is work in progress.

3.7.2.0. (locale-to-utf8 s)

[DX]

Converts a string from locale encoding to UTF-8 encoding. This is a best effort function: The unmodified string is returned if the conversion is impossible, either because the string s is incorrect, or because the system does not provide suitable conversion facilities.

3.7.2.1. (utf8-to-locale s)

Converts a string from UTF-8 encoding to locale encoding. This is a best effort function: The unmodified string is returned if the conversion is impossible, either because the string s is incorrect, or because the system does not provide suitable conversion facilities.

3.7.2.2. (explode-chars s)

[DX]

Returns a list of integers with the wide character codes of all characters in the string. This function interprets multibyte sequences according to the encoding specified by the current locale.

Example (under a UTF8 locale):

? (explode-chars "\xe2\x82\xac")
= (8364)

3.7.2.3. (implode-chars l)

[DX]

Returns a string composed of the characters whose wide character codes are specified by the list of integers l . Multibyte characters are generated according to the current locale.

Example (under a UTF8 locale):

? (implode-chars '(8364 50 51 46 53 32 61 32 32 162 50 51 53 48))
= "€23.5 =  ¢2350"

3.7.2.4. (explode-bytes s)

[DX]

Returns a list of integers representing the sequence of bytes in string s , regardless of their character interpretation.

Example:

? (explode-bytes "€")
= (226 130 172)

3.7.2.5. (implode-bytes l)

[DX]

Assembles a string composed of the bytes whose values are specified by the list of integers l , regardless of their multibyte representation.

Example:

? (implode-bytes '(226 130 172 50 51))
= "€23"
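
The difference between the character view and the byte view of a string can be reproduced in Python. This is an illustration assuming a UTF-8 locale, not Lush code: a Python str plays the role of the character sequence, its UTF-8 encoding plays the role of the underlying bytes, and the four helper names are hypothetical.

```python
from typing import List

def explode_chars(s: str) -> List[int]:
    return [ord(c) for c in s]               # one code per character

def implode_chars(codes: List[int]) -> str:
    return "".join(chr(c) for c in codes)

def explode_bytes(s: str) -> List[int]:
    return list(s.encode("utf-8"))           # one value per byte

def implode_bytes(values: List[int]) -> str:
    return bytes(values).decode("utf-8")

print(explode_chars("€"))                         # [8364]
print(explode_bytes("€"))                         # [226, 130, 172]
print(implode_bytes([226, 130, 172, 50, 51]))     # €23
```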