Class \Scrivo\String

Wrapper class for PHP strings to enforce consistent and safe multi-byte (UTF-8) string handling.

\Scrivo\String is a primitive wrapper class for PHP strings that makes sure all operations performed on the string are UTF-8 safe. Because PHP does not enforce a consistent way to deal with multibyte strings, we do it ourselves. In the Scrivo code base UTF-8 is the only encoding supported for operations on data, and these operations should be done through instances of the \Scrivo\String class. If strings are used as byte arrays, use the ByteArray class instead.

\Scrivo\String objects are immutable: once created you can't change them. All operations on a \Scrivo\String object will return a new \Scrivo\String object.

Although we work with UTF-8 exclusively, it is possible to create \Scrivo\String objects from strings that contain characters from 8-bit encoding schemes. Also a note on HTML entities: we work with UTF-8 so you don't need them, they are evil. Except for the entities for the reserved HTML characters (<>&'"), there is really no use for them in UTF-8 strings, and when stored in a database they only cause sorting and lookup errors. Therefore, when constructing \Scrivo\String objects you can opt to convert existing HTML entities to their corresponding UTF-8 characters.

The current locale setting for LC_COLLATE is important. \Scrivo\String::compareTo() will use this setting when comparing strings.

Please note: you might be tempted to compare strings using the equality operator (==). Although this works in most cases, don't do it: you'll perform a PHP object comparison (i.e. compare the \Scrivo\String objects with all their data members) and that is not what you want. Use \Scrivo\String::equals() or \Scrivo\String::compareTo() to compare strings.
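
The pitfall can be reproduced with a small sketch. It uses a hypothetical stand-in class, Utf8Demo (not part of Scrivo), that caches its character count lazily, on the assumption that this resembles how the $len member with its initial value of -1 is used. The mbstring extension is assumed to be available.

```php
<?php
// Hypothetical stand-in for illustration only; this is not the Scrivo class.
final class Utf8Demo
{
    private $str;
    private $len = -1; // character count, computed on first use

    public function __construct($str)
    {
        $this->str = $str;
    }

    public function getLength()
    {
        if ($this->len === -1) {
            $this->len = mb_strlen($this->str, 'UTF-8');
        }
        return $this->len;
    }

    // Compare the wrapped strings only, ignoring any cached state.
    public function equals(Utf8Demo $other)
    {
        return $this->str === $other->str;
    }
}

$a = new Utf8Demo("café");
$b = new Utf8Demo("café");
$a->getLength(); // fills the cache in $a but not in $b

var_dump($a == $b);       // bool(false): == also compares the cached $len
var_dump($a->equals($b)); // bool(true)
```

Both objects wrap the same text, yet == reports a difference because one of them has already computed its length.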

Implements: ArrayAccess, Countable, Iterator
Defined in: String.php.


Constructor summary

Attr. Name / Description
public String($str, $toDecode, $encoding) Construct a \Scrivo\String.

Constant summary

Name Description
DECODE_ALL Constant to indicate that you want to decode all entities when constructing the string.
DECODE_NONE Constant to indicate that you don't want to decode any entities when constructing the string.
DECODE_UNRESERVED Constant to indicate that you want to decode all but the entities for reserved characters (&<>'") when constructing the string.
ENC_CP_1251 Constant to denote CP-1251 encoding.
ENC_ISO_8859_1 Constant to denote ISO-8859-1 encoding.

Member summary

Attr. Type Name Description
private static Collator $coll Collator used for sorting.
private int $len The length of the string (characters not bytes).
private static array[] $maps Map to translate 8-bit code page characters to UTF-8 sequences.
private string $pos The current position when iterating.
private string $str The primitive UTF-8 string.

Method summary

Attr. Type Name / Description
public mixed __get($name) Implementation of the readable properties using the PHP magic method __get().
public string __toString() Return the primitive UTF-8 string for this instance.
public int compareTo($str) Compare this string to another \Scrivo\String object.
public boolean contains($str, $offset, $ignoreCase) Check if the string contains the given substring.
public int count() Return the character count of the string.
public static \..\String create($str, $toDecode, $encoding) Factory method to construct a \Scrivo\String.
public string current() Return the current UTF-8 character when iterating.
public boolean equals($str) Test if this string equals another \Scrivo\String object.
public \..\String firstOccurranceOf($str, $part, $ignoreCase) Returns the first occurrence of a given substring in this string.
private string fixCodePageString($str, $encoding) Convert a string with UTF-8 and code page characters to a valid UTF-8 string.
private string fixString($str, $toDecode, $encoding) Convert a string with HTML entities, UTF-8 and code page characters to a valid UTF-8 string.
public static Collator getCollator() Get the collator for sorting strings.
public int getLength() Get the length of the string.
public mixed inArray($arr) Check if this string exists in an array of \Scrivo\String objects.
public int indexOf($str, $offset, $ignoreCase) Returns the index of the given substring in this string.
private int isUtf8Sequence($seq) Test if a given byte sequence is a valid UTF-8 sequence.
public int key() Return the index of the current UTF-8 character when iterating.
public int lastIndexOf($str, $offset, $ignoreCase) Returns the index of the last occurrence of the given substring in this string.
public \..\String lastOccurranceOf($str, $part, $ignoreCase) Returns the last occurrence of a given character in this string.
public next() Move forward in this string to the next UTF-8 character when iterating.
public boolean offsetExists($offset) Check if the specified index location in this string is valid.
public offsetGet($offset) Get a UTF-8 character from a string using array brackets.
public offsetSet($offset, $value) Illegal method: set a character at a specified index location.
public offsetUnset($offset) Illegal method: unset a character at a specified index location.
public \..\String replace($from, $to) Replace a substring or set of substrings in this string.
public rewind() Reset the current character index so iterating will (re)start at the beginning of this string.
public static setCollator($coll) Set the collator for sorting strings.
public \..\String[] split($delimiter, $limit) Split this string using a delimiter.
public \..\String substr($start, $length) Get a substring from a string using an offset and a length.
public \..\String substring($start, $end) Get a substring from a string using a start and end index.
public \..\String toLowerCase() Get a copy of this string with all of its characters converted to lower case.
public \..\String toUpperCase() Get a copy of this string with all of its characters converted to upper case.
public \..\String trim() Get a trimmed copy of this string.
private \..\String unsafeSubstr($start, $length) Get a substring from a string without first checking the boundaries.
public boolean valid() Check if the current character index for iterating is valid.


Constructor

public String(string $str="", int $toDecode=self::DECODE_NONE, string $encoding="UTF-8")

Construct a \Scrivo\String.

You can either construct an \Scrivo\String object from a valid UTF-8 string, or from a string that you expect not to contain valid UTF-8 data. In the latter case use the $toDecode and/or $encoding parameters.

Possible choices for $toDecode are:

  • \Scrivo\String::DECODE_NONE: don't decode HTML entities;
  • \Scrivo\String::DECODE_ALL: decode all HTML entities;
  • \Scrivo\String::DECODE_UNRESERVED: decode all but the HTML entities for the reserved characters &<>'" (HTML/XML).

If you expect the source string to contain 8-bit code page characters, you can select the encoding used to convert them to their corresponding UTF-8 characters. Supported encodings are:

  • \Scrivo\String::ENC_ISO_8859_1
  • \Scrivo\String::ENC_CP_1251

Note: typical use of the $toDecode and $encoding parameters is to 'sanitize' data before you store it in a database. Setting these parameters starts CPU-intensive procedures, so it's best not to use them in bulk operations (like that inner loop or slashdotted home page). And remember: once all data is safely stored as UTF-8, there is no need to 'sanitize' it before displaying.
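
As a rough illustration of what decoding all but the reserved entities amounts to, here is a minimal sketch built on native PHP functions only. The function name decodeUnreservedEntities is hypothetical and the approach is an assumption; it is not taken from the Scrivo source.

```php
<?php
// Hypothetical helper for illustration; not the Scrivo implementation.
function decodeUnreservedEntities(string $html): string
{
    return preg_replace_callback(
        '/&#?[A-Za-z0-9]+;/',
        function (array $m): string {
            $decoded = html_entity_decode($m[0], ENT_QUOTES | ENT_HTML5, 'UTF-8');
            // Leave entities for the reserved characters untouched.
            return in_array($decoded, ['<', '>', '&', "'", '"'], true)
                ? $m[0]
                : $decoded;
        },
        $html
    );
}

echo decodeUnreservedEntities('caf&eacute; &amp; cr&egrave;me &lt;b&gt;'), "\n";
// café &amp; crème &lt;b&gt;
```

Entities that html_entity_decode() does not recognise are passed through unchanged.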

Parameters:

Type Name Def. Description
string $str ""

The source string, a possible mixture of HTML entities, UTF-8 and code page characters.

int $toDecode self::DECODE_NONE

Which entities to decode.

string $encoding "UTF-8"

The encoding to use when converting 8-bit code page characters to UTF-8.


Constants

DECODE_ALL

Constant to indicate that you want to decode all entities when constructing the string.

Value: 1

DECODE_NONE

Constant to indicate that you don't want to decode any entities when constructing the string.

Value: 0

DECODE_UNRESERVED

Constant to indicate that you want to decode all but the entities for reserved characters (&<>'") when constructing the string.

Value: 2

ENC_CP_1251

Constant to denote CP-1251 encoding.

Value: "CP-1251"

ENC_ISO_8859_1

Constant to denote ISO-8859-1 encoding.

This is the default encoding \Scrivo\String uses for fixing and comparing.

Value: "ISO-8859-1"


Members


				
private static \Collator $coll

Collator used for sorting.

This is a static shared amongst instances.


				
private int $len

The length of the string (characters not bytes).

Initial value: -1


				
private static array[] $maps

Map to translate 8-bit code page characters to UTF-8 sequences.

Inital value: array(self::ENC_ISO_8859_1 => array(128 => "€", "�", "‚", "ƒ", "„", "…", "†", "‡", "ˆ", "‰", "Š", "‹", "Œ", "�", "Ž", "�", "�", "‘", "’", "“", "”", "•", "–", "—", "˜", "™", "š", "›", "œ", "�", "ž", "Ÿ", " ", "¡", "¢", "£", "¤", "¥", "¦", "§", "¨", "©", "ª", "«", "¬", "­", "®", "¯", "°", "±", "²", "³", "´", "µ", "¶", "·", "¸", "¹", "º", "»", "¼", "½", "¾", "¿", "À", "Á", "Â", "Ã", "Ä", "Å", "Æ", "Ç", "È", "É", "Ê", "Ë", "Ì", "Í", "Î", "Ï", "Ð", "Ñ", "Ò", "Ó", "Ô", "Õ", "Ö", "×", "Ø", "Ù", "Ú", "Û", "Ü", "Ý", "Þ", "ß", "à", "á", "â", "ã", "ä", "å", "æ", "ç", "è", "é", "ê", "ë", "ì", "í", "î", "ï", "ð", "ñ", "ò", "ó", "ô", "õ", "ö", "÷", "ø", "ù", "ú", "û", "ü", "ý", "þ", "ÿ"), self::ENC_CP_1251 => array(128 => "Ђ", "Ѓ", "‚", "ѓ", "„", "…", "†", "‡", "€", "‰", "Љ", "‹", "Њ", "Ќ", "Ћ", "Џ", "ђ", "‘", "’", "“", "”", "•", "–", "—", "�", "™", "љ", "›", "њ", "ќ", "ћ", "џ", " ", "Ў", "ў", "Ј", "¤", "Ґ", "¦", "§", "Ё", "©", "Є", "«", "¬", "­", "®", "Ї", "°", "±", "І", "і", "ґ", "µ", "¶", "·", "ё", "№", "є", "»", "ј", "Ѕ", "ѕ", "ї", "А", "Б", "В", "Г", "Д", "Е", "Ж", "З", "И", "Й", "К", "Л", "М", "Н", "О", "П", "Р", "С", "Т", "У", "Ф", "Х", "Ц", "Ч", "Ш", "Щ", "Ъ", "Ы", "Ь", "Э", "Ю", "Я", "а", "б", "в", "г", "д", "е", "ж", "з", "и", "й", "к", "л", "м", "н", "о", "п", "р", "с", "т", "у", "ф", "х", "ц", "ч", "ш", "щ", "ъ", "ы", "ь", "э", "ю", "я"))


				
private string $pos

The current position when iterating.


				
private string $str

The primitive UTF-8 string.


Methods

public mixed __get(string $name)

Implementation of the readable properties using the PHP magic method __get().

Parameters:

Type Name Def. Description
string $name

The name of the property to get.

Returns:

mixed Implementation of the readable properties using the PHP magic method __get().

public string __toString()

Return the primitive UTF-8 string for this instance.

Returns:

string Return the primitive UTF-8 string for this instance.

public int compareTo(\Scrivo\String $str)

Compare this string to another \Scrivo\String object.

Note that this method requires the \Scrivo\String collator to be set, else the method falls back to the default locale for creating a collator and generates a warning.

Parameters:

Type Name Def. Description
\Scrivo\String $str

The string to compare this string to.

Returns:

int Compare this string to another \Scrivo\String object.

public boolean contains(\Scrivo\String $str, int $offset=0, boolean $ignoreCase=false)

Check if the string contains the given substring.

This is the test you normally use strpos(...) !== false for.

Parameters:

Type Name Def. Description
\Scrivo\String $str

The string to search for.

int $offset 0

An offset from where to start the search.

boolean $ignoreCase false

Set to true to perform a case-insensitive lookup.

Returns:

boolean Check if the string contains the given substring.

Throws:

Exception Type Description
\Scrivo\SystemException If the $offset is out of range.
public int count()

Return the character count of the string.

This is an alias for getLength() and part of the implementation of Countable.

Returns:

int Return the character count of the string.

public static \Scrivo\String create(string $str="", int $toDecode=self::DECODE_NONE, string $encoding="UTF-8")

Factory method to construct a \Scrivo\String.

Parameters:

Type Name Def. Description
string $str ""

The string to create the wrapper for. It is assumed that this will be a valid UTF-8 string. If this is not the case, you'll need to set the additional parameters.

int $toDecode self::DECODE_NONE

Which entities to decode.

string $encoding "UTF-8"

The encoding to use when converting 8-bit code page characters to UTF-8.

Returns:

\Scrivo\String Factory method to construct a \Scrivo\String.

public string current()

Return the current UTF-8 character when iterating.

Note that this method is part of the implementation of Iterator and should not be called from another context.

Returns:

string Return the current UTF-8 character when iterating.

public boolean equals(\Scrivo\String $str)

Test if this string equals another \Scrivo\String object.

When you want to test \Scrivo\String objects for equality, use this method and never the equality operator (==): the operator compares the objects themselves, and therefore all data members of \Scrivo\String, which can give you different results. Alternatively, cast the \Scrivo\String objects to PHP strings before comparing.

Parameters:

Type Name Def. Description
\Scrivo\String $str

The string to compare this string to.

Returns:

boolean Test if this string equals another \Scrivo\String object.

public \Scrivo\String firstOccurranceOf(\Scrivo\String $str, int $part=false, boolean $ignoreCase=false)

Returns the first occurrence of a given substring in this string.

Like PHP's native strstr and stristr functions, this method returns the part of this string that starts with the first occurrence of the given substring. Note that this method throws an exception if an empty string is given as the search string, rather than raising a warning as strstr does.
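
The behaviour described above can be approximated with the mbstring functions. The sketch below is an assumption about the semantics, not the Scrivo implementation: the function name firstOccurrenceOfUtf8 is hypothetical, it works on plain PHP strings, and returning null when nothing is found is a choice made for this example only.

```php
<?php
// Hypothetical helper for illustration; not the Scrivo implementation.
function firstOccurrenceOfUtf8(
    string $haystack,
    string $needle,
    bool $part = false,
    bool $ignoreCase = false
): ?string {
    if ($needle === '') {
        throw new InvalidArgumentException('The search string must not be empty.');
    }
    $found = $ignoreCase
        ? mb_stristr($haystack, $needle, $part, 'UTF-8')
        : mb_strstr($haystack, $needle, $part, 'UTF-8');
    return $found === false ? null : $found;
}

$s = "crème brûlée";
var_dump(firstOccurrenceOfUtf8($s, " "));                // " brûlée"
var_dump(firstOccurrenceOfUtf8($s, " ", true));          // "crème"
var_dump(firstOccurrenceOfUtf8($s, "BRÛ", false, true)); // "brûlée"
```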

Parameters:

Type Name Def. Description
\Scrivo\String $str

The string to search for.

int $part false

Flag to indicate that the part of the string before the first occurrence of the given substring should be returned instead of the part after the substring.

boolean $ignoreCase false

Perform a case-insensitive lookup.

Returns:

\Scrivo\String Returns the first occurrence of a given substring in this string.

Throws:

Exception Type Description
\Scrivo\SystemException If an empty search string was given.
private string fixCodePageString(string $str, string $encoding)

Convert a string with UTF-8 and code page characters to a valid UTF-8 string.

When converting the input string to UTF-8, all bytes in the 0x80-0xFF range are first tested to see whether they are part of a valid UTF-8 byte sequence. If not, the byte is assumed to be an 8-bit code page character and is converted according to the given encoding. Supported encodings are:

  • \Scrivo\String::ENC_ISO_8859_1
  • \Scrivo\String::ENC_CP_1251
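
The procedure described above can be sketched as follows. Both function names are hypothetical and the code is an approximation under stated assumptions, not the Scrivo implementation: it relies on the mbstring extension and converts stray bytes as true ISO-8859-1, whereas the $maps member shown elsewhere on this page appears to map the 0x80-0x9F range to characters such as €.

```php
<?php
// Hypothetical helpers for illustration; these are not the Scrivo methods.

// Return the byte length of the valid UTF-8 sequence starting at $i, or 0.
function utf8SequenceLength(string $s, int $i): int
{
    $lead = ord($s[$i]);
    if ($lead >= 0xC2 && $lead <= 0xDF) {
        $n = 2;
    } elseif ($lead >= 0xE0 && $lead <= 0xEF) {
        $n = 3;
    } elseif ($lead >= 0xF0 && $lead <= 0xF4) {
        $n = 4;
    } else {
        return 0; // ASCII, a stray continuation byte or an invalid lead byte
    }
    $seq = substr($s, $i, $n);
    return (strlen($seq) === $n && mb_check_encoding($seq, 'UTF-8')) ? $n : 0;
}

// Keep valid UTF-8 as it is and convert every other high byte from Latin-1.
function fixLatin1(string $s): string
{
    $out = '';
    $len = strlen($s);
    for ($i = 0; $i < $len;) {
        if (ord($s[$i]) < 0x80) {
            $out .= $s[$i]; // plain ASCII
            $i++;
            continue;
        }
        $n = utf8SequenceLength($s, $i);
        if ($n > 0) {
            $out .= substr($s, $i, $n); // already valid UTF-8
            $i += $n;
        } else {
            $out .= mb_convert_encoding($s[$i], 'UTF-8', 'ISO-8859-1');
            $i++;
        }
    }
    return $out;
}

// "café" in UTF-8 followed by "crème" with a single-byte Latin-1 è.
echo fixLatin1("caf\xC3\xA9 cr\xE8me"), "\n"; // café crème
```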

Parameters:

Type Name Def. Description
string $str

The string with mixed UTF-8 and 8-bit code page characters.

string $encoding

The encoding to use when converting 8-bit code page characters to UTF-8.

Returns:

string Convert a string with UTF-8 and code page characters to a valid UTF-8 string.

private string fixString(string $str, int $toDecode=self::DECODE_NONE, string $encoding="UTF-8")

Convert a string with HTML entities, UTF-8 and code page characters to a valid UTF-8 string.

When converting the input string to UTF-8, all bytes in the 0x80-0xFF range are first tested to see whether they are part of a valid UTF-8 byte sequence. If not, the byte is assumed to be an 8-bit code page character and is converted according to the given encoding. Supported encodings are:

  • \Scrivo\String::ENC_ISO_8859_1
  • \Scrivo\String::ENC_CP_1251

You can opt to convert HTML entities in the string to their corresponding characters. Possible choices are:

  • \Scrivo\String::DECODE_NONE: don't decode HTML entities;
  • \Scrivo\String::DECODE_ALL: decode all HTML entities;
  • \Scrivo\String::DECODE_UNRESERVED: decode all but the HTML entities for the reserved characters &<>'" (HTML/XML).

Parameters:

Type Name Def. Description
string $str

The source string, a possible mixture of HTML entities, UTF-8 and code page characters.

int $toDecode self::DECODE_NONE

Which entities to decode.

string $encoding "UTF-8"

The encoding to use when converting 8-bit code page characters to UTF-8.

Returns:

string Convert a string with HTML entities, UTF-8 and code page characters to a valid UTF-8 string.

public static \Collator getCollator()

Get the collator for sorting strings.

Returns:

\Collator Get the collator for sorting strings.

public int getLength()

Get the length of the string.

Returns:

int Get the length of the string.

public mixed inArray(\Scrivo\String $arr)

Check if this string exists in an array of \Scrivo\String objects.

Parameters:

Type Name Def. Description
\Scrivo\String $arr

The array to search.

Returns:

mixed Check if this string exists in an array of \Scrivo\String objects.

public int indexOf(\Scrivo\String $str, int $offset=0, boolean $ignoreCase=false)

Returns the index of the given substring in this string.

Like PHP's native strpos and stripos functions, this method returns the index of a substring in this string. There are two important differences, however: this method returns -1 if the substring was not found, and it throws an exception if the given offset is out of range.
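
A minimal sketch of these semantics on plain PHP strings, using the mbstring extension. The function name indexOfUtf8 is hypothetical, and treating negative offsets as out of range is an assumption made for this example; the actual Scrivo method may differ.

```php
<?php
// Hypothetical helper for illustration; not the Scrivo implementation.
function indexOfUtf8(
    string $haystack,
    string $needle,
    int $offset = 0,
    bool $ignoreCase = false
): int {
    if ($offset < 0 || $offset > mb_strlen($haystack, 'UTF-8')) {
        throw new OutOfRangeException("Offset $offset is out of range.");
    }
    $pos = $ignoreCase
        ? mb_stripos($haystack, $needle, $offset, 'UTF-8')
        : mb_strpos($haystack, $needle, $offset, 'UTF-8');
    return $pos === false ? -1 : $pos;
}

$s = "crème brûlée";
var_dump(indexOfUtf8($s, "brûlée")); // int(6), counted in characters
var_dump(strpos($s, "brûlée"));      // int(7), counted in bytes
var_dump(indexOfUtf8($s, "x"));      // int(-1)
```

The difference between 6 and 7 comes from the è, which occupies two bytes in UTF-8 but counts as one character.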

Parameters:

Type Name Def. Description
\Scrivo\String $str

The string to search for.

int $offset 0

An offset from where to start the search.

boolean $ignoreCase false

Set to true to perform a case-insensitive lookup.

Returns:

int Returns the index of the given substring in this string.

Throws:

Exception Type Description
\Scrivo\SystemException If the $offset is out of range.
private int isUtf8Sequence(string $seq)

Test if a given byte sequence is a valid UTF-8 sequence.

If the tested byte sequence is a valid UTF-8 sequence the method returns the length of the sequence, else the method returns 0.

Parameters:

Type Name Def. Description
string $seq

The byte sequence to test.

Returns:

int Test if a given byte sequence is a valid UTF-8 sequence.

public int key()

Return the index of the current UTF-8 character when iterating.

Note that this method is part of the implementation of Iterator and should not be called from another context.

Returns:

int Return the index of the current UTF-8 character when iterating.

public int lastIndexOf(\Scrivo\String $str, int $offset=0, boolean $ignoreCase=false)

Returns the index of the last occurrence of the given substring in this string.

Like PHP's native strrpos and strripos functions, this method returns the index of the last occurrence of the given substring in this string. A negative offset, measured from the end of the string, is allowed. There are two important differences, however: this method returns -1 if the substring was not found, and it throws an exception if the given offset is out of range.

Parameters:

Type Name Def. Description
\Scrivo\String $str

The string to search for.

int $offset 0

An offset from where to start the search. A positive value indicates an offset measured from the start of the string, a negative value from the end of the string.

boolean $ignoreCase false

Perform a case-insensitive lookup.

Returns:

int Returns the index of the last occurrence of the given substring in this string.

Throws:

Exception Type Description
\Scrivo\SystemException If the $offset is out of range.
public \Scrivo\String lastOccurranceOf(\Scrivo\String $str, int $part=false, boolean $ignoreCase=false)

Returns the last occurrence of a given character in this string.

Like PHP's native strrchr function (and mb_strrichr for case-insensitive lookups), this method returns the part of this string that starts with the last occurrence of the given character. Note that this method throws an exception if the search string is not exactly one character long.

Parameters:

Type Name Def. Description
\Scrivo\String $str

The character to search for.

int $part false

Flag to indicate that the part of the string before the last occurrence of the given character should be returned instead of the part after the character.

boolean $ignoreCase false

Perform a case-insensitive lookup.

Returns:

\Scrivo\String Returns the last occurrence of a given character in this string.

Throws:

Exception Type Description
\Scrivo\SystemException If a search string of not exactly one character in length was given.
public next()

Move forward in this string to the next UTF-8 character when iterating.

Note that this method is part of the implementation of Iterator and should not be called from another context.

public boolean offsetExists(int $offset)

Check if the specified index location in this string is valid.

Note that this method is part of the implementation of ArrayAccess and should not be called from another context.

Parameters:

Type Name Def. Description
int $offset

A character offset in the string.

Returns:

boolean Check if the specified index location in this string is valid.

public offsetGet(int $offset)

Get a UTF-8 character from a string using array brackets.

Note that this method is part of the implementation of ArrayAccess and should not be called from another context.

Parameters:

Type Name Def. Description
int $offset

A character offset in the string.

Throws:

Exception Type Description
\Scrivo\SystemException If the requested offset was out of range.
public offsetSet(int $offset, string $value)

Illegal method: set a character at a specified index location.

Note that this method is part of the implementation of ArrayAccess. \Scrivo\Strings are immutable and therefore it is prohibited to set elements (characters) in a string, so this method implementation is not relevant and throws an exception if called.

Parameters:

Type Name Def. Description
int $offset
string $value

Throws:

Exception Type Description
\Scrivo\SystemException If this method is called.
public offsetUnset(int $offset)

Illegal method: unset a character at a specified index location.

Note that this method is part of the implementation of ArrayAccess. \Scrivo\Strings are immutable and therefore it is prohibited to unset elements (characters) in a string, so this method implementation is not relevant and throws an exception if called.
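
How read-only array access on an immutable string can look is sketched below. ReadOnlyChars is a hypothetical stand-in class, not the Scrivo class, and it throws standard SPL exceptions where Scrivo is documented to throw \Scrivo\SystemException. It assumes the input is valid UTF-8.

```php
<?php
// Hypothetical stand-in for illustration only; this is not the Scrivo class.
final class ReadOnlyChars implements ArrayAccess, Countable
{
    private $chars;

    public function __construct(string $utf8)
    {
        // Split on character boundaries; assumes $utf8 is valid UTF-8.
        $this->chars = preg_split('//u', $utf8, -1, PREG_SPLIT_NO_EMPTY);
    }

    public function count(): int
    {
        return count($this->chars);
    }

    public function offsetExists($offset): bool
    {
        return isset($this->chars[$offset]);
    }

    #[\ReturnTypeWillChange]
    public function offsetGet($offset)
    {
        if (!isset($this->chars[$offset])) {
            throw new OutOfRangeException("Offset $offset is out of range.");
        }
        return $this->chars[$offset];
    }

    public function offsetSet($offset, $value): void
    {
        throw new LogicException('Strings are immutable.');
    }

    public function offsetUnset($offset): void
    {
        throw new LogicException('Strings are immutable.');
    }
}

$s = new ReadOnlyChars("brûlée");
echo count($s), "\n"; // 6
echo $s[2], "\n";     // û

try {
    $s[2] = 'u'; // writing is not allowed
} catch (LogicException $e) {
    echo $e->getMessage(), "\n"; // Strings are immutable.
}
```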

Parameters:

Type Name Def. Description
int $offset

Throws:

Exception Type Description
\Scrivo\SystemException If this method is called.
public \Scrivo\String replace(\Scrivo\String $from, \Scrivo\String $to)

Replace a substring or set of substrings in this string.

You can use this method instead of PHP's native str_replace and strtr functions. This method will do proper type checking for you.

Parameters:

Type Name Def. Description
\Scrivo\String $from

A (set of) string(s) to replace in this string.

\Scrivo\String $to

A (set of) replacement string(s) to replace the found string(s).

Returns:

\Scrivo\String Replace a substring or set of substrings in this string.

Throws:

Exception Type Description
\Scrivo\SystemException If the input data is not of type \Scrivo\String or \Scrivo\String[], or if the $to parameter is an array and $from isn't or doesn't have the same number of elements.
public rewind()

Reset the current character index so iterating will (re)start at the beginning of this string.

Note that this method is part of the implementation of Iterator and should not be called from another context.

public static setCollator(\Collator $coll)

Set the collator for sorting strings.

Parameters:

Type Name Def. Description
\Collator $coll

The collator to use.

public \Scrivo\String[] split(\Scrivo\String $delimiter, int $limit=0)

Split this string using a delimiter.

Like PHP's native explode function, this method splits a string on boundaries formed by the delimiter string. Note that the behavior of the limit parameter is slightly different, and that this method throws an exception if an empty string is passed as the delimiter.
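
The limit semantics documented for the $limit parameter can be sketched on top of explode(), which treats a limit of 0 as 1 and therefore needs a separate branch for "no limit". The function name splitUtf8 is hypothetical and the sketch works on plain PHP strings; it is an approximation, not the Scrivo implementation.

```php
<?php
// Hypothetical helper for illustration; not the Scrivo implementation.
function splitUtf8(string $s, string $delimiter, int $limit = 0): array
{
    if ($delimiter === '') {
        throw new InvalidArgumentException('The delimiter must not be empty.');
    }
    // explode() treats a limit of 0 as 1, so "no limit" gets its own branch.
    return $limit === 0
        ? explode($delimiter, $s)
        : explode($delimiter, $s, $limit);
}

print_r(splitUtf8("één;twee;drie", ";"));     // één, twee, drie
print_r(splitUtf8("één;twee;drie", ";", 2));  // één, twee;drie
print_r(splitUtf8("één;twee;drie", ";", -1)); // één, twee
```

Byte-wise splitting is safe here because a valid UTF-8 delimiter cannot match in the middle of another character.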

Parameters:

Type Name Def. Description
\Scrivo\String $delimiter

The boundary string.

int $limit 0

If limit is set and positive, the returned array will contain a maximum of limit elements, with the last element containing the rest of the string. If limit is negative, all components except the last -limit are returned. If limit is not set or 0, no limit will be used.

Returns:

\Scrivo\String[] Split this string using a delimiter.

Throws:

Exception Type Description
\Scrivo\SystemException If an empty search string was given.
public \Scrivo\String substr(int $start, int $length=65535)

Get a substring from a string using an offset and a length.

Just like PHP's native substr function this method returns a substring from this string using an offset and a length. But note that this method will throw an exception if the offset is invalid.

Parameters:

Type Name Def. Description
int $start

Start offset for the substring, use a negative number to use an offset from the end of the string.

int $length 65535

The length of the substring.

Returns:

\Scrivo\String Get a substring from a string using an offset and a length.

Throws:

Exception Type Description
\Scrivo\SystemException If the requested offset was out of range.
public \Scrivo\String substring(int $start, int $end)

Get a substring from a string using a start and end index.

This method is inspired by its Java counterpart and returns a substring of this string using a start and end index.
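
The difference between an offset/length pair and a start/end pair is easiest to see side by side. The sketch below assumes Java semantics, where the end index is exclusive; the source does not state this explicitly. The function name substringUtf8 is hypothetical and the mbstring extension is assumed.

```php
<?php
// Hypothetical helper for illustration; not the Scrivo implementation.
function substringUtf8(string $s, int $start, int $end): string
{
    $len = mb_strlen($s, 'UTF-8');
    if ($start < 0 || $end > $len || $start > $end) {
        throw new OutOfRangeException("Range $start to $end is out of range.");
    }
    // Assumption: $end is exclusive, as in Java's String.substring().
    return mb_substr($s, $start, $end - $start, 'UTF-8');
}

$s = "crème brûlée";
echo substringUtf8($s, 6, 12), "\n";     // brûlée (start and end index)
echo mb_substr($s, 6, 3, 'UTF-8'), "\n"; // brû (offset and length)
```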

Parameters:

Type Name Def. Description
int $start

Start offset for the substring.

int $end

The end offset for the substring.

Returns:

\Scrivo\String Get a substring from a string using a start and end index.

Throws:

Exception Type Description
\Scrivo\SystemException If the requested offset was out of range.
public \Scrivo\String toLowerCase()

Get a copy of this string with all of its characters converted to lower case.

Returns:

\Scrivo\String Get a copy of this string with all of its characters converted to lower case.

public \Scrivo\String toUpperCase()

Get a copy of this string with all of its characters converted to upper case.

Returns:

\Scrivo\String Get a copy of this string with all of its characters converted to upper case.

public \Scrivo\String trim()

Get a trimmed copy of this string.

Returns a copy of the string with leading and trailing whitespace removed. Whitespace characters are: ' ', \t, \r, \n and the non-breaking space character.
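
Trimming a non-breaking space is one place where byte-wise functions go wrong, as this sketch shows. The function name trimUtf8 is hypothetical and the regular expression is an assumption about the whitespace set listed above, not the Scrivo implementation. It assumes valid UTF-8 input.

```php
<?php
// Hypothetical helper for illustration; not the Scrivo implementation.
function trimUtf8(string $s): string
{
    // With the /u modifier PCRE matches whole UTF-8 characters, so \x{00A0}
    // only matches a real non-breaking space (U+00A0).
    return preg_replace('/^[ \t\r\n\x{00A0}]+|[ \t\r\n\x{00A0}]+$/u', '', $s);
}

$nbsp = "\u{00A0}";
var_dump(trimUtf8("$nbsp voilà \t\n$nbsp") === "voilà"); // bool(true)

// Byte-wise trim() with "\xA0" in its character list cuts into the "à"
// (bytes C3 A0) at the end of "voilà" and leaves an invalid string behind.
var_dump(trim("voilà", " \xA0") === "voilà"); // bool(false)
```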

Returns:

\Scrivo\String Get a trimmed copy of this string.

private \Scrivo\String unsafeSubstr(int $start, int $length)

Get a substring from a string without first checking the boundaries.

Parameters:

Type Name Def. Description
int $start

Start offset for the substring, use a negative number to use an offset from the end of the string.

int $length

The length of the substring.

Returns:

\Scrivo\String Get a substring from a string without first checking the boundaries.

public boolean valid()

Check if the current character index for iterating is valid.

Note that this method is part of the implementation of Iterator and should not be called from another context.

Returns:

boolean Check if the current character index for iterating is valid.


Documentation generated by phpDocumentor 2.0.0a12 and ScrivoDocumentor on August 29, 2013