Difference between revisions of "Strings"

From CVC4
Jump to: navigation, search
m
(Started polishing text)
Line 6: Line 6:
 
'''We highly recommend that users use SMT-LIB [http://smt-lib.org/language.shtml Version 2.5], instead of Version 2.0.''' The major difference is in the definition of escape sequences for string literals.
 
'''We highly recommend that users use SMT-LIB [http://smt-lib.org/language.shtml Version 2.5], instead of Version 2.0.''' The major difference is in the definition of escape sequences for string literals.
  
'''The syntax below is for CVC4 version > 1.4. Version 1.3 has only has ''partial'' support for syntax in this document.'''
+
'''The syntax below is for CVC4 version > 1.4.''' Version 1.3 has only has ''partial'' support for syntax in this document.
  
Since the string subsolver is still relatively new, the current stable version of CVC4 (1.4) does not provide the latest version of that subsolver.  Please use our '''latest Development version instead'''.
+
Since the string (sub)solver is still relatively new, the current stable version of CVC4 (1.4) does not provide the latest version of that solver.  Please use our latest Development version instead.
  
This string solver has a full support of '''ASCII''' characters. For arbitrary alphabets, please refer to sequence (parametrized string) theory.
+
Currently, the string solver supports string constants over a set characters limited to printable ASCII characters. Other characters ''must'' be encoded with escape sequences. For arbitry alphabets, we plan to provide later a separate solver for a theory of parametric sequences.
  
In compliance with SMT-Lib v2, only printable basic ASCII characters are allowed to be appeared natively in string literals. Other ASCII characters '''must''' be encoded as escape sequences.
+
To use the string solver it is important to declare initially (using the <code>set-logic</code> command) an SMT-LIB logic that includes strings. Since the SMT-LIB standard does not have an official theory of strings and related logics yet, the logic names described below are tentive and might change later.
  
The Theory of Strings (Quantifier-Free) logic symbol:
+
The basic logic is <code>QF_S</code> consisting of quanfier-free formulas over just the theory of strings, e.g., :
 
   (set-logic QF_S)
 
   (set-logic QF_S)
  
The logic symbol is '''IMPORTANT''' to the Theory of String, and it has to be set up.
+
If the formulas contain sybols from theories, please add them accordingly, e.g., if the formulas contain symbols from the BitVector theory, the logic symbol should be <code>QF_SBV</code>.
  
If the constraints contain more theories, please add them accordingly, e.g. if it contains BitVector, the symbol should be QF_SBV.
+
The logic consisting of the full (quantified) theory strings is <code>QF_SBV</code>:
 
+
The Theory of Quantified Strings logic symbol:
+
 
   (set-logic S)
 
   (set-logic S)
  
Since the theory of strings is pretty new in CVC4, we are still working on the standards for syntax.
 
If you have some comments or suggestions about CVC4 strings, please feel feel to email me: tianyi-liang@uiowa.edu
 
  
==Unsat Core==
+
 
String engine '''does''' support unsat core. To generate unsat cores, you must enable proofs at configure time, and
+
==Unsat Cores==
then run with "--dump-unsat-cores".
+
The string solver supports the generation of unsatisfiable core. As wilt other subsolvers though you must enable proofs at configuration time, and
 +
then run CVC with "--dump-unsat-cores" flag.
  
 
==Options==
 
==Options==
To use the experimental functions (disabled by default, even in ALL_SUPPORTED mode):
+
Some functions in the theory are have only experimental support currently
 +
and are disabled by default (even in the <code>ALL_SUPPORTED</code> logic:
 +
To use them:
 
   (set-option :strings-exp true)
 
   (set-option :strings-exp true)
  
To use finite model finding mode (false by default):
+
The solver can be run in ''finite model finding mode'' which guarantees termination for satisfiable problems. This mode is disabled by default. To enable it:
 
   (set-option :strings-fmf true)
 
   (set-option :strings-fmf true)
The string finite model finding mode is much '''slower''' than the default mode. We highly recommend you '''NOT''' using it if it is not necessary.
+
Note that in this mode the solver is much '''slower''' than in default mode. So we recommend it only as a fall back option when the default mode fails to find a solution within a reasonably large timeout.
  
 
To select the strategy of LB rule application: 0-lazy, 1-eager, 2-no (0 by default):
 
To select the strategy of LB rule application: 0-lazy, 1-eager, 2-no (0 by default):
Line 45: Line 44:
 
   (set-option :strings-alphabet-card n)
 
   (set-option :strings-alphabet-card n)
 
This is a reserved option for the extension of the sequence theory.
 
This is a reserved option for the extension of the sequence theory.
 +
 +
==Alphabet==
 +
Currently, the solver's theory is based on an alphabet consisting of the 256 characters from (8-bit)Extended ASCII. Since there are several versions of Extended ASCII we allow string constants to contain only ''printable US ASCII characters'', which are encoded in the same way in all Extended ASCII versions.
 +
 +
'''Note:''' The alphabet will change to the one prescribed by the SMT-LIB standard once there is one.
  
 
==Printable Characters==
 
==Printable Characters==
A ''printable'' character is any character between 0x20 and 0x7e in the standard ASCII table.
+
A ''printable'' character is any character with numerical value between 0x20 and 0x7e in the standard US ASCII encoding.
  
 
==Escape Sequences for String Literals==
 
==Escape Sequences for String Literals==
 +
 +
We support escape sequences common in most programming languages to represent non-printable characters.
 +
 
{| border="1"
 
{| border="1"
 
|-
 
|-
| \0 … \9
+
| <code>\0</code> <code>\9</code>
 
| represents ASCII character 0 … 9, respectively
 
| represents ASCII character 0 … 9, respectively
 
|-
 
|-
| \a, \b, \e, \f, \n, \r, \t, \v
+
| <code>\a</code>, <code>\b</code>, <code>\e</code>, <code>\f</code>, <code>\n</code>, <code>\r</code>, <code>\t</code>, <code>\v</code>
 
| represents its corresponding ASCII character (C++ convention)
 
| represents its corresponding ASCII character (C++ convention)
 
|-
 
|-

Revision as of 14:07, 9 June 2015

This page describes support for the theory of strings in CVC4.

Syntax

This document focuses on input written in SMT-LIB 2 format. A frontend for CVC4's native syntax is not available yet.

We highly recommend that users use SMT-LIB Version 2.5, instead of Version 2.0. The major difference is in the definition of escape sequences for string literals.

The syntax below is for CVC4 version > 1.4. Version 1.3 has only has partial support for syntax in this document.

Since the string (sub)solver is still relatively new, the current stable version of CVC4 (1.4) does not provide the latest version of that solver. Please use our latest Development version instead.

Currently, the string solver supports string constants over a set characters limited to printable ASCII characters. Other characters must be encoded with escape sequences. For arbitry alphabets, we plan to provide later a separate solver for a theory of parametric sequences.

To use the string solver it is important to declare initially (using the set-logic command) an SMT-LIB logic that includes strings. Since the SMT-LIB standard does not have an official theory of strings and related logics yet, the logic names described below are tentive and might change later.

The basic logic is QF_S consisting of quanfier-free formulas over just the theory of strings, e.g., :

 (set-logic QF_S)

If the formulas contain sybols from theories, please add them accordingly, e.g., if the formulas contain symbols from the BitVector theory, the logic symbol should be QF_SBV.

The logic consisting of the full (quantified) theory strings is QF_SBV:

 (set-logic S)


Unsat Cores

The string solver supports the generation of unsatisfiable core. As wilt other subsolvers though you must enable proofs at configuration time, and then run CVC with "--dump-unsat-cores" flag.

Options

Some functions in the theory are have only experimental support currently and are disabled by default (even in the ALL_SUPPORTED logic: To use them:

 (set-option :strings-exp true)

The solver can be run in finite model finding mode which guarantees termination for satisfiable problems. This mode is disabled by default. To enable it:

 (set-option :strings-fmf true)

Note that in this mode the solver is much slower than in default mode. So we recommend it only as a fall back option when the default mode fails to find a solution within a reasonably large timeout.

To select the strategy of LB rule application: 0-lazy, 1-eager, 2-no (0 by default):

 (set-option :strings-lb 1)

To set up string alphabet cardinality (256 by default, expert option):

 (set-option :strings-alphabet-card n)

This is a reserved option for the extension of the sequence theory.

Alphabet

Currently, the solver's theory is based on an alphabet consisting of the 256 characters from (8-bit)Extended ASCII. Since there are several versions of Extended ASCII we allow string constants to contain only printable US ASCII characters, which are encoded in the same way in all Extended ASCII versions.

Note: The alphabet will change to the one prescribed by the SMT-LIB standard once there is one.

Printable Characters

A printable character is any character with numerical value between 0x20 and 0x7e in the standard US ASCII encoding.

Escape Sequences for String Literals

We support escape sequences common in most programming languages to represent non-printable characters.

\0\9 represents ASCII character 0 … 9, respectively
\a, \b, \e, \f, \n, \r, \t, \v represents its corresponding ASCII character (C++ convention)
\ooo matches an ASCII character, where ooo consists of (no more than) three digits that represent the octal character code (from 0 to 377). For example, \101 represents ‘A’, while “\437” represent a string with two characters “#7”( *important* ).
\xNN matches an ASCII character, where NN is a two-digit hexadecimal character code. NN has to be exactly two hex-digits. If not, an exception will be raised.

The backslash character (\) : when followed by a character that is not recognized as an escaped character, matches that character. For example, \" matches a double quote(").

In CVC4 string literal output, a non-printable/extended ASCII character is printed in the format \xNN, where NN matches its ASCII character, except for \a, \b, \e, \f, \n, \r, \t, \v, which are printed directly.

Note: Escaped character literals are for version (> 1.3). They are not compatible with the string literal standards in SMT-Lib v2.

To be compliant with SMT-Lib, users are required to add another layer of escapes. For example, in v2.0, both \t and \\t mean the tab character, and both \\\t and \\\\t mean the two characters \ and t. In v2.5, only \t means the tab character, and \\t means the two characters \ and t. In v2.5, both \"" and "" means the character ". Please check the string literals section in SMT-Lib v2.5 standard for more details.

In string literals (for the string engine) that are following the SMT-Lib v2.5 specification, it is mandatory to escape \ (by \\) and " (by "") to prevent parse error.

Strings

To define a string variable:

 (declare-fun x () String)

String Concatenation:

 (str.++ s1 s2 ... sn)

where s1, s2, ..., and sn are string terms. String concatenation takes at least 2 arguments.

String Length:

 (str.len s)

where s is a string term.

Character in String:

 (str.at s i)

where s is a string term and i is a natural number. (see partial functions) The index is starting from 0.

Sub-String:

 (str.substr s i j )

where s is a string term, i and j are natural numbers. (see partial functions)

Escape Sequences for Regular Expressions

Currently, it is for CVC format only. (coming soon)

Symbolic Regular Expression

Membership Constraint:

 (str.in.re s r)

where s is a string term and r is a regular expression.

String to Regular Expression Conversion:

 (str.to.re s)

where s is a string term. The statement turns a regular expression that only contains a string s.

Regular Expression Concatenation:

 (re.++ r_1 r_2 ... r_n)

where r_1, r_2, ..., r_n are regular expressions.

Regular Expression Alternation:

 (re.union r_1 r_2 ... r_n)

where r_1, r_2, ..., r_n are regular expressions. re.or is for releases before March, 2014.

Regular Expression Intersection:

 (re.inter r_1 r_2 ... r_n)

where r_1, r_2, ..., r_n are regular expressions. re.itr is for releases before March, 2014.

Regular Expression Kleene-Star:

 (re.* r)

where r is a regular expression.

Regular Expression Kleene-Cross:

 (re.+ r)

where r is a regular expression.

Regular Expression Option:

 (re.opt r)

where r is a regular expression.

Regular Expression Range:

 (re.range s t)

where s, t are single characters in double quotes, e.g. "a", "b". It returns a regular expression that contains any character between s and t.

Regular Expression Loop:

 (re.loop r l u)

where r is a regular expression, l is a non-negative constant integer, and u is a non-negative constant integer. It returns a regular expression that contains at least l repetitions of r and at most u repetitions of r. If l >= u, it returns exactly l repetitions of r.

Regular Expression Loop-2:

 (re.loop r l)

where r is a regular expression, and l is a non-negative constant integer. It returns a regular expression that contains at least l repetitions of r.


The Empty Regular Expression:

 re.nostr

The Regular Expression that contains all characters:

 re.allchar

Experimental Mode

Following functions are under the --strings-exp option. They are under active refinement. Once they are stable, we will move them to the default mode. Please let us know when you have some suggestions.

String Char-At:

 (str.at s i)

where s is a string term and i is an integer term. i is the position. See partial functions section.

String Sub-string:

 (str.substr s i j)

where s is a string term and i, j are integer terms. i is the starting position, and j is the offset. See partial functions section.

String Contain:

 (str.contains s t)

where s and t are string terms. It returns true if the string s contains the string t. This function determines whether the string t can be found within the string s, returning true or false as appropriate.

String IndexOf:

(str.indexof s t i)

where s is a string, t is a non-empty string and i is a non-negative integer. This function returns the position of the first occurrence of the specified value t in the string s after the index i. It returns -1 if the value to search for never occurs.

String Replacement:

(str.replace s t1 t2)

where s, t1 and t2 are string terms, t1 is non-empty. This function searches the string s for the specified value t1, and returns a new string where the first occurrence of the specified value t1 is replaced by the string t2.

String PrefixOf:

(str.prefixof s t)

where s and t are string terms. It returns true if the string s is a prefix of the string t.

String SuffixOf:

(str.suffixof s t)

where s and t are string terms. It returns true if the string s is a suffix of the string t.

String To Integer Conversion:

(str.to.int s)

where s is a string term. It returns the corresponding natural number if s is valid; otherwise, it returns -1.

Integer To String Conversion:

(int.to.str i)

where i is an integer term. It returns the corresponding string if i is a natural number; otherwise, it returns an empty string.

Partial Functions

By the definition of partial functions in SMT-Lib Document, the following constraint is satisfiable:

 (assert (= (str.substr "a" 2 10) "b"))

However, the following constraints are unsatisfiable:

 (assert (= (str.substr "a" 2 1) "b"))
 (assert (= (str.substr "a" 2 1) "c"))

To achieve a desirable goal, it requires users to guard proper conditions. For example,

 (assert (= (str.at x j) "b"))
 (assert (> j 0))
 (assert (> (str.len x) j))

Extension

Together with other engine in CVC4, we can extend new functionality in the theory of strings. For example,

 (define-fun fun1 ((?x String) (?s String)) Bool
   (or (= ?x ?s)
       (> (str.len ?x) (str.len ?s))
    ))

Quantifiers over bounded Integers (with strings in the body) are supported in the experimental mode; however, quantifiers over strings are still under development.

Limitation

The decidability of this theory is unknown. For satisfiable problems (without extensions), our solver is sound, complete and terminating in the FMF mode (although the FMF mode will be slower than the default mode in general). For unsatisfiable problems, the termination is not guaranteed; however, user can tune the options for termination.

Current version supports ASCII characters only. We will move on to UNICODE in the future version.

Examples

Find an assignment for x, where x."ab"="ba".x and the length of x equals to 7.

 (set-logic QF_S)
 
 (declare-fun x () String)
 
 (assert (= (str.++ x "ab") (str.++ "ba" x)))
 (assert (= (str.len x) 7))
 
 (check-sat)

Find assignments for x and y, where x and y are distinct and their lengths are equal.

 (set-logic QF_S)
 
 (declare-fun x () String)
 (declare-fun y () String)
 
 (assert (not (=  x y)))
 (assert (= (str.len x) (str.len y)))
 
 (check-sat)

Find assignments for x and y, where x.y != y.x.

 (set-logic QF_S)
 
 (declare-fun x () String)
 (declare-fun y () String)
 (assert (not (= (str.++ x y) (str.++ y x))))
 
 (check-sat)

Find a model for x, y and z, where x."ab".y=y."ba".z and z=x.y and x."a"!="a".x.

 (set-logic QF_S)
 
 (declare-fun x () String)
 (declare-fun y () String)
 (declare-fun z () String)
 
 (assert (= (str.++ x "ab" y) (str.++ y "ba" z)))
 (assert (= z (str.++ x y)))
 (assert (not (= (str.++ x "a") (str.++ "a" x))))
 
 (check-sat)

Find a model for x and y, where both x and y are in the RegEx (a*b)* and they are different but have the same length.

 (set-logic QF_S)
 (set-option :strings-fmf true)
 
 (declare-fun x () String)
 (declare-fun y () String)
 
 (assert
   (str.in.re x
      (re.* (re.++ (re.* (str.to.re "a") ) (str.to.re "b") ))))
 (assert (str.in.re y
      (re.* (re.++ (re.* (str.to.re "a") ) (str.to.re "b") ))))
 
 (assert (not (= x y)))
 (assert (= (str.len x) (str.len y)))
 
 (check-sat)

API

More details can be found in the Tutorials.

C++

The example can be found in examples/api/strings.cpp.

If setting the logic, use "S" to enable theory of strings.

 smt.setLogic("S");

To create a string type, call mkSetType in the ExprManager.

 Type string = em.stringType();

Make some string literals:

 // std::string
 std::string std_str_ab("ab");
 // CVC4::String
 CVC4::String cvc4_str_ab(std_str_ab);
 CVC4::String cvc4_str_abc("abc");
 // String constants
 Expr ab = em.mkConst(cvc4_str_ab);
 Expr abc = em.mkConst(CVC4::String("abc"));

Make some string variables:

 Expr x = em.mkVar("x", string);
 Expr y = em.mkVar("y", string);
 Expr z = em.mkVar("z", string);

Make some string constraints:

 // String concatenation: x.ab.y
 Expr lhs = em.mkExpr(kind::STRING_CONCAT, x, ab, y);
 // String concatenation: abc.z
 Expr rhs = em.mkExpr(kind::STRING_CONCAT, abc, z);
 // x.ab.y = abc.z
 Expr formula1 = em.mkExpr(kind::EQUAL, lhs, rhs);
 // Length of y: |y|
 Expr leny = em.mkExpr(kind::STRING_LENGTH, y);
 // |y| >= 0
 Expr formula2 = em.mkExpr(kind::GEQ, leny, em.mkConst(Rational(0)));
 // Regular expression: (ab[c-e]*f)|g|h
 Expr r = em.mkExpr(kind::REGEXP_UNION,
 em.mkExpr(kind::REGEXP_CONCAT,
 em.mkExpr(kind::STRING_TO_REGEXP, em.mkConst(String("ab"))),
 em.mkExpr(kind::REGEXP_STAR,
 em.mkExpr(kind::REGEXP_RANGE, em.mkConst(String("c")), em.mkConst(String("e")))),
 em.mkExpr(kind::STRING_TO_REGEXP, em.mkConst(String("f")))),
 em.mkExpr(kind::STRING_TO_REGEXP, em.mkConst(String("g"))),
 em.mkExpr(kind::STRING_TO_REGEXP, em.mkConst(String("h"))));
 // String variables
 Expr s1 = em.mkVar("s1", string);
 Expr s2 = em.mkVar("s2", string);
 // String concatenation: s1.s2
 Expr s = em.mkExpr(kind::STRING_CONCAT, s1, s2);
 // s1.s2 in (ab[c-e]*f)|g|h
 Expr formula3 = em.mkExpr(kind::STRING_IN_REGEXP, s, r);

Make a query:

 Expr q = em.mkExpr(kind::AND,
   formula1,
   formula2,
   formula3);

Check the result:

 Result result = smt.checkSat(q);
 std::cout << "CVC4 reports: " << q << " is " << result << "." << std::endl;
 if(result == Result::SAT) {
   std::cout << " x = " << smt.getValue(x) << std::endl;
   std::cout << " s1.s2 = " << smt.getValue(s) << std::endl;
 }

Java

The example can be found in examples/api/java/Strings.java.

References