xapian parser bug?

David Bremner-2

Today we noticed that operator keywords can't be searched as prefixed
terms. Or that's what it looks like anyway. I tested "and", "or", and "not".

╰─% NOTMUCH_DEBUG_QUERY=y notmuch search 'subject:"and"'
Query string is:
subject:"and"
notmuch search: A Xapian exception occurred
A Xapian exception occurred parsing query: Syntax: <expression> AND <expression>
Query string was: subject:"and"

╰─% NOTMUCH_DEBUG_QUERY=y notmuch search 'subject:"or"'
Query string is:
subject:"or"
notmuch search: A Xapian exception occurred
A Xapian exception occurred parsing query: Syntax: <expression> OR <expression>
Query string was: subject:"or"

╰─% NOTMUCH_DEBUG_QUERY=y notmuch search 'subject:"not"'
Query string is:
subject:"not"
notmuch search: A Xapian exception occurred
A Xapian exception occurred parsing query: Syntax: <expression> NOT <expression>
Query string was: subject:"not"

Interestingly, putting spaces around the operator seems to be a
workaround. Something about turning on phrase parsing, maybe?

╰─% NOTMUCH_DEBUG_QUERY=y notmuch count 'subject:" not "'
Query string is:
subject:" not "
Exclude query is:
Query((((Kspam OR Kdeleted) OR Kmuted) OR Kbad-address))
Final query is:
Query(((Tmail AND 0 * XSUBJECTnot@1) AND_NOT (((Kspam OR Kdeleted) OR Kmuted) OR Kbad-address)))
9927
_______________________________________________
notmuch mailing list
[hidden email]
https://notmuchmail.org/mailman/listinfo/notmuch
James Aylett

Re: xapian parser bug?

David, this isn't the behaviour I see with QueryParser alone, unless you're driving it in a way I don't expect. In Python:

>>> import xapian
>>> qp = xapian.QueryParser()
>>> qp.add_prefix('subject', 'S')
>>> str(qp.parse_query('subject:"and"', qp.FLAG_DEFAULT|qp.FLAG_BOOLEAN_ANY_CASE))
'Query(Sand@1)'
>>> str(qp.parse_query('subject:"or"', qp.FLAG_DEFAULT|qp.FLAG_BOOLEAN_ANY_CASE))
'Query(Sor@1)'
>>> str(qp.parse_query('subject:"not"', qp.FLAG_DEFAULT|qp.FLAG_BOOLEAN_ANY_CASE))
'Query(Snot@1)'
>>> str(qp.parse_query('subject:" not "', qp.FLAG_DEFAULT|qp.FLAG_BOOLEAN_ANY_CASE))
'Query(Snot@1)'

Note that I'm using 1.4.7, and from your output I believe you're not (I believe the "0 *" scaling factor in the query description no longer appears in those situations).

J

> On 29 Sep 2018, at 23:09, David Bremner <[hidden email]> wrote:
> [...]

--
 James Aylett
 devfort.com — spacelog.org — tartarus.org/james/

Olly Betts

Re: xapian parser bug?

On Sun, Sep 30, 2018 at 09:50:30AM +0100, James Aylett wrote:
> Note that I'm using 1.4.7, and from your output I believe you're not
> (the * in the query description I believe doesn't happen in those
> situations any more).

1.4.4 and later eliminate redundant 0 scaling factors, but this one
isn't actually redundant:

> > Query(((Tmail AND 0 * XSUBJECTnot@1) AND_NOT (((Kspam OR Kdeleted) OR Kmuted) OR Kbad-address)))

If it was on the right-hand side of AND_NOT it would be eliminated
(because the right-hand side doesn't contribute any weight anyway).

FWIW, I also couldn't reproduce this (I tried with quest and 1.4.7):

$ quest -psubject:S -fdefault,boolean_any_case 'subject:"and"'
Parsed Query: Query(Sand@1)

Cheers,
    Olly
David Bremner-2

Re: xapian parser bug?

Olly Betts <[hidden email]> writes:

>
> FWIW, I also couldn't reproduce this (I tried with quest and 1.4.7):
>
> $ quest -psubject:S -fdefault,boolean_any_case 'subject:"and"'
> Parsed Query: Query(Sand@1)
>

Ah, OK, it must have something to do with the way that notmuch is using
field processors. And I see now that the following code (from
lib/regexp-fields.cc) is probably related (at least it explains why
subject:" not " works):

            if (str.find (' ') != std::string::npos)
                query_str = '"' + str + '"';
            else
                query_str = str;

            return parser.parse_query (query_str, NOTMUCH_QUERY_PARSER_FLAGS, term_prefix);

The motivation for not always triggering phrase processing is that it
breaks/disables wildcards. In particular this change was to fix the
query 'subject:foo*'.  The difficulty here is that the field processor
doesn't know if its string argument was originally quoted.
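
As a rough illustration of the rule described above, here is a minimal Python model of the quoting decision only. It does not call Xapian; `build_query_str` and the sample values are hypothetical names chosen for this sketch, and the values are assumed to arrive with the user's own quotes already stripped, as the message says.

```python
def build_query_str(s):
    """Model of the quoted C++ branch: phrase-quote only if s contains a space."""
    if " " in s:
        return '"' + s + '"'
    return s


# Hypothetical field values as the field processor might receive them.
for value in ["and", " not ", "foo*", "christmas party"]:
    print(repr(value), "->", repr(build_query_str(value)))
```

Under this model a bare `and` is handed back to the parser unquoted, which is consistent with the syntax errors reported at the start of the thread, while ` not ` is wrapped in quotes.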
David Bremner-2

Re: xapian parser bug?

David Bremner <[hidden email]> writes:

> Olly Betts <[hidden email]> writes:
>
>>
>> FWIW, I also couldn't reproduce this (I tried with quest and 1.4.7):
>>
>> $ quest -psubject:S -fdefault,boolean_any_case 'subject:"and"'
>> Parsed Query: Query(Sand@1)
>>
>
> Ah, OK, it must have something to do with the way that notmuch is using
> field processors. And I see now that the following code (from
> lib/regexp-fields.cc) is probably related (at least it explains
> why subject:" not " works):
>
>             if (str.find (' ') != std::string::npos)
>                 query_str = '"' + str + '"';
>             else
>                 query_str = str;
>
>             return parser.parse_query (query_str, NOTMUCH_QUERY_PARSER_FLAGS, term_prefix);

For the record, I have proposed a fix for notmuch (str is known to be
non-empty there, so dereferencing str.rbegin () is safe). This will
phrase-quote by default, unless the string looks like a wildcard query
(ends in '*' and contains no spaces).

diff --git a/lib/regexp-fields.cc b/lib/regexp-fields.cc
index 084bc8c0..52f30d82 100644
--- a/lib/regexp-fields.cc
+++ b/lib/regexp-fields.cc
@@ -194,7 +194,7 @@ RegexpFieldProcessor::operator() (const std::string & str)
             * phrase parsing, when possible */
            std::string query_str;
 
-           if (str.find (' ') != std::string::npos)
+           if (*str.rbegin () != '*' || str.find (' ') != std::string::npos)
                query_str = '"' + str + '"';
            else
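
To make the effect of the changed condition easier to see, here is a small Python model of it. This is only a sketch of the decision in the diff, not notmuch code; `build_query_str_patched` is a hypothetical name, and the input is assumed non-empty, as stated in the message.

```python
def build_query_str_patched(s):
    """Model of the patched condition: quote unless s ends in '*' and has no space."""
    # s is assumed to be non-empty, matching the assumption in the patch.
    if s[-1] != "*" or " " in s:
        return '"' + s + '"'
    return s


# Hypothetical field values, including the operator keywords from the report.
for value in ["and", "foo*", "foo bar*", "christmas party"]:
    print(repr(value), "->", repr(build_query_str_patched(value)))
```

With this rule the keyword `and` is phrase-quoted, while a single-word wildcard such as `foo*` is still passed through unquoted.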
Olly Betts

Re: xapian parser bug?

On Sun, Sep 30, 2018 at 09:05:25AM -0300, David Bremner wrote:
>             if (str.find (' ') != std::string::npos)
>                 query_str = '"' + str + '"';
>             else
>                 query_str = str;
>
>             return parser.parse_query (query_str, NOTMUCH_QUERY_PARSER_FLAGS, term_prefix);

I wouldn't recommend trying to generate strings to feed to QueryParser
like this code seems to be doing.  QueryParser aims to parse input from
humans not machines.

As well as the case where str is an operation name, the code above looks
like it will mishandle cases where str contains a tab or double quotes.
There are likely other problem cases too.
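
The two cases mentioned above can be seen by looking only at the strings such a rule would generate. The following Python sketch models the space-based rule from the quoted snippet; the function name and inputs are hypothetical, and what QueryParser then does with these strings is not shown here.

```python
def build_query_str(s):
    """Model of the quoted rule: wrap in double quotes only if a space is present."""
    return '"' + s + '"' if " " in s else s


tabbed = "christmas\tparty"   # words separated by a tab, no space character
quoted = 'say "hi" now'       # value containing its own double quotes

# The tab case is not quoted at all, so nothing forces a phrase.
print(repr(build_query_str(tabbed)))

# The double-quote case ends up with four quote characters, so the
# intended outer phrase boundaries are no longer the only ones present.
print(repr(build_query_str(quoted)))
print(build_query_str(quoted).count('"'))
```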

Cheers,
    Olly
David Bremner-2

Re: xapian parser bug?

Olly Betts <[hidden email]> writes:

> On Sun, Sep 30, 2018 at 09:05:25AM -0300, David Bremner wrote:
>>             if (str.find (' ') != std::string::npos)
>>                 query_str = '"' + str + '"';
>>             else
>>                 query_str = str;
>>
>>             return parser.parse_query (query_str, NOTMUCH_QUERY_PARSER_FLAGS, term_prefix);
>
> I wouldn't recommend trying to generate strings to feed to QueryParser
> like this code seems to be doing.  QueryParser aims to parse input from
> humans not machines.

str is the parameter to the FieldProcessor operator ().  The field
processor needs a way to approximate the standard probabilistic prefix
parsing in the fallback case. The addition of quotes is to force the
generation of a phrase query, otherwise e.g. subject:"christmas party"
doesn't work out well.

I tried using OP_PHRASE as the default operator, but it doesn't
handle some cases I need.

% quest -o phrase 'bob jones <[hidden email]>'
UnimplementedError: OP_NEAR and OP_PHRASE only currently support leaf subqueries

If I don't recursively call parse_query, then I guess I need to generate
terms in a compatible way before turning them into a phrase query. Maybe
that's not as hard as I originally thought, since being in a phrase turns
off the stemmer anyway, iiuc.  Is there a Xapian API I can use to extract
"bob", "jones", "bob", "example", "com" from the example above? I guess
I could use a throwaway Xapian::Document and a TermGenerator
(basically aping xapian-core/tests/api_termgen.cc).
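
To show the kind of ordered term list such a step would need to produce, here is a rough Python stand-in. It uses a plain regular expression, not Xapian's TermGenerator, so its word-splitting rules are an assumption and will differ from Xapian's in general. The function name is hypothetical, and the address `bob@example.com` is a made-up input, since the original one is hidden in the archive.

```python
import re


def approx_terms(text):
    """Return (position, term) pairs; positions start at 1.

    Approximation only: runs of ASCII letters and digits, lowercased.
    """
    words = re.findall(r"[A-Za-z0-9]+", text)
    return [(pos, word.lower()) for pos, word in enumerate(words, start=1)]


# Hypothetical input modelled on the quest example above.
sample = "bob jones <bob@example.com>"
for pos, term in approx_terms(sample):
    print(pos, term)
```

The resulting terms, in position order, are what one would then combine into a single phrase query over leaf terms, rather than parsing a generated query string a second time.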

d