regex [X-Z] with non-ascii char returns different results from (X|Y|Z)

yury.t

regex [X-Z] with non-ascii char returns different results from (X|Y|Z)


Some regular expressions return incorrect results if the pattern contains multibyte characters in square brackets.  The following bracket expression matches subjects that do not start with `[１-９]` and returns more results than the equivalent parenthesized alternation.

(Please note that the digits are full-width Unicode characters.)




    notmuch count -- 'subject:"/^[１-９]/"' # 961
    notmuch count -- 'subject:"/^(１|２|３|４|５|６|７|８|９)/"' # 32





Somehow, a non-ASCII character in brackets matches any character whose code point starts with the same hex digit.  For example:





- [１] (U+FF11) is treated as [\x{F000}-\x{FFFF}]
- ^[倀] (U+5000), ^[啕] (U+5555) and ^[忿] (U+5FFF) return the same results since they are all "U+5xxx".

Without ^, the results vary but still contain unrelated subjects.





Curly brackets for repetition also behave strangely.

Suppose there are two emails whose subjects are (A) "１人" and (B) "１２人":



- ^(１|２...|９)人 - matches A, does not match B (expected)
- ^(１|２...|９){2}人 - does not match A, matches B (expected)
- ^[１-９]人 and ^[１-９]{2}人 - match neither
- ^[１-９]{3}人, {4} and {5} - match A, do not match B
- ^[１-９]{6}人, {7} and {8} - do not match A, match B





As noted in the notmuch-search-terms manpage, I made sure to wrap the regular expression in double quotes and the entire query in single quotes.  I also tried increasing and decreasing $XAPIAN_CJK_NGRAM and rebuilding the index, but the results did not change.





_______________________________________________
notmuch mailing list
[hidden email]
https://notmuchmail.org/mailman/listinfo/notmuch
David Bremner-2

Re: regex [X-Z] with non-ascii char returns different results from (X|Y|Z)

"yury.t" <[hidden email]> writes:

> Some regular expression returns incorrect results if the pattern
> contains multibyte characters in square brackets.  The following
> bracket expression matches subjects not starting with `[１-９]` and
> returns more results than the parenthesis expression.

We rely on POSIX.2 regex functions (regcomp, regexec). I would be
interested to know if the searches you are interested in work in a
standalone C program using regcomp and regexec.

d


_______________________________________________
yury.t

Re: regex [X-Z] with non-ascii char returns different results from (X|Y|Z)

Thank you for your reply.
I confirmed that the issue also reproduces in a standalone C program: https://pastebin.com/5NaCM45G

Sorry for bothering you...
_______________________________________________
David Bremner-2

Re: regex [X-Z] with non-ascii char returns different results from (X|Y|Z)

"yury.t" <[hidden email]> writes:

> Thank you for your reply.
> I confirmed that the issue is reproduced in C program. https://pastebin.com/5NaCM45G
>
> Sorry for bothering you...

I'm not sure, but it might be a glibc bug. Since we are already using
glib, maybe we should use

      https://developer.gnome.org/glib/stable/glib-Perl-compatible-regular-expressions.html

I don't know if it also has this problem with [] and non-ascii
characters.

d
_______________________________________________
Tomi Ollila-2

Re: regex [X-Z] with non-ascii char returns different results from (X|Y|Z)

On Thu, Aug 22 2019, David Bremner wrote:

> "yury.t" <[hidden email]> writes:
>
>> Thank you for your reply.
>> I confirmed that the issue is reproduced in C program. https://pastebin.com/5NaCM45G
>>
>> Sorry for bothering you...
>
> I'm not sure, but it might be a glibc bug. Since we are already using
> glib, maybe we should use
>
>       https://developer.gnome.org/glib/stable/glib-Perl-compatible-regular-expressions.html
>
> I don't know if it also has this problem with [] and non-ascii
> characters.

Since pcre2 supports \K, that gives a positive vibe about the above

( 'Resetting the match start'
  in http://www.pcre.org/current/doc/html/pcre2pattern.html )

Tomi

>
> d
_______________________________________________
yury.t

Re: regex [X-Z] with non-ascii char returns different results from (X|Y|Z)

In reply to this post by David Bremner-2
Although this thread may now be off-topic, let me send a follow-up.
Searching with C-related terms, I found some articles about this issue.  It seems to be a common problem with regex and multibyte characters in C (e.g. https://stackoverflow.com/a/15895746).

On Wed, Aug 21, 2019 at 12:58:04PM +0000, [hidden email] wrote:
> - [１] (U+FF11) is treated as [\x{F000}-\x{FFFF}]

Actually, it becomes [\xef\xbc\x91], a set of three separate bytes.  That's why it matches U+Fxxx (which starts with \xef in UTF-8).  And without ^, it matches a single byte in the middle of a character, for example U+4444 (\xe4\x91\x84) or U+5C11 (\xe5\xb0\x91).

I'm not familiar with C and don't know whether pcre or \K solves this issue, but it might be hard to fix if the root cause is how C handles multibyte strings.