accented characters

Bruno Deremble

accented characters


Hi,
I am still new to notmuch and keep experimenting with it; it has a lot of
very interesting features.
I realized that searching for "été" and "ete" does not give the same
results, which may be confusing in some situations (for example, when a
sender has an accented name and may or may not sign emails with the
accented form).

A way to handle this could be to index only unaccented words, which
requires adding a filter before the indexing process. I looked at the code
and it seems that this should be handled by gmime?
There are also libraries that are supposed to do that, such as 'unac'.

Is it something that you have been exploring already?

thank you
bruno
 
_______________________________________________
notmuch mailing list
[hidden email]
https://notmuchmail.org/mailman/listinfo/notmuch
David Bremner

Re: accented characters

Bruno Deremble <[hidden email]> writes:

> A way to handle this could be to only index non accented words which
> requires to add a filter before the indexing process. I looked at the code
> and it seems that this should be handled by gmime?
> there are also libraries that are supposed to do that such as 'unac'.
>
> Is it something that you have been exploring already?

We have discussed it a bit (with another francophone, in copy ;) ), but
I think no-one got very far.

I guess the ideal case would be to support both accented and
accent-free search. That would require adding some more terms to the
index (both accented and unaccented versions). It's not clear to me yet
what kind of performance impact that would have.
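
A minimal sketch of the "index both forms" idea, in Python rather than
notmuch's C/C++. The names strip_accents and index_terms are hypothetical
and not part of notmuch or Xapian; the whitespace tokenizer is a stand-in
for real term generation. It only illustrates that each accented word
would contribute two terms, which is where the extra index size would
come from.

```python
import unicodedata


def strip_accents(word):
    """Drop combining marks after canonical decomposition (NFD)."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def index_terms(text):
    """Return the terms to index: every word plus its unaccented form.

    Hypothetical helper: splitting on whitespace stands in for a real
    term generator.
    """
    terms = set()
    for word in text.lower().split():
        # Store the accented form in composed (NFC) form for consistency.
        terms.add(unicodedata.normalize("NFC", word))
        terms.add(strip_accents(word))
    return terms


terms = index_terms("Été à München")
print(sorted(terms))
```

Words without accents produce a single term, so the growth of the index
would depend on how much accented text the mail store contains.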

Xapian already has something called "stemmers" (in xapian-core/languages
in the source tree), which do, among other things, strip accents. Those
are generally targeted at a single language, which I suspect is not
very useful for notmuch (even I as a mostly-unilingual person have a
fair amount of English, French, and German in my mailstore). Nonetheless
a custom stemmer might be the right way to go, since that step is
happening anyway.  Or perhaps people would be happy enough with being
able to set the stemmer (currently it is hardcoded to English). That
would be a relatively easy change to notmuch, but I don't know how many
people would find it a good tradeoff to lose English stemming
(i.e. searches for 'stem' and 'stemming' being equivalent) for
de-accenting.

I'm not sure if the query language would need to support the
distinction between accented and unaccented searches. I imagine that
people naturally type the non-accented versions in a search, but I do
wonder about cases like (German) München. Should that be stemmed to
Munchen or Muenchen?
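
To make the München question concrete, here is a small Python sketch of
the two candidate foldings. Both function names and the GERMAN_TRANSLIT
table are assumptions made for illustration; neither is taken from
notmuch, Xapian, or gmime.

```python
import unicodedata

# Hypothetical transliteration table for the German convention.
GERMAN_TRANSLIT = {
    "ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss",
    "Ä": "Ae", "Ö": "Oe", "Ü": "Ue",
}


def fold_strip(word):
    """Language-neutral folding: decompose, then drop combining marks."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def fold_german(word):
    """German-style folding: expand umlauts first, then strip the rest."""
    composed = unicodedata.normalize("NFC", word)
    expanded = "".join(GERMAN_TRANSLIT.get(ch, ch) for ch in composed)
    return fold_strip(expanded)


print(fold_strip("München"))   # Munchen
print(fold_german("München"))  # Muenchen
```

The two results differ, so an index built with one rule would not match
a query typed with the other unless both forms were indexed.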

The other thing I don't know is how many people would be happy with just
stripping all accents. That could be done in a gmime filter, as you
suggest. That would be more likely to require changes to the query
language. Offhand I don't know how to transparently de-accent all query
words.

d






Stefano Zacchiroli

Re: accented characters

On Mon, Nov 13, 2017 at 09:22:36AM -0400, David Bremner wrote:
> The other thing I don't know is how many people would be happy with just
> stripping all accents. That could be done in a gmime filter, as you
> suggest. That would be more likely to require changes to the query
> language. Off hand I don't know how to transparently de-accent all query
> words.

My gut feeling is that removing accents by default from both the terms
in the index and user queries would go a long way in addressing this
problem. Especially so if it's a boolean option in the notmuch config
(which defaults to stripping accents).
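
A toy Python sketch of that behaviour, assuming a hypothetical boolean
setting (here the strip_accents parameter; no such notmuch option exists
as far as this thread says). The same fold() is applied at index time
and at query time, and the dict-based index is only a stand-in for
Xapian.

```python
import unicodedata


def fold(text, strip_accents=True):
    """Lowercase the text and, if enabled, remove combining marks."""
    text = text.lower()
    if not strip_accents:
        return unicodedata.normalize("NFC", text)
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def build_index(messages, strip_accents=True):
    """Map each folded term to the set of message ids containing it."""
    index = {}
    for msg_id, body in messages.items():
        for term in fold(body, strip_accents).split():
            index.setdefault(term, set()).add(msg_id)
    return index


def search(index, query, strip_accents=True):
    """Return ids of messages containing every folded query term."""
    terms = fold(query, strip_accents).split()
    if not terms:
        return set()
    return set.intersection(*(index.get(t, set()) for t in terms))


messages = {"m1": "Un bel été", "m2": "ete 2017"}
index = build_index(messages)
print(search(index, "été"))  # both messages match
print(search(index, "ete"))  # both messages match
```

With the option turned off on both sides, "été" and "ete" are distinct
terms again, which is the behaviour reported at the start of the thread.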

As a random example/data point, chromium does that: when you search for
an unaccented string in a web page, it finds any combination of it with
accents. It is, by far, my best UX experience w.r.t. accents on GNU/Linux.

Unicode has a notion of canonical decomposition that rewrites accented
characters as a sequence of non-accented characters + combining modifiers
(https://en.wikipedia.org/wiki/Unicode_equivalence). A bunch of libraries
use that to normalize away accents in Unicode strings. I'm aware
of a few in Python for instance, but not in C++ (which I believe is what
you'd be interested in).
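
For illustration, this is roughly what that looks like with Python's
standard unicodedata module. It is only a sketch of the normalization
step, not a proposal for how notmuch should implement it; the variable
names are arbitrary.

```python
import unicodedata

word = "\u00e9t\u00e9"  # "été" with precomposed (NFC) characters

# Canonical decomposition (NFD): each "é" becomes "e" + U+0301.
decomposed = unicodedata.normalize("NFD", word)
codepoints = [f"U+{ord(ch):04X}" for ch in decomposed]
print(codepoints)  # ['U+0065', 'U+0301', 'U+0074', 'U+0065', 'U+0301']

# Dropping nonspacing marks (general category "Mn") removes the accents.
stripped = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
print(stripped)  # ete
```

Characters that have no canonical decomposition (for example "ß" or "ø")
pass through unchanged, so this covers accents but not every
transliteration one might want.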

HTH,
--
Stefano Zacchiroli . [hidden email] . upsilon.cc/zack . . o . . . o . o
Computer Science Professor . CTO Software Heritage . . . . . o . . . o o
Former Debian Project Leader & OSI Board Director  . . . o o o . . . o .
« the first rule of tautology club is the first rule of tautology club »
David Bremner

Re: accented characters

Stefano Zacchiroli <[hidden email]> writes:

>
> Unicode has a notion of canonical form that rearrange accented
> characters in a sequence of non-accented characters + modifiers
> https://en.wikipedia.org/wiki/Unicode_equivalence . A bunch of libraries
> use that stuff to normalize-away accents in unicode strings. I'm aware
> of a few in Python for instance, but not in C++ (which I believe is what
> you'd be interested in).
>

Apropos, Rob Browning started looking at canonicalization using glib in

        id:[hidden email]
        http://article.gmane.org/gmane.mail.notmuch.general/21004