locales and notmuch

David Bremner-2

locales and notmuch


So I've been revisiting the "user defined headers" [1] patches. I need
the <prefix> in

    $ notmuch config set index.header.<prefix> "blah"

to be unique case-insensitively, so I decided to convert them to lower
case on input. This turns out to be "fun", if we try to handle things
other than ASCII.  So one option is to just insist prefixes are ASCII.
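
A minimal sketch of the ASCII-only option, assuming a hypothetical helper
called normalize_ascii_prefix (the name and signature are illustrative, not
notmuch API). It rejects any byte outside ASCII and lower-cases A-Z by hand,
so the result does not depend on the current locale:

```c
#include <stddef.h>

/* Hypothetical helper, not notmuch API: copy `in` to `out`, lower-casing
 * ASCII letters. Returns 0 on success, -1 if `in` contains a non-ASCII
 * byte or does not fit in `out` (including the terminating NUL). */
int
normalize_ascii_prefix (const char *in, char *out, size_t out_size)
{
    size_t i;

    for (i = 0; in[i] != '\0'; i++) {
        unsigned char c = (unsigned char) in[i];

        /* Reject non-ASCII bytes, and make sure there is still room
         * for this byte plus the terminating NUL. */
        if (c >= 0x80 || i + 1 >= out_size)
            return -1;
        /* Avoid tolower(): its result can depend on the locale. */
        if (c >= 'A' && c <= 'Z')
            c = (unsigned char) (c - 'A' + 'a');
        out[i] = (char) c;
    }
    if (i >= out_size)
        return -1;
    out[i] = '\0';
    return 0;
}
```

With this approach two prefixes collide exactly when their lower-cased ASCII
forms are byte-for-byte equal, which keeps the uniqueness check trivial.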

Otherwise we could insist they are UTF-8, ignoring the locale. The
fullest generality (I think) is to first convert from the user's locale
to UTF-8, as in the attached sample program. The gotcha is that the call
to setlocale is necessary, and can't really be local to a string utility
function. So we'd have to add that to notmuch startup. We mostly ignore
locales, so I guess there shouldn't be too many side effects; otoh I
don't have much experience with locales.

So what do people think? ASCII? UTF-8? Locale sensitivity?

[1] id:[hidden email]


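#include <stdio.h>
#include <stdlib.h>   /* abort */
#include <locale.h>
#include <glib.h>

int
main (int argc, char **argv)
{
  gchar *utf8_str, *lc_str;
  GError *err = NULL;

  /* Adopt the locale from the environment; without this the process
   * stays in the "C" locale and the conversion below cannot know the
   * user's character set. */
  setlocale (LC_ALL, "");

  /* Convert from the locale's encoding to UTF-8. */
  utf8_str = g_locale_to_utf8 ("Sn☃man", -1, NULL, NULL, &err);

  if (! utf8_str) {
    fprintf (stderr, "%s\n", err->message);
    abort ();
  }

  /* Lower-case the UTF-8 string. */
  lc_str = g_utf8_strdown (utf8_str, -1);

  printf ("%s\n", lc_str);

  g_free (lc_str);
  g_free (utf8_str);
  return 0;
}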

_______________________________________________
notmuch mailing list
[hidden email]
https://notmuchmail.org/mailman/listinfo/notmuch
David Bremner-2

Re: locales and notmuch

David Bremner <[hidden email]> writes:

> Otherwise we could insist they are UTF-8, ignoring the locale. The
> fullest generality (I think) is to first convert from the user's locale
> to UTF-8, as in the attached sample program. The gotcha is that the call
> to setlocale is necessary, and can't really be local to a string utility
> function. So we'd have to add that to notmuch startup. We mostly ignore
> locales, so I guess there shouldn't be too many side effects; otoh I
> don't have much experience with locales.
>

1) It might be possible to save and restore the locale, although that
sounds a bit heavyweight for lowercasing a string.

2) We'd need a UTF-8 locale to test in. I guess C.UTF-8 is not yet
universally available.
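
Point 1 above could be sketched roughly as follows, using only standard
setlocale. The helper names locale_save_and_set and locale_restore are
hypothetical, chosen for illustration:

```c
#include <locale.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helpers, not notmuch API. */

/* Switch `category` to `new_locale` and return a heap copy of the
 * previous setting, or NULL on failure (locale left unchanged). */
char *
locale_save_and_set (int category, const char *new_locale)
{
    const char *current = setlocale (category, NULL);
    char *saved;

    if (current == NULL)
        return NULL;
    /* Copy it: later setlocale() calls may overwrite the returned string. */
    saved = malloc (strlen (current) + 1);
    if (saved == NULL)
        return NULL;
    strcpy (saved, current);

    if (setlocale (category, new_locale) == NULL) {
        free (saved);
        return NULL;
    }
    return saved;
}

/* Restore a setting returned by locale_save_and_set() and free it. */
void
locale_restore (int category, char *saved)
{
    if (saved != NULL) {
        setlocale (category, saved);
        free (saved);
    }
}
```

A caller would bracket the conversion with
saved = locale_save_and_set (LC_CTYPE, "") and locale_restore (LC_CTYPE, saved).
Note that setlocale changes process-wide state, so this is not safe while
other threads are calling locale-dependent functions.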

d


Matt Armstrong

Re: locales and notmuch

David Bremner <[hidden email]> writes:

> David Bremner <[hidden email]> writes:
>
>> Otherwise we could insist they are UTF-8, ignoring the locale. The
>> fullest generality (I think) is to first convert from the user's locale
>> to UTF-8, as in the attached sample program. The gotcha is that the call
>> to setlocale is necessary, and can't really be local to a string utility
>> function. So we'd have to add that to notmuch startup. We mostly ignore
>> locales, so I guess there shouldn't be too many side effects; otoh I
>> don't have much experience with locales.
>>
>
> 1) It might be possible to save and restore the locale, although that
> sounds a bit heavyweight for lowercasing a string.
>
> 2) We'd need a UTF-8 locale to test in. I guess C.UTF-8 is not yet
> universally available.

Notmuch should probably adopt a coherent strategy with respect to
character set encodings, rather than do something ad-hoc for the
feature.  Most systems I have worked with normalize to UTF-8 at the
edges and do all work using that encoding.
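
One small piece of "normalize at the edges" is refusing input that is not
valid UTF-8 before it reaches the rest of the program. A minimal sketch in
plain C follows; the function name is hypothetical, and it only validates,
it does not convert from other character sets:

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: return true if the `len` bytes at `s` form
 * well-formed UTF-8 (no overlong forms, no surrogates, max U+10FFFF). */
bool
is_valid_utf8 (const unsigned char *s, size_t len)
{
    size_t i = 0;

    while (i < len) {
        unsigned char c = s[i];
        size_t need;        /* continuation bytes expected */
        unsigned long cp;   /* decoded code point */
        unsigned long min;  /* smallest code point legal at this length */

        if (c < 0x80) {
            i++;
            continue;
        } else if ((c & 0xE0) == 0xC0) {
            need = 1; cp = c & 0x1F; min = 0x80;
        } else if ((c & 0xF0) == 0xE0) {
            need = 2; cp = c & 0x0F; min = 0x800;
        } else if ((c & 0xF8) == 0xF0) {
            need = 3; cp = c & 0x07; min = 0x10000;
        } else {
            return false;   /* stray continuation byte or invalid lead */
        }

        if (len - i - 1 < need)
            return false;   /* sequence truncated */
        for (size_t k = 1; k <= need; k++) {
            if ((s[i + k] & 0xC0) != 0x80)
                return false;
            cp = (cp << 6) | (s[i + k] & 0x3F);
        }
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return false;
        i += need + 1;
    }
    return true;
}
```

A check like this could sit wherever bytes enter the program (config values,
command line arguments), with conversion from the user's locale as a
separate step in front of it.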

It is an interesting question: what encoding does .notmuch-config use?
UTF-8?  User's choice?  Similarly, what is the encoding of notmuch's
command line args?

I was just reading https://xapian.org/features and Xapian seems to store
text in UTF-8.  If this is the case, where is the code that does the
charset conversions between the email messages and UTF-8?  How about
between the command line args to UTF-8?

David Bremner-2

Re: locales and notmuch

Matt Armstrong <[hidden email]> writes:

>
> Notmuch should probably adopt a coherent strategy with respect to
> character set encodings, rather than do something ad-hoc for the
> feature.  Most systems I have worked with normalize to UTF-8 at the
> edges and do all work using that encoding.
>

You're probably correct. On the other hand, lack of locale handling is not
something that people actually complain about very much. So even if we do
decide to "do the right thing" eventually, I'd probably just continue
ignoring the problem for now, rather than block work on things that do
annoy people.

> It is an interesting question: what encoding does .notmuch-config use?
> UTF-8?  User's choice?

It's loaded by g_key_file_load_from_data; I suspect that does no conversion.

> Similarly, what is the encoding of notmuch's
> command line args?

There is no conversion done.

In both these cases it probably works mostly OK for people (at least
nobody has complained) because user values are treated as opaque
NUL-terminated byte sequences.

> I was just reading https://xapian.org/features and Xapian seems to store
> text in UTF-8.  If this is the case, where is the code that does the
> charset conversions between the email messages and UTF-8?

I'd have to double check the code to be sure, but I suspect this is done
by GMime when parsing the files.

> How about
> between the command line args to UTF-8?

AFAIR, there is no conversion, and search terms are passed straight to
Xapian.

This probably doesn't work well for people with non-UTF-8 locales.

Daniel Kahn Gillmor

Re: locales and notmuch

In reply to this post by David Bremner-2
(sorry for the late reply to this thread)

On Thu 2019-02-21 15:11:48 -0400, David Bremner wrote:
> to be unique case-insensitively, so I decided to convert them to lower
> case on input. This turns out to be "fun", if we try to handle things
> other than ASCII.  So one option is to just insist prefixes are ASCII.
>
> Otherwise we could insist they are UTF-8, ignoring the locale. The
> fullest generality (I think) is to first convert from the user's locale
> to UTF-8, as in the attached sample program.

I don't think this discussion fully covers just how "fun" this
conversion is.

Even if we assume UTF-8 in the database (which i think we should),
making something all lower-case is locale-dependent.  The classic
example, iirc, is that in most UTF-8 locales, U+0049 LATIN CAPITAL
LETTER I downcases to U+0069 LATIN SMALL LETTER I, but in tr_TR
(Turkish), it downcases to U+0131 LATIN SMALL LETTER DOTLESS I.  (and
upper-casing U+0069 LATIN SMALL LETTER I in tr_TR yields U+0130 LATIN
CAPITAL LETTER I WITH DOT ABOVE)
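
The locale dependence can be probed with standard C alone. The sketch below
uses a hypothetical helper, lower_in_locale, that lower-cases one wide
character under a named LC_CTYPE locale and then restores the previous
setting; whether a given locale is installed, and what its case tables say,
depends on the system:

```c
#include <locale.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>
#include <wctype.h>

/* Hypothetical helper: lower-case `c` under the LC_CTYPE locale `name`,
 * then restore the previous locale. Sets *ok to 0 and returns `c`
 * unchanged if the locale cannot be selected (e.g. not installed). */
wint_t
lower_in_locale (const char *name, wint_t c, int *ok)
{
    const char *current = setlocale (LC_CTYPE, NULL);
    char *saved;
    wint_t result = c;

    *ok = 0;
    if (current == NULL)
        return c;
    /* Copy the current setting; setlocale() may reuse its buffer. */
    saved = malloc (strlen (current) + 1);
    if (saved == NULL)
        return c;
    strcpy (saved, current);

    if (setlocale (LC_CTYPE, name) != NULL) {
        result = towlower (c);
        *ok = 1;
        setlocale (LC_CTYPE, saved);
    }
    free (saved);
    return result;
}
```

Comparing lower_in_locale ("C", L'I', &ok) with
lower_in_locale ("tr_TR.UTF-8", L'I', &ok) is where one would expect to see
U+0069 versus U+0131, on a system that has the Turkish locale available.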

Similarly, if there's anything that the DB cares about collation for,
that also varies dramatically across UTF-8 locales.

sigh.

I have no problem with asserting that all character strings in the
notmuch database are UTF-8.  That's just the only sane thing to do in
2019.  But if we build any feature into notmuch that makes assumptions
or requirements about upper-casing, lower-casing, or collating strings,
and that feature interacts between the currently-running locale and
whatever locale was used to store data in the database in the past,
and those locales can differ, we may be inflicting some subtle pain on
users.

(note that i'm assuming in this discussion that we're *just* talking
about metadata -- notmuch configuration options, explicit xapian terms,
etc, but *not* the indexed text of the messages, which is an entirely
different kettle of fish)

I see two protective approaches for handling this simply yet being clear
about our concerns.  Both methods introduce a clear dependency on some
UTF-8 locale, in the way that we also have clear dependencies on GMime
or Xapian.

 a) assert that all text strings in the notmuch db's metadata are
    C.UTF-8, and enforce this explicitly in the codebase.

or,

 b) upon database initialization, select a UTF-8 locale (probably based
    on the user's locale during "notmuch setup") and store it in the
    database (perhaps reporting and displaying it via a "notmuch config"
    value).  If any locale-dependent function is used against
    in-database metadata while a *different* locale is active in the
    environment, warn that this mismatch is happening, and prefer the
    locale stored in the db.
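
The decision logic in (b) might look something like the sketch below. The
function name, the idea of passing both locale names as strings, and the
wording of the warning are all assumptions for illustration; nothing here
is existing notmuch code:

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical sketch of approach (b), not notmuch API.
 * `stored` is the locale name recorded in the database (NULL if none
 * was recorded yet); `active` is the locale from the environment.
 * Returns the locale to use for locale-dependent metadata operations,
 * warning on stderr when the two differ. */
const char *
choose_metadata_locale (const char *stored, const char *active)
{
    if (stored == NULL)
        return active;              /* nothing recorded: use the environment */

    if (active != NULL && strcmp (stored, active) == 0)
        return stored;              /* no mismatch */

    fprintf (stderr,
             "Warning: database locale is %s but active locale is %s; "
             "using %s for metadata\n",
             stored, active ? active : "(unset)", stored);
    return stored;                  /* prefer the locale stored in the db */
}
```

Approach (a) is the degenerate case where the stored value is fixed to
C.UTF-8 and never consulted from the environment.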

I don't have the capacity to work on this kind of safeguard right now,
but someone who wants to learn more about locales and notmuch could try
to implement it and we could see what happens.  Being explicit about the
concern like this might help to raise the profile of the specific risky
codepaths, which in turn could prompt someone to make a more
sophisticated and useful fix than either of the guardrails described
above.

        --dkg

David Bremner-2

Re: locales and notmuch

Daniel Kahn Gillmor <[hidden email]> writes:

> (sorry for the late reply to this thread)
>
> On Thu 2019-02-21 15:11:48 -0400, David Bremner wrote:
>> to be unique case-insensitively, so I decided to convert them to lower
>> case on input. This turns out to be "fun", if we try to handle things
>> other than ASCII.  So one option is to just insist prefixes are ASCII.
>>

> I have no problem with asserting that all character strings in the
> notmuch database are UTF-8.  That's just the only sane thing to do in
> 2019.  But if we build any feature into notmuch that makes assumptions
> or requirements about upper-casing, lower-casing, or collating strings,
> and that feature interacts between the currently-running locale and
> whatever locale was used to store data in the database in the past,
> and those locales can differ, we may be inflicting some subtle pain on
> users.

I eventually settled on 4b9c03efc, which will probably do strange things
to people who define non-ASCII prefix names in non-UTF-8 locales. I'm OK
atm with just saying that is unsupported.

d