Handling mislabeled emails encoded with Windows-1252

Sebastian Poeplau

Handling mislabeled emails encoded with Windows-1252

Hi,

This email is to suggest a minor change in how notmuch handles text
encoding when displaying emails. The motivation is the following: I keep
receiving emails that are encoded with Windows-1252 but claim to be
ISO 8859-1. The two character sets only differ in the range between 0x80
and 0x9F where Windows-1252 contains special characters (e.g. “quotation
marks”) while ISO 8859-1 only has non-printable ones. The mislabeling
thus causes some special characters in such emails to be displayed with
a replacement symbol for non-printable characters.
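
To make the symptom concrete, here is a small Python sketch (not part of notmuch or GMime) that decodes the same two bytes under both labels. The byte values 0x93 and 0x94 are the Windows-1252 curly quotation marks; the sample text is made up for illustration.

```python
# Bytes as a Windows-1252 mail client would emit them for curly quotes.
raw = b"\x93quotation marks\x94"

# Decoded with the correct charset, the quotes come out as U+201C / U+201D.
as_cp1252 = raw.decode("cp1252")

# Decoded as ISO 8859-1, the same bytes become C1 control characters
# (U+0093 / U+0094), which a terminal cannot display.
as_latin1 = raw.decode("latin-1")

print(as_cp1252)
print(as_latin1.isprintable())
```

The second decode does not fail; it silently yields control characters, which is why the problem only shows up at display time.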

Of course, it would be best to fix the problem on the sender's side,
making their mail client declare the encoding correctly. However,
sometimes this is just not possible and we need to make do with what we
receive. The change I would thus like to suggest is to always treat
ISO 8859-1 as Windows-1252; since the former only contains non-printable
characters in the range where the two differ, we would not lose any
printable information. According to Wikipedia, this substitution is
common in email clients and browsers because of the frequent
mislabeling [1].
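
The unconditional substitution can be sketched in Python as follows. This is only an illustration of the idea, not what the attached patch does in C; the helper name decode_lenient is hypothetical, and undefined Windows-1252 bytes are assumed to be acceptable as U+FFFD.

```python
import codecs

# Canonical name Python uses for ISO 8859-1, looked up rather than hardcoded.
LATIN1 = codecs.lookup("latin-1").name


def decode_lenient(data: bytes, label: str) -> str:
    """Decode data, treating any ISO 8859-1 label as Windows-1252.

    Raises LookupError if the label is not a charset Python knows.
    """
    if codecs.lookup(label).name == LATIN1:
        label = "cp1252"
    # The few bytes that Windows-1252 leaves undefined become U+FFFD.
    return data.decode(label, errors="replace")


# Outside 0x80-0x9F the two charsets agree byte for byte, so text that
# really is ISO 8859-1 loses nothing printable under the substitution.
same_elsewhere = all(
    bytes([b]).decode("cp1252") == bytes([b]).decode("latin-1")
    for b in range(256)
    if not 0x80 <= b <= 0x9F
)
```

For example, decode_lenient(b"caf\xe9 \x93ok\x94", "ISO-8859-1") yields the accented "e" unchanged and turns the two high bytes into curly quotes.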

Attached is a simple patch that illustrates my suggestion. While
it works well for my limited use cases, it's obviously not entirely
reliable. Does anyone have a good idea how to better handle the issue? I
searched GMime for related functionality but didn't quite find what I
was looking for. Do you feel that the issue should be raised with the
GMime people instead?

Best regards,
Sebastian

[1] https://en.wikipedia.org/wiki/ISO/IEC_8859-1#Windows-1252


diff -ura notmuch-0.27/notmuch-show.c notmuch-0.27-patched/notmuch-show.c
--- notmuch-0.27/notmuch-show.c 2018-06-13 03:42:34.000000000 +0200
+++ notmuch-0.27-patched/notmuch-show.c 2018-07-11 10:32:56.000456518 +0200
@@ -291,6 +291,8 @@
     charset = g_mime_object_get_content_type_parameter (part, "charset");
     if (charset) {
  GMimeFilter *charset_filter;
+ if (!strcmp(charset, "iso-8859-1"))
+    charset = "CP1252";
  charset_filter = g_mime_filter_charset_new (charset, "UTF-8");
  /* This result can be NULL for things like "unknown-8bit".
  * Don't set a NULL filter as that makes GMime print

_______________________________________________
notmuch mailing list
[hidden email]
https://notmuchmail.org/mailman/listinfo/notmuch
David Bremner

Re: Handling mislabeled emails encoded with Windows-1252

Sebastian Poeplau <[hidden email]> writes:

> Hi,
>
> This email is to suggest a minor change in how notmuch handles text
> encoding when displaying emails. The motivation is the following: I keep
> receiving emails that are encoded with Windows-1252 but claim to be
> ISO 8859-1. The two character sets only differ in the range between 0x80
> and 0x9F where Windows-1252 contains special characters (e.g. “quotation
> marks”) while ISO 8859-1 only has non-printable ones. The mislabeling
> thus causes some special characters in such emails to be displayed with
> a replacement symbol for non-printable characters.

Hi Sebastian;

Everyone's mail situation is unique, but I haven't noticed this
problem. Do you have a mechanical (e.g. scripted) way of detecting such
mails? I suppose it could just look for characters in the range 0x80 to
0x9F in allegedly ISO_8859-1 messages. A census of the situation in my
own mail would help me think about this problem, I think.

David


Sebastian Poeplau

Re: Handling mislabeled emails encoded with Windows-1252

Hi David,

> Everyone's mail situation is unique, but I haven't noticed this
> problem. Do you have a mechanical (e.g. scripted) way of detecting such
> mails? I suppose it could just look for characters in the range 0x80 to
> 0x9F in allegedly ISO_8859-1 messages. A census of the situation in my
> own mail would help me think about this problem, I think.

Yes, I guess that should be a good enough heuristic for detecting
affected mail. I'll try to come up with a simple script and post it
here.

Cheers,
Sebastian

Sebastian Poeplau

Re: Handling mislabeled emails encoded with Windows-1252

Hi again,

>> Everyone's mail situation is unique, but I haven't noticed this
>> problem. Do you have a mechanical (e.g. scripted) way of detecting such
>> mails? I suppose it could just look for characters in the range 0x80 to
>> 0x9F in allegedly ISO_8859-1 messages. A census of the situation in my
>> own mail would help me think about this problem, I think.
>
> Yes, I guess that should be a good enough heuristic for detecting
> affected mail. I'll try to come up with a simple script and post it
> here.

Attached is a Python script that checks individual message files and
prints their name if it finds them to contain mislabeled Windows-1252
text. The heuristic seems to work well on my mail - let me know if you
encounter any issues!
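
The attached script is not reproduced in this archive. A minimal sketch of the heuristic discussed above, assuming the check is simply "a text part labeled ISO 8859-1 whose decoded body contains a byte in 0x80-0x9F", might look like this. The function names are hypothetical and this is not necessarily how the attachment works.

```python
import codecs
import email

# Canonical name Python uses for ISO 8859-1 and its aliases.
LATIN1 = codecs.lookup("latin-1").name


def is_latin1(label):
    """True if label is any alias of ISO 8859-1 that Python recognizes."""
    try:
        return codecs.lookup(label).name == LATIN1
    except LookupError:
        return False


def has_mislabeled_cp1252(raw_message):
    """Check one message (as bytes) for text parts that claim ISO 8859-1
    but contain bytes that only make sense in Windows-1252."""
    msg = email.message_from_bytes(raw_message)
    for part in msg.walk():
        if part.get_content_maintype() != "text":
            continue
        charset = part.get_content_charset()
        if charset is None or not is_latin1(charset):
            continue
        # decode=True undoes base64 / quoted-printable and returns raw bytes.
        payload = part.get_payload(decode=True)
        if payload and any(0x80 <= b <= 0x9F for b in payload):
            return True
    return False
```

To take a census, one would read each message file in binary mode, pass the contents to has_mislabeled_cp1252, and print the file name when it returns True.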

Cheers,
Sebastian



Attachment: find_mislabeled_cp1252.py (1K)

Jeffrey Stedfast

Re: Handling mislabeled emails encoded with Windows-1252

In reply to this post by David Bremner-2
Hi all (I already sent this to David using Reply instead of Reply-All, d'oh!),

GMime actually comes with a stream filter (GMimeFilterWindows) which can auto-detect this situation.

In this particular case, you'd instantiate the GMimeFilterWindows like this:

filter = g_mime_filter_windows_new ("iso-8859-1");

"iso-8859-1" being the charset that the content claims to be in.

Then you'd pipe the raw (decoded but not converted to UTF-8) content through the filter and afterward call g_mime_filter_windows_real_charset (filter), which would return, in this user's case, "windows-1252".
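
The feed-then-query pattern described here can be mimicked in a few lines of Python. This is a hypothetical stand-in for illustration only, not GMime's implementation: the class name, its methods, and the single-entry mapping (taken from the case discussed in this thread) are all assumptions.

```python
class WindowsCharsetDetector:
    """Pass-through detector: remembers whether any byte in 0x80-0x9F
    went by, then reports which charset the content most likely uses."""

    # Only the pair discussed in this thread; a real filter may know more.
    _WINDOWS_EQUIVALENT = {"iso-8859-1": "windows-1252"}

    def __init__(self, claimed_charset):
        self.claimed_charset = claimed_charset
        self.saw_c1_byte = False

    def feed(self, chunk):
        """Inspect a chunk of raw content and return it unchanged."""
        if not self.saw_c1_byte:
            self.saw_c1_byte = any(0x80 <= b <= 0x9F for b in chunk)
        return chunk

    def real_charset(self):
        """Charset to use for conversion, valid after all data was fed."""
        if self.saw_c1_byte:
            return self._WINDOWS_EQUIVALENT.get(
                self.claimed_charset.lower(), self.claimed_charset)
        return self.claimed_charset


detector = WindowsCharsetDetector("iso-8859-1")
detector.feed(b"plain ASCII so far, ")
detector.feed(b"then \x93curly quotes\x94")
print(detector.real_charset())
```

The answer is only meaningful once the whole content has passed through, which is why the query comes after the piping step.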

Hope that helps,

Jeff


Sebastian Poeplau

Re: Handling mislabeled emails encoded with Windows-1252

Hi Jeff,

> GMime actually comes with a stream filter (GMimeFilterWindows) which can auto-detect this situation.
>
> In this particular case, you'd instantiate the GMimeFilterWindows like this:
>
> filter = g_mime_filter_windows_new ("iso-8859-1");
>
> "iso-8859-1" being the charset that the content claims to be in.
>
> Then you'd pipe the raw (decoded but not converted to UTF-8) content through the filter and afterward call g_mime_filter_windows_real_charset (filter), which would return, in this user's case, "windows-1252".

Nice, this is exactly what I was looking for! Somehow I missed it when
checking GMime. I'll adapt my local fix and post the results here.

Thanks,
Sebastian

Sebastian Poeplau

Re: Handling mislabeled emails encoded with Windows-1252

Hi all,

Here's the updated patch. It filters the message through the
GMimeFilterWindows that Jeff mentioned and then uses the charset it
detects for GMimeFilterCharset in the actual rendering of the message.
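
In outline, the patch makes two passes over the content: one that discards the output and only detects the charset, and one that does the real conversion. A rough Python analogue of that flow, with hypothetical function names and only the ISO 8859-1 case handled, could look like this:

```python
import io


def detect_charset(stream, claimed):
    """First pass: read the stream to the end, keep nothing, and decide
    which charset to use for the real conversion."""
    saw_c1_byte = False
    while chunk := stream.read(4096):
        if any(0x80 <= b <= 0x9F for b in chunk):
            saw_c1_byte = True
    if saw_c1_byte and claimed.lower() == "iso-8859-1":
        return "windows-1252"
    return claimed


def render_utf8(payload, claimed):
    """Second pass: convert the content to UTF-8 with the detected charset."""
    charset = detect_charset(io.BytesIO(payload), claimed)
    return payload.decode(charset, errors="replace").encode("utf-8")
```

The cost of this design is that every affected part is read twice, which is what makes restricting the first pass to likely candidates attractive.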

Jeff, is this how to use the filter correctly?

Cheers,
Sebastian



diff -ura notmuch-0.27/notmuch-show.c notmuch-0.27-patched/notmuch-show.c
--- notmuch-0.27/notmuch-show.c 2018-06-13 03:42:34.000000000 +0200
+++ notmuch-0.27-patched/notmuch-show.c 2018-07-28 10:25:25.358502880 +0200
@@ -271,7 +271,10 @@
 {
     GMimeContentType *content_type = g_mime_object_get_content_type (GMIME_OBJECT (part));
     GMimeStream *stream_filter = NULL;
+    GMimeStream *null_stream = NULL;
+    GMimeStream *null_stream_filter = NULL;
     GMimeFilter *crlf_filter = NULL;
+    GMimeFilter *windows_filter = NULL;
     GMimeDataWrapper *wrapper;
     const char *charset;
 
@@ -282,13 +285,27 @@
     if (stream_out == NULL)
  return;
 
+    charset = g_mime_object_get_content_type_parameter (part, "charset");
+    wrapper = g_mime_part_get_content_object (GMIME_PART (part));
+    if (wrapper && charset) {
+ /* Check for mislabeled Windows encoding */
+ null_stream = g_mime_stream_null_new ();
+ null_stream_filter = g_mime_stream_filter_new (null_stream);
+ windows_filter = g_mime_filter_windows_new (charset);
+ g_mime_stream_filter_add(GMIME_STREAM_FILTER (null_stream_filter),
+ windows_filter);
+ g_mime_data_wrapper_write_to_stream (wrapper, null_stream_filter);
+ charset = g_mime_filter_windows_real_charset(
+    (GMimeFilterWindows *) windows_filter);
+ g_object_unref (windows_filter);
+    }
+
     stream_filter = g_mime_stream_filter_new (stream_out);
     crlf_filter = g_mime_filter_crlf_new (false, false);
     g_mime_stream_filter_add(GMIME_STREAM_FILTER (stream_filter),
      crlf_filter);
     g_object_unref (crlf_filter);
 
-    charset = g_mime_object_get_content_type_parameter (part, "charset");
     if (charset) {
  GMimeFilter *charset_filter;
  charset_filter = g_mime_filter_charset_new (charset, "UTF-8");
@@ -313,9 +330,12 @@
  }
     }
 
-    wrapper = g_mime_part_get_content_object (GMIME_PART (part));
     if (wrapper && stream_filter)
  g_mime_data_wrapper_write_to_stream (wrapper, stream_filter);
+    if (null_stream_filter)
+ g_object_unref (null_stream_filter);
+    if (null_stream)
+ g_object_unref (null_stream);
     if (stream_filter)
  g_object_unref(stream_filter);
 }





Jeffrey Stedfast

Re: Handling mislabeled emails encoded with Windows-1252

Hi Sebastian,

Yes, that looks good. I would have probably unreffed the null_stream and null_stream_filter inside of that if-block rather than at the end of the function, but that's a stylistic issue that the notmuch authors can comment on. The patch as it stands should work correctly from what I can tell.

As an added optimization, you could try limiting that block of code to just when the charset is one of the iso-8859-* charsets.

The following code snippet should help with that:

charset = charset ? g_mime_charset_canon_name (charset) : NULL;
if (wrapper && charset && !g_ascii_strncasecmp (charset, "iso-8859-", 9)) {
    ...

The reason you need to use g_mime_charset_canon_name (if you decide to add the optimization) is that mail software does not always use the canonical form of the various charset names that they use. Often you will get stuff like "latin1" or "iso_8859-1".
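
The same pitfall is easy to demonstrate outside GMime. The Python sketch below uses the standard library's codec registry as a stand-in for g_mime_charset_canon_name; the helper name is_iso_8859 is hypothetical, and the canonical spelling "iso8859-N" is Python's, not GMime's.

```python
import codecs


def is_iso_8859(label):
    """True if label names any ISO 8859 charset, whatever alias is used."""
    try:
        # Resolve aliases such as "latin1" or "ISO_8859-1" to one spelling.
        canonical = codecs.lookup(label).name
    except LookupError:
        # Labels like "unknown-8bit" are not charsets at all.
        return False
    return canonical.startswith("iso8859-")


for label in ("latin1", "ISO_8859-1", "iso-8859-15", "utf-8", "unknown-8bit"):
    print(label, is_iso_8859(label))
```

A prefix test on the raw label would accept "iso-8859-15" but miss "latin1", which is the case canonicalization is meant to catch.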

Hope that helps,

Jeff


Sebastian Poeplau

Re: Handling mislabeled emails encoded with Windows-1252

Hi,

> Yes, that looks good. I would have probably unreffed the null_stream
> and null_stream_filter inside of that if-block rather than at the end
> of the function, but that's a stylistic issue that the notmuch authors
> can comment on. The patch as it stands should work correctly from what
> I can tell.

I was worried about the string returned by
g_mime_filter_windows_real_charset: once I unref everything, isn't there
a risk of the filter being deleted? As far as I can tell from the code,
the returned charset might be a pointer into the filter object...

> As an added optimization, you could try limiting that block of code to
> just when the charset is one of the iso-8859-* charsets.
>
> The following code snippet should help with that:
>
> charset = charset ? g_mime_charset_canon_name (charset) : NULL;
> if (wrapper && charset && !g_ascii_strncasecmp (charset, "iso-8859-", 9)) {
>     ...
>
> The reason you need to use g_mime_charset_canon_name (if you decide to
> add the optimization) is that mail software does not always use the
> canonical form of the various charset names that they use. Often you
> will get stuff like "latin1" or "iso_8859-1".

Nice, I'll add it.

Thanks a lot,
Sebastian

Sebastian Poeplau

Re: Handling mislabeled emails encoded with Windows-1252

Hi,

>> As an added optimization, you could try limiting that block of code to
>> just when the charset is one of the iso-8859-* charsets.
>>
>> The following code snippet should help with that:
>>
>> charset = charset ? g_mime_charset_canon_name (charset) : NULL;
>> if (wrapper && charset && !g_ascii_strncasecmp (charset, "iso-8859-", 9)) {
>>     ...
>>
>> The reason you need to use g_mime_charset_canon_name (if you decide to
>> add the optimization) is that mail software does not always use the
>> canonical form of the various charset names that they use. Often you
>> will get stuff like "latin1" or "iso_8859-1".
>
> Nice, I'll add it.

Updated patch attached.

Cheers,
Sebastian



diff -ura notmuch-0.27/notmuch-show.c notmuch-0.27-patched/notmuch-show.c
--- notmuch-0.27/notmuch-show.c 2018-06-13 03:42:34.000000000 +0200
+++ notmuch-0.27-patched/notmuch-show.c 2018-07-30 09:41:05.491636418 +0200
@@ -272,6 +272,7 @@
     GMimeContentType *content_type = g_mime_object_get_content_type (GMIME_OBJECT (part));
     GMimeStream *stream_filter = NULL;
     GMimeFilter *crlf_filter = NULL;
+    GMimeFilter *windows_filter = NULL;
     GMimeDataWrapper *wrapper;
     const char *charset;
 
@@ -282,13 +283,37 @@
     if (stream_out == NULL)
  return;
 
+    charset = g_mime_object_get_content_type_parameter (part, "charset");
+    charset = charset ? g_mime_charset_canon_name (charset) : NULL;
+    wrapper = g_mime_part_get_content_object (GMIME_PART (part));
+    if (wrapper && charset && !g_ascii_strncasecmp (charset, "iso-8859-", 9)) {
+ GMimeStream *null_stream = NULL;
+ GMimeStream *null_stream_filter = NULL;
+
+ /* Check for mislabeled Windows encoding */
+ null_stream = g_mime_stream_null_new ();
+ null_stream_filter = g_mime_stream_filter_new (null_stream);
+ windows_filter = g_mime_filter_windows_new (charset);
+ g_mime_stream_filter_add(GMIME_STREAM_FILTER (null_stream_filter),
+ windows_filter);
+ g_mime_data_wrapper_write_to_stream (wrapper, null_stream_filter);
+ charset = g_mime_filter_windows_real_charset(
+    (GMimeFilterWindows *) windows_filter);
+
+ if (null_stream_filter)
+    g_object_unref (null_stream_filter);
+ if (null_stream)
+    g_object_unref (null_stream);
+ /* Keep a reference to windows_filter in order to prevent the
+ * charset string from deallocation. */
+    }
+
     stream_filter = g_mime_stream_filter_new (stream_out);
     crlf_filter = g_mime_filter_crlf_new (false, false);
     g_mime_stream_filter_add(GMIME_STREAM_FILTER (stream_filter),
      crlf_filter);
     g_object_unref (crlf_filter);
 
-    charset = g_mime_object_get_content_type_parameter (part, "charset");
     if (charset) {
  GMimeFilter *charset_filter;
  charset_filter = g_mime_filter_charset_new (charset, "UTF-8");
@@ -313,11 +338,12 @@
  }
     }
 
-    wrapper = g_mime_part_get_content_object (GMIME_PART (part));
     if (wrapper && stream_filter)
  g_mime_data_wrapper_write_to_stream (wrapper, stream_filter);
     if (stream_filter)
  g_object_unref(stream_filter);
+    if (windows_filter)
+ g_object_unref (windows_filter);
 }
 
 static const char*

David Bremner

Re: Handling mislabeled emails encoded with Windows-1252

Sebastian Poeplau <[hidden email]> writes:

>> Nice, I'll add it.
>
> Updated patch attached.
>
> Cheers,
> Sebastian

Thanks to both of you for working on this. The code looks ok to me, I
have only some procedural comments.

In order to merge it I'll need at least one test. I think
test/T300-encoding.sh is probably the right place. There are a few
different styles of test; you can either put things in variables as in
that file, or use the more dominant

test_begin_subtest "description"
cat << EOF > EXPECTED
this is my expected output
EOF
notmuch show STUFF > OUTPUT
test_expect_equal_file EXPECTED OUTPUT

Feel free to bug the list for help on making tests (or #notmuch on
freenode).

Please also use git-send-email to send your patch(es), and write the
commit messages with an eye to

         https://notmuchmail.org/contributing/#index5h2

To minimize the chance of problems, it's probably best to base your
commits on master, although the patch you sent applied fine here.

Thanks,

David


Sebastian Poeplau

Re: Handling mislabeled emails encoded with Windows-1252

Hi David,

Thanks for the hints! I'll prepare a test and the patch based on master
shortly.

Cheers,
Sebastian

