subjects and duplicated message id's

classic Classic list List threaded Threaded
9 messages Options
David Bremner-2 David Bremner-2
Reply | Threaded
Open this post in threaded view
|

subjects and duplicated message id's


There are currently several somewhat related issues with notmuch's
handling of subject headers for messages with duplicate message-ids
(i.e. several files on disk with the same message id).  These are all
reflections of the fact that we use a value slot for subjects in the
database message document (i.e. the database object keyed by the
message-id).  Among other things, using a value slot is what makes
regular expression searching (and potentially sorting) by subject work.

When we have multiple files with the same message-id, but different
subjects (probably indicating a "real" mid collision).

1. The output of notmuch-show can be inconsistent with notmuch-search

   - this is because show is reading from the lexicographically first
     file, while show is reading the database value slot.

   - in principle this could be fixed by making show read the value
     slot; but then the subject might not be consistent with the rest of
     the message content. Also, it looks like a bit of a pain to refactor
     so all that sprinter code has database access.

   - we could also force the value slot to have the lexico first files'
     subject during indexing. This would be a bit fiddly, but localized.
     It would have the surprising effect of having the subject updated
     when new messages arrived.

2. Regular expression search doesn't work for subjects not in the value
   slot.

   - this could be fixed by putting all subjects in the value slot,
     perhaps as newline seperated strings. This would also be a
     potential solution for the "subject hiding" issue mentioned above,
     although it would take some front-end effort as well to deal with
     "multi-subjects".  This could be reported in e.g. json output as an
     array of subjects.

I'm open to other, better ideas of how to do this. I'm also curious how
important people think these bugs are.

d
           
     
_______________________________________________
notmuch mailing list
[hidden email]
https://notmuchmail.org/mailman/listinfo/notmuch
David Bremner-2 David Bremner-2
Reply | Threaded
Open this post in threaded view
|

[PATCH] test: add known broken test for regexp search of second subject

We expect this to give the same answer as the non-regexp subject
search. It does not because the regexp search relies on the value
slot, which currently contains only one subject.
---
 test/T670-duplicate-mid.sh | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/test/T670-duplicate-mid.sh b/test/T670-duplicate-mid.sh
index c198c506..bf8cc3a8 100755
--- a/test/T670-duplicate-mid.sh
+++ b/test/T670-duplicate-mid.sh
@@ -47,6 +47,16 @@ EOF
 notmuch search --output=files subject:'"message 2"' | notmuch_dir_sanitize > OUTPUT
 test_expect_equal_file EXPECTED OUTPUT
 
+test_begin_subtest 'Regexp search for second subject'
+test_subtest_known_broken
+cat <<EOF >EXPECTED
+MAIL_DIR/copy0
+MAIL_DIR/copy1
+MAIL_DIR/copy2
+EOF
+notmuch search --output=files 'subject:"/message 2/"' | notmuch_dir_sanitize > OUTPUT
+test_expect_equal_file EXPECTED OUTPUT
+
 add_message '[id]="duplicate"' '[body]="sekrit" [filename]=copy3'
 test_begin_subtest 'search for body in duplicate file'
 cat <<EOF >EXPECTED
--
2.15.0

_______________________________________________
notmuch mailing list
[hidden email]
https://notmuchmail.org/mailman/listinfo/notmuch
Daniel Kahn Gillmor Daniel Kahn Gillmor
Reply | Threaded
Open this post in threaded view
|

Re: subjects and duplicated message id's

In reply to this post by David Bremner-2
On Thu 2017-12-14 10:03:12 -0400, David Bremner wrote:

> There are currently several somewhat related issues with notmuch's
> handling of subject headers for messages with duplicate message-ids
> (i.e. several files on disk with the same message id).  These are all
> reflections of the fact that we use a value slot for subjects in the
> database message document (i.e. the database object keyed by the
> message-id).  Among other things, using a value slot is what makes
> regular expression searching (and potentially sorting) by subject work.
>
> When we have multiple files with the same message-id, but different
> subjects (probably indicating a "real" mid collision).
>
> 1. The output of notmuch-show can be inconsistent with notmuch-search
>
>    - this is because show is reading from the lexicographically first
>      file, while show is reading the database value slot.
you've got two "show"s here.  i think the second "show" is meant to be
"search".

>    - in principle this could be fixed by making show read the value
>      slot; but then the subject might not be consistent with the rest of
>      the message content. Also, it looks like a bit of a pain to refactor
>      so all that sprinter code has database access.
>
>    - we could also force the value slot to have the lexico first files'
>      subject during indexing. This would be a bit fiddly, but localized.
>      It would have the surprising effect of having the subject updated
>      when new messages arrived.

This is a bit weird, unless we also force "notmuch show" to always show
the lexicographically-first file as well, no?

> 2. Regular expression search doesn't work for subjects not in the value
>    slot.
>
>    - this could be fixed by putting all subjects in the value slot,
>      perhaps as newline seperated strings. This would also be a
>      potential solution for the "subject hiding" issue mentioned above,
>      although it would take some front-end effort as well to deal with
>      "multi-subjects".  This could be reported in e.g. json output as an
>      array of subjects.
>
> I'm open to other, better ideas of how to do this. I'm also curious how
> important people think these bugs are.
I think this is important to get right, thanks for raising it. I'll add
my own wrinkle below:

I'm looking at implementing "protected headers" (for recieving messages)
right now, where Subject: is the most important header that is typically
sent under encrypted cover (e.g. enigmail's "memory hole" implementation
does this).

With indexing of cleartext, we have a decision about which thing to put
in the value slot -- the "original" subject (that is, the subject that
the message sender wrote, which arrives inside the message encryption
for messages with protected headers) or the as-delivered external "stub"
header (typically the literal string "encrypted message").

Obviously, i'd prefer the original subject, so that it's searchable.
I'd also like it if "notmuch show" displayed the original subject when
showing the encrypted message.

However, if someone does "notmuch show --decrypt=false" i'd want it to
display the as-delivered header.

Does this help push you toward any specific decision?  I'm also not sure
what the correct solution is right now.

     --dkg

_______________________________________________
notmuch mailing list
[hidden email]
https://notmuchmail.org/mailman/listinfo/notmuch

signature.asc (847 bytes) Download Attachment
David Bremner-2 David Bremner-2
Reply | Threaded
Open this post in threaded view
|

Re: subjects and duplicated message id's


>>    - we could also force the value slot to have the lexico first files'
>>      subject during indexing. This would be a bit fiddly, but localized.
>>      It would have the surprising effect of having the subject updated
>>      when new messages arrived.
>
> This is a bit weird, unless we also force "notmuch show" to always show
> the lexicographically-first file as well, no?

AFAIU, this is the current behaviour, see the second test in T670-duplicate-mid.

> Obviously, i'd prefer the original subject, so that it's searchable.
> I'd also like it if "notmuch show" displayed the original subject when
> showing the encrypted message.
>
> However, if someone does "notmuch show --decrypt=false" i'd want it to
> display the as-delivered header.
>
> Does this help push you toward any specific decision?  I'm also not sure
> what the correct solution is right now.

I don't think it really pushes one way or the other. The --decrypt=false
case could just look at the file instead of the database (that's what
notmuch-show does now.

_______________________________________________
notmuch mailing list
[hidden email]
https://notmuchmail.org/mailman/listinfo/notmuch
David Bremner-2 David Bremner-2
Reply | Threaded
Open this post in threaded view
|

WIP, all subjects in value slot

This is a proof of concept for adding all subjects into the value
slot. It's enough to get fix the regexp search test earlier in the thread.
It doesn't yet sort subjects by filename.

_______________________________________________
notmuch mailing list
[hidden email]
https://notmuchmail.org/mailman/listinfo/notmuch
David Bremner-2 David Bremner-2
Reply | Threaded
Open this post in threaded view
|

[PATCH] WIP: add all subjects to value.

---
 lib/add-message.cc         |  3 +--
 lib/message.cc             | 52 ++++++++++++++++++++++++++++++++++++++++------
 test/T670-duplicate-mid.sh |  1 -
 3 files changed, 47 insertions(+), 9 deletions(-)

diff --git a/lib/add-message.cc b/lib/add-message.cc
index f5fac8be..095a1f37 100644
--- a/lib/add-message.cc
+++ b/lib/add-message.cc
@@ -538,8 +538,7 @@ notmuch_database_index_file (notmuch_database_t *notmuch,
  if (ret)
     goto DONE;
 
- if (is_new || is_ghost)
-    _notmuch_message_set_header_values (message, date, from, subject);
+ _notmuch_message_set_header_values (message, date, from, subject);
 
  if (!indexopts) {
     def_indexopts = notmuch_database_get_default_indexopts (notmuch);
diff --git a/lib/message.cc b/lib/message.cc
index d5db89b6..c624e145 100644
--- a/lib/message.cc
+++ b/lib/message.cc
@@ -26,6 +26,8 @@
 
 #include <gmime/gmime.h>
 
+#include <sstream>
+
 struct _notmuch_message {
     notmuch_database_t *notmuch;
     Xapian::docid doc_id;
@@ -514,8 +516,14 @@ notmuch_message_get_header (notmuch_message_t *message, const char *header)
 
     if (slot != Xapian::BAD_VALUENO) {
  try {
-    std::string value = message->doc.get_value (slot);
-
+    std::string raw = message->doc.get_value (slot);
+    std::string value;
+    if (slot == NOTMUCH_VALUE_SUBJECT) {
+ std::istringstream f(raw);
+ std::getline(f, value);
+    } else {
+ value = raw;
+    }
     /* If we have NOTMUCH_FEATURE_FROM_SUBJECT_ID_VALUES, then
      * empty values indicate empty headers.  If we don't, then
      * it could just mean we didn't record the header. */
@@ -655,6 +663,27 @@ _notmuch_message_remove_indexed_terms (notmuch_message_t *message)
     }
     return NOTMUCH_PRIVATE_STATUS_SUCCESS;
 }
+/* Remove all values from a document; currently these are
+   all regenerated during indexing */
+
+notmuch_private_status_t
+_notmuch_message_remove_values (notmuch_message_t *message)
+{
+    try {
+ message->doc.clear_values ();
+ message->modified = TRUE;
+    } catch (const Xapian::Error &error) {
+ notmuch_database_t *notmuch = message->notmuch;
+
+ if (!notmuch->exception_reported) {
+    _notmuch_database_log(_notmuch_message_database (message), "A Xapian exception occurred creating message: %s\n",
+      error.get_msg().c_str());
+    notmuch->exception_reported = TRUE;
+ }
+ return NOTMUCH_PRIVATE_STATUS_XAPIAN_EXCEPTION;
+    }
+    return NOTMUCH_PRIVATE_STATUS_SUCCESS;
+}
 
 /* Return true if p points at "new" or "cur". */
 static bool is_maildir (const char *p)
@@ -1097,6 +1126,7 @@ _notmuch_message_set_header_values (notmuch_message_t *message,
     const char *subject)
 {
     time_t time_value;
+    std::string old_subject;
 
     /* GMime really doesn't want to see a NULL date, so protect its
      * sensibilities. */
@@ -1114,7 +1144,13 @@ _notmuch_message_set_header_values (notmuch_message_t *message,
     message->doc.add_value (NOTMUCH_VALUE_TIMESTAMP,
     Xapian::sortable_serialise (time_value));
     message->doc.add_value (NOTMUCH_VALUE_FROM, from);
-    message->doc.add_value (NOTMUCH_VALUE_SUBJECT, subject);
+
+    old_subject = message->doc.get_value (NOTMUCH_VALUE_SUBJECT);
+    if (old_subject.empty())
+ message->doc.add_value (NOTMUCH_VALUE_SUBJECT, subject);
+    else
+ message->doc.add_value (NOTMUCH_VALUE_SUBJECT, old_subject + "\n" + subject);
+
     message->modified = true;
 }
 
@@ -1999,6 +2035,12 @@ notmuch_message_reindex (notmuch_message_t *message,
  goto DONE;
     }
 
+    private_status = _notmuch_message_remove_values (message);
+    if (private_status) {
+ ret = COERCE_STATUS(private_status, "error values");
+ goto DONE;
+    }
+
     ret = notmuch_message_remove_all_properties_with_prefix (message, "index.");
     if (ret)
  goto DONE; /* XXX TODO: distinguish from other error returns above? */
@@ -2043,9 +2085,7 @@ notmuch_message_reindex (notmuch_message_t *message,
     thread_id = orig_thread_id;
 
  _notmuch_message_add_term (message, "thread", thread_id);
- /* Take header values only from first filename */
- if (found == 0)
-    _notmuch_message_set_header_values (message, date, from, subject);
+ _notmuch_message_set_header_values (message, date, from, subject);
 
  ret = _notmuch_message_index_file (message, indexopts, message_file);
 
diff --git a/test/T670-duplicate-mid.sh b/test/T670-duplicate-mid.sh
index bf8cc3a8..cfc5dafb 100755
--- a/test/T670-duplicate-mid.sh
+++ b/test/T670-duplicate-mid.sh
@@ -48,7 +48,6 @@ notmuch search --output=files subject:'"message 2"' | notmuch_dir_sanitize > OUT
 test_expect_equal_file EXPECTED OUTPUT
 
 test_begin_subtest 'Regexp search for second subject'
-test_subtest_known_broken
 cat <<EOF >EXPECTED
 MAIL_DIR/copy0
 MAIL_DIR/copy1
--
2.15.1

_______________________________________________
notmuch mailing list
[hidden email]
https://notmuchmail.org/mailman/listinfo/notmuch
David Bremner-2 David Bremner-2
Reply | Threaded
Open this post in threaded view
|

Re: [PATCH] test: add known broken test for regexp search of second subject

In reply to this post by David Bremner-2
David Bremner <[hidden email]> writes:

> We expect this to give the same answer as the non-regexp subject
> search. It does not because the regexp search relies on the value
> slot, which currently contains only one subject.

Pushed to master. I still need to revisit the fix.

d
_______________________________________________
notmuch mailing list
[hidden email]
https://notmuchmail.org/mailman/listinfo/notmuch
Daniel Kahn Gillmor Daniel Kahn Gillmor
Reply | Threaded
Open this post in threaded view
|

Re: [PATCH] WIP: add all subjects to value.

In reply to this post by David Bremner-2
I like the ideas behind this patch.  it's labeled WIP, presumably
because it doesn't handle ordering the subject lines, right?

further comments below inline:

On Tue 2017-12-19 09:15:40 -0400, David Bremner wrote:

> +/* Remove all values from a document; currently these are
> +   all regenerated during indexing */
> +
> +notmuch_private_status_t
> +_notmuch_message_remove_values (notmuch_message_t *message)
> +{
> +    try {
> + message->doc.clear_values ();
> + message->modified = TRUE;
> +    } catch (const Xapian::Error &error) {
> + notmuch_database_t *notmuch = message->notmuch;
> +
> + if (!notmuch->exception_reported) {
> +    _notmuch_database_log(_notmuch_message_database (message), "A Xapian exception occurred creating message: %s\n",
> +      error.get_msg().c_str());
is this the right exception message?  seems like it should talk about
removing values rather than "creating message"

> @@ -1114,7 +1144,13 @@ _notmuch_message_set_header_values (notmuch_message_t *message,
>      message->doc.add_value (NOTMUCH_VALUE_TIMESTAMP,
>      Xapian::sortable_serialise (time_value));
>      message->doc.add_value (NOTMUCH_VALUE_FROM, from);
> -    message->doc.add_value (NOTMUCH_VALUE_SUBJECT, subject);
> +
> +    old_subject = message->doc.get_value (NOTMUCH_VALUE_SUBJECT);
> +    if (old_subject.empty())
> + message->doc.add_value (NOTMUCH_VALUE_SUBJECT, subject);
> +    else
> + message->doc.add_value (NOTMUCH_VALUE_SUBJECT, old_subject + "\n" + subject);
here we're appending the subject to the tail -- so the first injected
subject line stays at the top.

it looks like it will re-add the same subject line multiple times,
even if already present.  so if i get 3 copies of a message with subject
"foo" then the value slot will be "foo\nfoo\nfoo".  is that desirable?

i think either we care about careful ordering (which strikes me as
delicate and potentially difficult to handle, esp. when you get into the
headers changing depending on whether you index the cleartext or not),
or we should avoid injecting duplicates.

wdyt?

        --dkg

_______________________________________________
notmuch mailing list
[hidden email]
https://notmuchmail.org/mailman/listinfo/notmuch

signature.asc (233 bytes) Download Attachment
David Bremner-2 David Bremner-2
Reply | Threaded
Open this post in threaded view
|

Re: [PATCH] WIP: add all subjects to value.

Daniel Kahn Gillmor <[hidden email]> writes:

>
> i think either we care about careful ordering (which strikes me as
> delicate and potentially difficult to handle, esp. when you get into the
> headers changing depending on whether you index the cleartext or not),
> or we should avoid injecting duplicates.
>

I'm not opposed to keeping the set of unique headers (although there is
a bit of a tarpit in deciding when two Subjects are the "same"). The
question I'm not sure of a good answer to is how we handle display. In
particular this patch doesn't make any effort to keep the subject
consistent with the Date and From headers. From the point of view of
displaying things, it makes sense to keep "parallel lists" in the value
slots by doing the same trick as with Subject. On the gripping hand,
sensible date search really requires one value per document.  In
principle we could make more extensive use of the document "data
area". This can store an arbitrary string, and is the recommended place
to store information needed to display the document.

_______________________________________________
notmuch mailing list
[hidden email]
https://notmuchmail.org/mailman/listinfo/notmuch