RFC: Multiple filenames for email messages

Jan Janak-2

RFC: Multiple filenames for email messages

The comment of _notmuch_message_set_filename says:

   XXX: We should still figure out if we think it's important to store
   multiple filenames for email messages with identical message IDs.

I have lots of such messages in my email collection, both in my local
copy of my Gmail account and also in the local copy of my company's
IMAP account.

My dream mail indexing tool should be able to apply tags automatically
based on, among other things, the name of the directory the message is
stored in. If there are multiple copies of the same message scattered
across multiple directories, I would like to apply more tags.

I assume that most tags will be applied (either manually or
automatically) after 'notmuch-new'; I currently do some of this with a
simple shell script. The script does not apply tags based on directory
names yet, but it would make notmuch really flexible if we could do
that *and* if we could get access to all filenames of a particular
message.
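
A minimal sketch of the directory-based tagging described above, assuming
a Maildir-style layout. The function name, the mail root, and the example
paths are hypothetical and are not part of notmuch; the point is only that
each extra copy of a message can contribute extra tags once all of its
filenames are available.

```python
import os

def tags_from_filenames(filenames, mail_root):
    """Return the set of tags implied by the directories holding each copy."""
    maildir_parts = {"cur", "new", "tmp"}  # Maildir bookkeeping dirs, not tags
    tags = set()
    for path in filenames:
        # Directory of this copy, relative to the (assumed) mail root.
        rel_dir = os.path.dirname(os.path.relpath(path, mail_root))
        for part in rel_dir.split(os.sep):
            if part and part not in maildir_parts:
                tags.add(part.lower())
    return tags

# Two hypothetical copies of one message, stored in different folders.
copies = [
    "/home/jan/mail/gmail/INBOX/cur/1258843044.M1P2.host",
    "/home/jan/mail/gmail/notmuch-list/cur/1258843044.M3P4.host",
]
print(sorted(tags_from_filenames(copies, "/home/jan/mail")))
```

With only one filename stored per message, the second copy's folder would
never be seen and its tag would be lost.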

I'd like to propose that we store all filenames for email messages in
the database, not just one per message. I'd be happy to work on it and
submit a patch if others think that this would be good to have.

  -- Jan

Carl Worth-2

Re: RFC: Multiple filenames for email messages

On Sat, 21 Nov 2009 23:37:24 +0100, Jan Janak <[hidden email]> wrote:
> The comment of _notmuch_message_set_filename says:
>
>    XXX: We should still figure out if we think it's important to store
>    multiple filenames for email messages with identical message IDs.
...
> I'd like to propose that we store all filenames for email messages in
> the database, not just one per message. I'd be happy to work on it and
> submit a patch if others think that this would be good to have.

Oh, sure. As soon as we start using filenames for searches, then that
makes a lot of sense.

Currently, notmuch isn't storing any filename terms that way, but it
should be (we just need to add a prefix to the table at the top of
lib/database.cc, document it, and then make the indexing stage generate
terms from the filename with that prefix).

The term generator and query parser should do the right thing, which is
to split the filename into individual terms at each '/', store position
data with each, and then turn a search like:

        filename:some/filename/segment

into a phrase search that looks for the terms "some", "filename", and
"segment", each with the filename prefix you choose and each in
sequential position. Note that if you compile notmuch with CFLAGS
including -DDEBUG then you'll see a nice report of the post-parsed query
that's useful for debugging stuff like this.
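
The splitting and phrase matching described above can be illustrated with
a small standalone sketch. This is not Xapian or notmuch code: the
"XFILENAME" prefix and both function names are assumptions made for the
example, and the matching is a plain in-memory imitation of a positional
phrase search.

```python
def filename_terms(path, prefix="XFILENAME"):
    """Split a path at '/' into (prefixed term, position) pairs."""
    segments = [s for s in path.split("/") if s]
    return [(prefix + seg, pos) for pos, seg in enumerate(segments, start=1)]

def phrase_matches(indexed_path, query, prefix="XFILENAME"):
    """True if the query's segments occur at consecutive positions."""
    # Build a tiny positional index: term -> set of positions.
    positions = {}
    for term, pos in filename_terms(indexed_path, prefix):
        positions.setdefault(term, set()).add(pos)
    wanted = [term for term, _ in filename_terms(query, prefix)]
    if not wanted:
        return False
    # Try every position of the first term as the start of the phrase.
    return any(
        all(start + i in positions.get(term, ()) for i, term in enumerate(wanted))
        for start in positions.get(wanted[0], ())
    )

indexed = "mail/some/filename/segment/cur/msg1"
print(phrase_matches(indexed, "some/filename/segment"))
print(phrase_matches(indexed, "some/segment"))
```

The second query contains terms that are both present in the path, but not
at adjacent positions, so only the first query counts as a phrase match.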

The reason for my comment was related to the other use of the filename,
(that is, the only one we're currently using). This is with regard to
querying the database for the actual filename, rather than searching on
it. For this, we don't use terms, but instead use the "data" field of
the document. I was wondering if in the presentation of an email message
it would ever be important to have access to the multiple files.

Can anyone think of a case where they would need that? That is, a case
where you care about the distinct content of two messages that have the
same message ID?

I suppose that in the case of getting a message by two paths, (say
through a mailing list and also via CC), one might want to inspect the
different headers in the two versions. So maybe we'll need to break down
and provide this information to the interfaces.

Also, if we're going to support file deletion well, then I suppose we
really will need to store all the filenames, (so if one disappears we
can still point to the others). Also, we'll need to be able to
accurately update the filename terms when a message disappears, so that
means having all of the complete filenames around.

So I guess I'm convincing myself that we really should store all the
filenames, and also provide an interface to get a list of filenames for
a message, (but also expect that many users of the API will only want to
look at the first filename in the list).
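
As a rough illustration of that interface, here is a sketch of a store
that keeps every filename per message ID, returns the full list or just
the first entry, and copes with one copy being deleted. The class and
method names are hypothetical and do not correspond to the notmuch API.

```python
class MessageFilenames:
    """Hypothetical in-memory map from message ID to all known filenames."""

    def __init__(self):
        self._by_id = {}  # message ID -> list of filenames, in order added

    def add(self, message_id, filename):
        names = self._by_id.setdefault(message_id, [])
        if filename not in names:
            names.append(filename)

    def remove(self, message_id, filename):
        """Drop one copy; return True if the message has no files left."""
        names = self._by_id.get(message_id, [])
        if filename in names:
            names.remove(filename)
        if not names:
            self._by_id.pop(message_id, None)
            return True
        return False

    def filenames(self, message_id):
        """All known filenames for the message (a copy of the list)."""
        return list(self._by_id.get(message_id, []))

    def filename(self, message_id):
        """The first filename, for callers that only need one."""
        names = self._by_id.get(message_id)
        return names[0] if names else None

store = MessageFilenames()
store.add("<id@example>", "list/cur/msg-a")
store.add("<id@example>", "inbox/cur/msg-b")
store.remove("<id@example>", "list/cur/msg-a")
print(store.filename("<id@example>"))
```

After the first copy disappears, the single-filename accessor falls back
to the remaining copy instead of pointing at a file that no longer exists.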

-Carl