notmuch ignoring alot of emails

Alexei Gilchrist

notmuch ignoring alot of emails

Hi

When I run notmuch I get a bunch (hundreds) of emails that are ignored
with:

Note: Ignoring non-mail file: ...

The files are valid maildir files but have a paragraph somewhere in the
body where someone has written "From ".

Is there a fix to force the recognition of maildir files in this case? I
thought this was a solved problem with gmime since 2.6.7.

Sorry for the pun in the subject, but I am using alot and I only see the
messages notmuch sees. neomutt has no issues seeing these messages, but I
want a tighter integration with notmuch.

I'm on a mac and compiled notmuch-0.28.3; installed gmime 3.2.3 with
brew, and verified notmuch was linking against it:

≻ otool -L /usr/local/bin/notmuch
/usr/local/bin/notmuch:
        /usr/local/lib/libnotmuch.5.dylib (compatibility version 5.2.0, current version 5.2.0)
        /usr/local/opt/gmime/lib/libgmime-3.0.0.dylib (compatibility version 202.0.0, current version 202.2.0)
        /usr/local/opt/glib/lib/libgio-2.0.0.dylib (compatibility version 6001.0.0, current version 6001.0.0)
        /usr/local/opt/glib/lib/libgobject-2.0.0.dylib (compatibility version 6001.0.0, current version 6001.0.0)
        /usr/local/opt/glib/lib/libglib-2.0.0.dylib (compatibility version 6001.0.0, current version 6001.0.0)
        /usr/local/opt/gettext/lib/libintl.8.dylib (compatibility version 10.0.0, current version 10.5.0)
        /usr/local/opt/talloc/lib/libtalloc.dylib (compatibility version 0.0.0, current version 0.0.0)
        /usr/lib/libz.1.dylib (compatibility version 1.0.0, current version 1.2.11)
        /usr/local/opt/xapian/lib/libxapian.30.dylib (compatibility version 36.0.0, current version 36.1.0)
        /usr/lib/libc++.1.dylib (compatibility version 1.0.0, current version 400.9.4)
        /usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current version 1252.200.5)

≻ ls -l /usr/local/opt/gmime
lrwxr-xr-x  1 alexei  admin  21 23 Mar 12:09 /usr/local/opt/gmime -> ../Cellar/gmime/3.2.3

≻ ls -l /usr/local/Cellar/gmime/3.2.3/lib/
total 2280
drwxr-xr-x  3 alexei  staff      96 27 Nov 11:09 girepository-1.0
-rw-r--r--  1 alexei  staff  444500 23 Mar 12:09 libgmime-3.0.0.dylib
-r--r--r--  1 alexei  staff  720504 27 Nov 11:09 libgmime-3.0.a
lrwxr-xr-x  1 alexei  staff      20 27 Nov 11:09 libgmime-3.0.dylib -> libgmime-3.0.0.dylib
drwxr-xr-x  3 alexei  staff      96 23 Mar 12:09 pkgconfig

Any ideas for a fix?

cheers,

Alexei
_______________________________________________
notmuch mailing list
[hidden email]
https://notmuchmail.org/mailman/listinfo/notmuch
David Bremner-2

Re: notmuch ignoring alot of emails

"Alexei Gilchrist" <[hidden email]> writes:

> Hi
>
> When I run notmuch I get a bunch (hundreds) of emails that are ignored
> with:
>
> Note: Ignoring non-mail file: ...
>
> The files are valid maildir files but have a paragraph somewhere in the
> body where someone has written "From ".
>

And do they also have a line starting with "From " as the first
line? This makes them mbox files. The second "From " makes them mbox
files with multiple messages. Notmuch thinks your MDA (the thing that
made those files) is misconfigured, assuming my guess about the format
is correct.
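
To make the distinction above concrete, here is a rough Python sketch. It is only an illustration of the description in this message, not notmuch's actual code, and the function name `classify_mail_file` is made up for the example.

```python
def classify_mail_file(data: bytes) -> str:
    """Classify raw file contents by their "From " lines (simplified heuristic)."""
    lines = data.splitlines()
    # An mbox message starts with a "From " separator line. Note the trailing
    # space, which distinguishes it from the "From:" header.
    if not lines or not lines[0].startswith(b"From "):
        return "plain message"
    # Any later line starting with "From " looks like the start of another
    # message, even if it is really just body text.
    if any(line.startswith(b"From ") for line in lines[1:]):
        return "multi-message mbox"
    return "single-message mbox"
```

With this heuristic, a body line such as "From bad parsing comes chaos" is harmless in a file that starts with "Return-Path:", but turns a file that starts with a "From " separator line into an apparent multi-message mbox.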

> Is there a fix to force the recognition of maildir files in this case? I
> thought this was a solved problem with gmime since 2.6.7.

There is not currently a way to do that. It's not a GMime problem, it's
a design choice of notmuch to avoid parsing multiple-message
mboxes. That was originally added as a safety feature, and I think it
should probably stay the default. If someone wants to work on adding a
configuration switch I can point them in the right direction.




Alexei Gilchrist

Re: notmuch ignoring alot of emails

>> When I run notmuch I get a bunch (hundreds) of emails that are ignored
>> with:
>>
>> Note: Ignoring non-mail file: ...
>>
>> The files are valid maildir files but have a paragraph somewhere in the
>> body where someone has written "From ".
>
> And do they also have a line starting with "From " as the first
> line? This makes them mbox files. The second "From " makes them mbox
> files with multiple messages. Notmuch thinks your MDA (the thing that
> made those files) is misconfigured, assuming my guess about the format
> is correct.

Every message file begins with “From ”. This is true of all messages downloaded by both offlineimap (with type = Maildir) and mbsync.
neomutt has no issues dealing with these files as maildir, and mu has no issues indexing them either. I’m assuming that starting with “From ” is part of the maildir spec.

The problem occurs specifically with notmuch. If someone sends a message with a line that begins with “From ” in the body, then it confuses notmuch.

mu can correctly index these messages, but my mu is linked against libgmime-2.6, whereas my notmuch (0.28.3) is linked against libgmime-3.0.

>> Is there a fix to force the recognition of maildir files in this case? I
>> thought this was a solved problem with gmime since 2.6.7.
>
> There is not currently a way to do that. It's not a GMime problem, it's
> a design choice of notmuch to avoid parsing multiple-message
> mboxes. That was originally added as a safety feature, and I think it
> should probably stay the default. If someone wants to work on adding a
> configuration switch I can point them in the right direction.

This is a poor design decision. It means anyone on the internet can break your mail setup simply by sending a message with a line starting with “From ” (and using the usual quoted-printable Content-Transfer-Encoding).

Try it. Send yourself a message with the line “From bad parsing comes chaos” and see if your notmuch can find it. My version can’t.


David Bremner-2

Re: notmuch ignoring alot of emails

"Alexei Gilchrist" <[hidden email]> writes:

>
> Try it. Send yourself a message with the line “From bad parsing comes
> chaos” and see if your notmuch can find it. My version can’t.

It's not that simple. My MDA is configured not to add the initial mbox
"From " line to files in maildirs.

d

Tomi Ollila-2

Re: notmuch ignoring alot of emails

In reply to this post by Alexei Gilchrist
On Sun, Mar 31 2019, Alexei Gilchrist wrote:

>>> When I run notmuch I get a bunch (hundreds) of emails that are
>>> ignored
>>> with:
>>>
>>> Note: Ignoring non-mail file: ...
>>>
>>> The files are valid maildir files but have a paragraph somewhere in
>>> the
>>> body where someone has written "From ".
>>>
>>
>> And do they also have a line starting with "From " as the first
>> line? This makes them mbox files. The second "From " makes them mbox
>> files with multiple messages. Notmuch thinks your MDA (the thing that
>> made those files) is misconfigured, assuming my guess about the format
>> is correct.
>
> Every message file begins with “From “. This is true of all messages
> downloaded by both offlineimap (with type = Maildir) and mbsync.
> neomutt has no issues dealing with these files as maildir and mu has no
> issues indexing them either. I’m assuming that starting with “From ”
> is part of the maildir spec.
>
> The problem occurs specifically with notmuch. If someone sends a message
> with a line that begins with “From “ in the *body* then it confuses
> notmuch.
>
> mu can correctly index these messages but my mu is linked against
> libgmime-2.6, my notmuch (0.28.3) is linked against libgmime-3.0.
>
>
>>> Is there a fix to force the recognition of maildir files in this
>>> case? I
>>> thought this was a solved problem with gmime since 2.6.7.
>>
>> There is not currently a way to do that. It's not a GMime problem,
>> it's
>> a design choice of notmuch to avoid parsing multiple-message
>> mboxes. That was originally added as a safety feature, and I think it
>> should probably stay the default. If someone wants to work on adding a
>> configuration switch I can point them in the right direction.
>
> This is a poor design decision. It means anyone on the internet can
> break your mail setup simply by sending a message with a line starting
> with “From “.
> (and using usual quoted-printable Content-Transfer-Encoding).

There are a few things to remember in notmuch development:

- notmuch is more of an evolution than intelligent design. It is hard to
  do any long-planned design when writing email software...

- we all welcome people doing SMOP with notmuch, and tolerate patches with
  good commit messages and elegant content.

- it may take some time to get changes reviewed...

In this particular case it would be nice if someone(tm) investigated how
mu and neomutt handle these emails, and how broken (if at all) they get
when given a large mbox file... was it so that both of those can
read mbox files
(which notmuch doesn't (but one can always use mboxvievfs! >;)))?

> Try it. Send yourself a message with the line “From bad parsing comes
> chaos” and see if your notmuch can find it. My version can’t.

My MDA (md5mda.sh) does not add 'From ' as beginning of first
line in my delivered emails (i.e. works similarly in this respect as
David's MDA).

Tomi

Tomas Nordin

Re: notmuch ignoring alot of emails

In reply to this post by Alexei Gilchrist
Alexei Gilchrist <[hidden email]> writes:

> Every message file begins with “From “. This is true of all messages
> downloaded by both offlineimap (with type = Maildir) and mbsync.
> neomutt has no issues dealing with these files as maildir and mu has no
> issues indexing them either. I’m assuming that starting with “From ”
> is part of the maildir spec.

FWIW, I use Offlineimap and the files retrieved with it here do not begin
with "From". I see things like "Received: from..." or "Return-Path:..."
as the beginning of the first line.

> Try it. Send yourself a message with the line “From bad parsing comes
> chaos” and see if your notmuch can find it. My version can’t.

I tried that and find messages as expected. I mean, the message I sent
and this thread.

Best regards
--
Tomas

Alexei Gilchrist

Re: notmuch ignoring alot of emails

That’s interesting. Do you know a link to the file spec for maildir
file content? All I can find is information about the directory
structure and file naming, not the file content.

mbsync which specialises in maildir also had an initial “From “ line
for me, and they are independently configured. I’ll try out a couple
of different mail hosts to see if it’s that.

I can imagine that mutt just assumes they are maildir files once
configured that way, but mu also assumes the files are maildir and also
uses gmime to parse. However, the current version on Homebrew (Mac) is
linked to a version of gmime which, I believe, was fixed to accommodate
multiple “From ” lines, though I haven’t dug through the source
yet.

Cheers,

Alexei

On 31 Mar 2019, at 22:00, Tomas Nordin wrote:

> Alexei Gilchrist <[hidden email]> writes:
>
>> Every message file begins with “From “. This is true of all
>> messages
>> downloaded by both offlineimap (with type = Maildir) and mbsync.
>> neomutt has no issues dealing with these files as maildir and mu has
>> no
>> issues indexing them either. I’m assuming that starting with “From ”
>> is part of the maildir spec.
>
> FWIW, I use Offlineimap and the files retrieved with it here do not
> begin
> with "From". I see things like "Received: from..." or
> "Return-Path:..."
> as the beginning of the first line.
>
>> Try it. Send yourself a message with the line “From bad parsing
>> comes
>> chaos” and see if your notmuch can find it. My version can’t.
>
> I tried that and find messages as expected. I mean, the message I sent
> and this thread.
>
> Best regards
> --
> Tomas

David Bremner-2

Re: notmuch ignoring alot of emails

"Alexei Gilchrist" <[hidden email]> writes:

> That’s interesting. Do you know a link to the file spec for maildir
> file content? All I can find is information about the directory
> structure and file naming, not the file content.

As far as I know, this is specified by RFC 5322.

> mbsync which specialises in maildir also had an initial “From “ line
> for me, and they are independently configured. I’ll try out a couple
> of different mail hosts to see if it’s that.

Yes, it could well be determined by how the messages are delivered on the
server.

> I can imagine that mutt just assumes they are maildir files once
> configured that way, but mu also assumes the files are maildir and also
> uses gmime to parse. However the current version on home-brew (Mac) is
> linked to a version of gmime which was fixed to accommodate multiple
> “From “ lines I believe, though I haven’t dug through the source
> yet.

As I mentioned above, it's not really related to the version of GMime,
it's about how GMime is called, and whether the client wishes to parse
mbox files containing more than one message. Or to ignore the "From "
line at the beginning.

Alvaro Herrera

Re: notmuch ignoring alot of emails

In reply to this post by Alexei Gilchrist
On 2019-Mar-23, Alexei Gilchrist wrote:

> When I run notmuch I get a bunch (hundreds) of emails that are ignored with:
>
> Note: Ignoring non-mail file: ...
>
> The files are valid maildir files but have a paragraph somewhere in the body
> where someone has written "From ".

Yeah, that happens too when you attach patches generated with git
format-patch as plain text; this is extremely common in the
[hidden email] mailing list (you can download an
mbox from there for any month, convert it to a maildir, and give the
resulting maildir to notmuch -- you'll likely find a few dozen emails
that fail parsing).  This is a very annoying problem for me, see
[hidden email] in this list earlier this year.

I worked around it by patching _notmuch_message_file_parse in
lib/message-file.c to set is_mbox = false unconditionally; but that's
not a real solution (and hence I didn't post as a patch here), and it
explodes real good if you have an actual mbox in the directory where the
mail is (since after the hack it won't skip it anymore).

I think a real solution is to parse the message header, look for the
Content-Length, and determine mbox-ness by looking for "From" only past
that many bytes; that seems to match what other mail parsing tools do.
However, I haven't gotten around to doing that.

--
Álvaro Herrera                            39°49'30"S 73°17'W
"La experiencia nos dice que el hombre peló millones de veces las patatas,
pero era forzoso admitir la posibilidad de que en un caso entre millones,
las patatas pelarían al hombre" (Ijon Tichy)

Alvaro Herrera

Re: notmuch ignoring alot of emails

On 2019-Jun-28, Alvaro Herrera wrote:

> I think a real solution is to parse the message header, look for the
> Content-Length, and determine mbox-ness by looking for "From" only past
> that many bytes; that seems to match what other mail parsing tools do.

Sorry, I misspoke: there's no such thing as Content-Length.
It's Content-Type/boundary that needs to be watched for.  Only consider
that the file is an mbox if a "^From " line appears after the boundary
end marker (which seems to be defined as "the boundary string followed
by two dashes --").
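
As a rough illustration of this idea only (it is not a patch for lib/message-file.c, and the function name is invented for the example), the check could look like the following Python sketch. It uses the standard library's `email.parser.BytesHeaderParser` to read the top-level boundary, then looks for "From " lines only after the closing delimiter. It assumes the message is multipart; for anything else it simply reports nothing.

```python
from email.parser import BytesHeaderParser


def from_line_after_final_boundary(data: bytes) -> bool:
    """Return True if a "From " line appears after the closing MIME boundary."""
    # Parse only the header block to find the top-level boundary parameter.
    headers = BytesHeaderParser().parsebytes(data)
    boundary = headers.get_boundary()
    if boundary is None:
        # Not multipart (or no boundary given): this heuristic cannot decide.
        return False
    # The closing delimiter is "--" + boundary + "--" on a line of its own.
    closing = b"--" + boundary.encode("ascii", "replace") + b"--"
    seen_closing = False
    for line in data.splitlines():
        if not seen_closing:
            seen_closing = line.rstrip() == closing
        elif line.startswith(b"From "):
            return True
    return False
```

A "From " line inside one of the MIME parts is ignored here, because it comes before the closing delimiter; only content after the end of the first message counts.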

Here's a sample message, BTW:
https://www.postgresql.org/message-id/raw/3ad5ba71-d200-96da-f903-7e3b16416140@...
(username "archives", password "antispam").

--
Álvaro Herrera       Valdivia, Chile

David Bremner-2

Re: notmuch ignoring alot of emails

Alvaro Herrera <[hidden email]> writes:

> On 2019-Jun-28, Alvaro Herrera wrote:
>
>> I think a real solution is to parse the message header, look for the
>> Content-Length, and determine mbox-ness by looking for "From" only past
>> that many bytes; that seems to match what other mail parsing tools do.
>
> Sorry, I misspoke: there's no such thing as Content-Length.
> It's Content-Type/boundary that needs to be watched for.  Only consider
> that the file is an mbox if a "^From " line appears after the boundary
> end marker (which seems to be defined as "the boundary string followed
> by two dashes --").
>
> Here's a sample message, BTW:
> https://www.postgresql.org/message-id/raw/3ad5ba71-d200-96da-f903-7e3b16416140@...
> (username "archives", password "antispam").

I'm not keen on writing (more) ad hoc MIME parsing code, so if you can
phrase this in terms of GMime API (or at least MIME parts) it would be
great.

d

David Bremner-2

Re: notmuch ignoring alot of emails

David Bremner <[hidden email]> writes:

> Alvaro Herrera <[hidden email]> writes:
>
>> On 2019-Jun-28, Alvaro Herrera wrote:
>>
>>> I think a real solution is to parse the message header, look for the
>>> Content-Length, and determine mbox-ness by looking for "From" only past
>>> that many bytes; that seems to match what other mail parsing tools do.
>>
>> Sorry, I misspoke: there's no such thing as Content-Length.
>> It's Content-Type/boundary that needs to be watched for.  Only consider
>> that the file is an mbox if a "^From " line appears after the boundary
>> end marker (which seems to be defined as "the boundary string followed
>> by two dashes --").
>>
>> Here's a sample message, BTW:
>> https://www.postgresql.org/message-id/raw/3ad5ba71-d200-96da-f903-7e3b16416140@...
>> (username "archives", password "antispam").
>
> I'm not keen on writing (more) ad hoc MIME parsing code, so if you can
> phrase this in terms of GMime API (or at least MIME parts) it would be
> great.
>
> d

On second thought, I guess it might not be practical to use GMime to parse
the file, since that might perform badly on large mboxes.

d

Tomi Ollila-2

Re: notmuch ignoring alot of emails

In reply to this post by Alvaro Herrera
On Fri, Jun 28 2019, Alvaro Herrera wrote:

> On 2019-Jun-28, Alvaro Herrera wrote:
>
>> I think a real solution is to parse the message header, look for the
>> Content-Length, and determine mbox-ness by looking for "From" only past
>> that many bytes; that seems to match what other mail parsing tools do.
>
> Sorry, I misspoke: there's no such thing as Content-Length.
> It's Content-Type/boundary that needs to be watched for.  Only consider
> that the file is an mbox if a "^From " line appears after the boundary
> end marker (which seems to be defined as "the boundary string followed
> by two dashes --").

Just checking for a line starting with 'From ' would be pretty naïve, since
"From" may be the first word of any line in a text body.

If we had to do content scanning, then at least an empty line before
the "From " line would be required, followed by lines like
Received: [hidden email]
Date: a date
From: someone

(and then an empty line... ;)

All this checking would be required and it could still fail (perhaps
the content gets modified on the fly, but then a signature check, if
the mail had one, could fail...)
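
A minimal Python sketch of that kind of content scanning, just to show its shape. The function name and the exact rule (blank line before, one header-like line after) are assumptions made for the example, and as noted above it can still be fooled by mail-like text inside a body.

```python
import re

# A header line: an RFC 5322 field name (printable ASCII except ":")
# followed by a colon.
HEADER_LINE = re.compile(rb"^[!-9;-~]+:")


def looks_like_mbox_separator(lines: list, i: int) -> bool:
    """Guess whether lines[i] is an mbox "From " separator rather than body text."""
    if not lines[i].startswith(b"From "):
        return False
    # Require an empty line before it (or the start of the file).
    if i > 0 and lines[i - 1].strip() != b"":
        return False
    # Require the next line to look like a message header.
    return i + 1 < len(lines) and HEADER_LINE.match(lines[i + 1]) is not None
```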

If there is a header that tells the length of the body, then things
could be easier...

Tomi

>
> --
> Álvaro Herrera       Valdivia, Chile

Alvaro Herrera

Re: notmuch ignoring alot of emails

In reply to this post by David Bremner-2
On 2019-Jun-29, David Bremner wrote:

> David Bremner <[hidden email]> writes:
>
> > Alvaro Herrera <[hidden email]> writes:

> >> It's Content-Type/boundary that needs to be watched for.  Only consider
> >> that the file is an mbox if a "^From " line appears after the boundary
> >> end marker (which seems to be defined as "the boundary string followed
> >> by two dashes --").

> > I'm not keen on writing (more) ad hoc MIME parsing code, so if you can
> > phrase this in terms of GMime API (or at least MIME parts) it would be
> > great.

Yeah, I was having a look at the GMime API last week to have a think
about how to do it with that.

> On second thought, I guess it might not be practical to use GMime to parse
> the file, since that might perform badly on large mboxes.

I think we only need to search for the first end boundary; if there's
anything beyond that, return is_mbox true.  So we only need to fully
process the first email, and we can stop searching at that point.

--
Álvaro Herrera                                http://www.twitter.com/alvherre
"Puedes vivir sólo una vez, pero si lo haces bien, una vez es suficiente"

Alvaro Herrera

Re: notmuch ignoring alot of emails

In reply to this post by Tomi Ollila-2
On 2019-Jun-30, Tomi Ollila wrote:

> Just checking line starting with 'From ' would be pretty naïve since
> From may be first word in any line in text body.

Even so, early mail systems relied on there not being any such lines,
and they escaped those lines to be ">From" or to use quoted-printable
encoding.  GMime has bespoke code to do this, in fact.  Mail systems
stopped doing this escaping after MIME boundaries got more widely used,
I suppose.
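
For reference, the ">From" escaping amounts to something like the Python sketch below. This is only an illustration with an invented function name, not GMime's code. The quoted-printable alternative would instead encode the leading "F" as "=46".

```python
def escape_from_lines(body: str) -> str:
    """Prefix body lines that start with "From " with ">" (mbox-style quoting)."""
    return "".join(
        # Only "From " with a trailing space is escaped; a "From:" line is left alone.
        ">" + line if line.startswith("From ") else line
        for line in body.splitlines(keepends=True)
    )
```

After this escaping, a body can no longer contain a line that a reader would mistake for an mbox separator.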

I think NNTP used content length much more extensively than email.  Of
course, NNTP has almost disappeared now ...

> If we had to do content scanning, then at least an empty line before
> the "From " line would be required, followed by lines like
> Received: [hidden email]
> Date: a date
> From: someone
>
> (and then an empty line... ;)
>
> All this checking would be required and it could still fail (perhaps
> the content gets modified on the fly, but then a signature check, if
> the mail had one, could fail...)

This logic still fails if you have mail-like content in the mail, such
as attachments produced by "git format-patch".  Many open source lists
don't have this problem because they use "git send-email" instead, but
this is not universal.

> If there is a header that tells the length of the body, then things
> could be easier...

Early emails had Content-Length as a header, but it was not universal,
and nowadays it seems to have been abandoned as a practice; the MIME
content boundary is used universally (or at least I cannot find any
recent divergence from this practice.)

--
Álvaro Herrera                                http://www.twitter.com/alvherre