[PATCH] test: add known broken test for indexing html

David Bremner-2

[PATCH] test: add known broken test for indexing html

'quite' on IRC reported that notmuch new was grinding to a halt during
initial indexing, and we eventually narrowed the problem down to some
html parts with large embedded images. These cause the number of terms
added to the Xapian database to explode (the first 400 messages
generated 4.6M unique terms), and of course the resulting terms are
not much use for searching.
---

I'm not sure of the best approach to fix this. Workarounds include
limiting the size of the part indexed, and skipping html parts. The
latter is easy, but probably too drastic.  A nice solution might be a
filter similar to the existing one that strips out uuencoded text but
for base64. Alas base64 crud seems to come with all kinds of syntactic
wrappers, so it's probably harder to filter.


 test/T680-html-indexing.sh       | 12 +++++++
 test/corpora/README              |  3 ++
 test/corpora/html/embedded-image | 69 ++++++++++++++++++++++++++++++++++++++++
 3 files changed, 84 insertions(+)
 create mode 100755 test/T680-html-indexing.sh
 create mode 100644 test/corpora/html/embedded-image

diff --git a/test/T680-html-indexing.sh b/test/T680-html-indexing.sh
new file mode 100755
index 00000000..78768c4f
--- /dev/null
+++ b/test/T680-html-indexing.sh
@@ -0,0 +1,12 @@
+#!/usr/bin/env bash
+test_description="indexing of html parts"
+. ./test-lib.sh || exit 1
+
+add_email_corpus html
+
+test_begin_subtest 'embedded images should not be indexed'
+test_subtest_known_broken
+notmuch search kwpza7svrgjzqwi8fhb2msggwtxtwgqcxp4wbqr4wjddstqmeqa7 > OUTPUT
+test_expect_equal_file /dev/null OUTPUT
+
+test_done
diff --git a/test/corpora/README b/test/corpora/README
index 77c48e6e..c9a35fed 100644
--- a/test/corpora/README
+++ b/test/corpora/README
@@ -9,3 +9,6 @@ default
 broken
   The broken corpus contains messages that are broken and/or RFC
   non-compliant, ensuring we deal with them in a sane way.
+
+html
+  The html corpus contains html parts
diff --git a/test/corpora/html/embedded-image b/test/corpora/html/embedded-image
new file mode 100644
index 00000000..40851530
--- /dev/null
+++ b/test/corpora/html/embedded-image
@@ -0,0 +1,69 @@
+From: =?utf-8?b?bWFsbW9ib3Jn?= <[hidden email]>
+To: =?utf-8?b?Ym9lbmRlLm1hbG1vYm9yZw==?= <[hidden email]>
+Date: Tue, 19 Jul 2016 11:54:24 +0200
+X-Feed2Imap-Version: 1.2.5
+Message-Id: <[hidden email]>
+Subject: =?utf-8?b?VGFjayBhbGxhIHRyYWZpa2FudGVyIG9jaCBmb3Rnw6RuZ2FyZSE=?=
+Content-Type: multipart/alternative; boundary="=-1468922508-176605-12427-9500-21-="
+MIME-Version: 1.0
+
+
+--=-1468922508-176605-12427-9500-21-=
+Content-Type: text/plain; charset=utf-8; format=flowed
+Content-Transfer-Encoding: 8bit
+
+<http://malmoborg.se/2016/07/tack-alla-trafikanter-och-fotgangare/>
+
+Malmö 2016-07-09
+
+I skrivande stund är vi i färd med att avetablera vår entreprenad på
+Tigern 3, Regementsgatan 6 i Malmö. Fastigheten har genomgått ett större
+dräneringsarbete som i sin tur har inneburit vissa
+trafikbegränsningar på Regementsgatan samt Davidshallsgatan under några
+veckors tid. Fastighetsägaren är mycket nöjd med vår arbetsinsats och vi
+kan glatt meddela att båda vägfilerna kommer att öppnas inom kort. Nu
+kommer den vackra fastigheten att klara sig torrskodd under många år
+framöver [A]
+
+
+
+[A] http://malmoborg.se/wp-includes/images/smilies/icon_smile.gif
+--
+Feed: Förvaltnings AB Malmöborg
+<http://malmoborg.se>
+Item: Tack alla trafikanter och fotgängare!
+<http://malmoborg.se/2016/07/tack-alla-trafikanter-och-fotgangare/>
+Date: 2016-07-19 11:54:24 +0200
+Author: malmoborg
+Filed under: Nyheter
+
+--=-1468922508-176605-12427-9500-21-=
+Content-Type: text/html; charset=utf-8
+Content-Transfer-Encoding: 8bit
+
+<table border="1" width="100%" cellpadding="0" cellspacing="0" borderspacing="0"><tr><td>
+<table width="100%" bgcolor="#EDEDED" cellpadding="4" cellspacing="2">
+<tr><td align="right"><b>Feed:</b></td>
+<td width="100%"><a href="http://malmoborg.se">
+<b>Förvaltnings AB Malmöborg</b>
+</a>
+</td></tr><tr><td align="right"><b>Item:</b></td>
+<td width="100%"><a href="http://malmoborg.se/2016/07/tack-alla-trafikanter-och-fotgangare/"><b>Tack alla trafikanter och fotgängare!</b>
+</a>
+</td></tr></table></td></tr></table>
+
+<p>Malmö 2016-07-09</p>
+<p>I skrivande stund är vi i färd med att avetablera vår entreprenad på Tigern 3, Regementsgatan 6 i Malmö. Fastigheten har genomgått ett större dräneringsarbete som i sin tur har inneburit vissa trafikbegränsningar på Regementsgatan samt Davidshallsgatan under några veckors tid. Fastighetsägaren är mycket nöjd med vår arbetsinsats och vi kan glatt meddela att båda vägfilerna kommer att öppnas inom kort. Nu kommer den vackra fastigheten att klara sig torrskodd under många år framöver <img src="
+xzMzM///6//lAAAAAAAAACH5BAEAAA4ALAAAAAAPAA8AAARb0EkZap3YVabO
+GRcWcAgCnIMRTEEnCCfwpqt2mHEOagoOnz+CKnADxoKFyiHHBBCSAdOiCVg8
+KwPZa7sVrgJZQWI8FhB2msGgwTXTWGqCXP4WBQr4wjDDstQmEQA7
+" alt=":-)" class="wp-smiley" /> </p>
+<p>&nbsp;</p>
+<hr width="100%"/>
+<table width="100%" cellpadding="0" cellspacing="0">
+<tr><td align="right"><font color="#ababab">Date:</font>&nbsp;&nbsp;</td><td><font color="#ababab">2016-07-19 11:54:24 +0200</font></td></tr>
+<tr><td align="right"><font color="#ababab">Author:</font>&nbsp;&nbsp;</td><td><font color="#ababab">malmoborg</font></td></tr>
+<tr><td align="right"><font color="#ababab">Filed under:</font>&nbsp;&nbsp;</td><td><font color="#ababab">Nyheter</font></td></tr>
+</table>
+
+--=-1468922508-176605-12427-9500-21-=--
--
2.11.0

_______________________________________________
notmuch mailing list
[hidden email]
https://notmuchmail.org/mailman/listinfo/notmuch
Jeffrey Stedfast-2

RE: [PATCH] test: add known broken test for indexing html

Hi David,

Base64 encoded inline image data is always within the src attribute value of an <img> tag and will always begin with "data:" followed by the mime-type and then followed by ";base64," so it's pretty easy to spot.
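
As a rough illustration of spotting that prefix, here is a minimal Python sketch (not notmuch or GMime code). The names DATA_URI_RE and strip_data_uris are hypothetical, and the pattern assumes the URI sits inside a quoted attribute value, so that the closing quote ends the match:

```python
import re

# Assumed shape: "data:" + MIME type + optional parameters + ";base64," + payload.
# The payload may be wrapped across several lines, so whitespace is allowed in it.
# Because whitespace is allowed, the match only stops at a non-base64 character
# such as the closing quote of the attribute value.
DATA_URI_RE = re.compile(
    r"data:[\w.+/-]*(?:;[\w.+-]+=[\w.+-]+)*;base64,[A-Za-z0-9+/=\s]*",
    re.IGNORECASE,
)


def strip_data_uris(text: str) -> str:
    """Remove base64 data: URIs so their payload never reaches the indexer."""
    return DATA_URI_RE.sub("", text)
```

Applied to an <img> tag, this leaves an empty src value and keeps the alt text. It does nothing for base64 that arrives without the "data:" prefix.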

While on this topic, why index HTML attribute values at all? Other than perhaps some known ones like perhaps the 'alt' value of <img> tags?

I would argue that the only portion of any HTML that you should be indexing at all for searching is the character data between tags.
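
To make that concrete, a small sketch using Python's standard html.parser module (an illustration only, not how notmuch indexes today). The class and function names are hypothetical; it keeps character data and <img> alt values, and skipping <script> and <style> content is an extra assumption:

```python
from html.parser import HTMLParser


class IndexableText(HTMLParser):
    """Collect character data and <img alt="..."> values; drop other attributes."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "img":
            alt = dict(attrs).get("alt")
            if alt:
                self.chunks.append(alt)

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style blocks.
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def indexable_text(html: str) -> str:
    """Return the text worth indexing from an HTML fragment."""
    parser = IndexableText()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)
```

Since the src attribute is never copied into the output, an embedded image contributes nothing beyond its alt text.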

Hope my $0.02 helps,

Jeff

David Bremner-2

RE: [PATCH] test: add known broken test for indexing html

Jeffrey Stedfast <[hidden email]> writes:

> Hi David,
>
> Base64 encoded inline image data is always within the src attribute value of an <img> tag and will always begin with "data:" followed by the mime-type and then followed by ";base64," so it's pretty easy to spot.
>
> While on this topic, why index HTML attribute values at all? Other than perhaps some known ones like perhaps the 'alt' value of <img> tags?
>
> I would argue that the only portion of any HTML that you should be indexing at all for searching is the character data between tags.

We're not currently parsing the HTML, so none of these distinctions are
really available to us. Maybe adding an HTML parser is the right
solution, but it's a bit non-trivial.

d
David Bremner-2

RE: [PATCH] test: add known broken test for indexing html

In reply to this post by Jeffrey Stedfast-2
Jeffrey Stedfast <[hidden email]> writes:

> Base64 encoded inline image data is always within the src attribute
> value of an <img> tag and will always begin with "data:" followed by
> the mime-type and then followed by ";base64," so it's pretty easy to
> spot.
>
> While on this topic, why index HTML attribute values at all? Other
> than perhaps some known ones like perhaps the 'alt' value of <img>
> tags?
>
> I would argue that the only portion of any HTML that you should be
> indexing at all for searching is the character data between tags.
>

I should mention that we also have a fair amount of base64 gunk from
inline PGP signatures. I'm not sure whether it's just ugly to look at
when dumping the database terms, or whether it actually makes a
measurable difference in time/space usage.
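
If it did turn out to matter, a line-based filter along the lines of the uuencode one might be enough. A minimal Python sketch, assuming the signature is delimited by the standard ASCII-armor marker lines (the function name is hypothetical):

```python
def strip_pgp_signatures(text: str) -> str:
    """Drop inline PGP signature blocks, keeping all other lines unchanged."""
    begin = "-----BEGIN PGP SIGNATURE-----"
    end = "-----END PGP SIGNATURE-----"
    kept = []
    in_signature = False
    for line in text.splitlines():
        stripped = line.strip()
        if stripped == begin:
            in_signature = True
        elif stripped == end and in_signature:
            in_signature = False
        elif not in_signature:
            kept.append(line)
    return "\n".join(kept)
```

An unterminated block swallows the rest of the text, which seems acceptable for indexing but is a choice made for this sketch.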

d
Jeffrey Stedfast-2

RE: [PATCH] test: add known broken test for indexing html

In reply to this post by David Bremner-2
Hey David,

I actually have an HTML tokenizer for MimeKit for (among other things) this type of purpose. Perhaps I need to port that to C and include that with GMime 😊

https://github.com/jstedfast/MimeKit/tree/master/MimeKit/Text

Jeff

David Bremner-2

RE: [PATCH] test: add known broken test for indexing html

Jeffrey Stedfast <[hidden email]> writes:

> Hey David,
>
> I actually have an HTML tokenizer for MimeKit for (among other things) this type of purpose. Perhaps I need to port that to C and include that with GMime 😊
>
> https://github.com/jstedfast/MimeKit/tree/master/MimeKit/Text
>
> Jeff

That's probably a good idea in your abundant spare time ;).  More
generally though we've thought about letting users provide filters to
convert attachments (e.g. .odt / .docx / pdf) to text. I'm not sure
about the performance hit, but I guess that would work for html as well.
I guess in principle it should be possible to write a GMime filter that
manages the child process.
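
Roughly, the child-process part could look like the following Python sketch. This is not the GMime filter API; run_text_filter and its parameters are hypothetical, and the command is whatever converter the user configures:

```python
import subprocess
import sys


def run_text_filter(command, data: bytes, timeout: float = 10.0) -> str:
    """Pipe raw part content through a user-supplied command.

    Returns the command's stdout decoded as text, or "" if the command is
    missing, fails, or times out, so that indexing can carry on regardless.
    """
    try:
        proc = subprocess.run(
            command,
            input=data,
            stdout=subprocess.PIPE,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
            check=True,
        )
    except (OSError, subprocess.SubprocessError):
        return ""
    return proc.stdout.decode("utf-8", errors="replace")


if __name__ == "__main__":
    # Stand-in "converter" that upper-cases its input, just to show the plumbing.
    demo = [sys.executable, "-c",
            "import sys; sys.stdout.write(sys.stdin.read().upper())"]
    print(run_text_filter(demo, b"hello"))
```

Spawning one process per part is where the performance hit would come from; the timeout at least bounds the cost of a misbehaving converter.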

d
Jeffrey Stedfast-2

RE: [PATCH] test: add known broken test for indexing html


Hah, yea... it'll probably be a while. I need to focus on GMime 3.0 first. Once I get that squared away, I can look at porting other handy features back from MimeKit 😊

Jeff
