Drop HTML tags when indexing

David Bremner-2

Drop HTML tags when indexing

Steven Allen pointed out [2] that the previous scanner [1] was a
little too simplistic. This version handles (or claims to) quoted
strings in attributes, which can apparently contain '>' and '<'
characters. This required generalizing the state machine runner a bit
[3] to handle states with out-degree greater than two.
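
To make the table walk easier to follow outside of GMime, here is a
minimal standalone sketch. It is not the code in lib/index.cc: it
assumes NUL-terminated C strings instead of filter buffers, and the
names strip_tags and html_states are made up for the illustration.
The table rows and the inner loop are copied from the interdiff [3].

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    int state;              /* label of the real state this row belongs to */
    int a;                  /* low end of the matching character range */
    int b;                  /* high end of the matching character range */
    int next_if_match;      /* row to go to when the character is in [a, b] */
    int next_if_not_match;  /* row to go to otherwise */
} scanner_state_t;

/* Rows 2 and 3 carry label 1: they are extra rules for state 1, not
 * states of their own. */
static const scanner_state_t html_states[] = {
    {0, '<',  '<',  1, 0},
    {1, '\'', '\'', 4, 2},  /* scanning for quote or > */
    {1, '"',  '"',  5, 3},
    {1, '>',  '>',  0, 1},
    {4, '\'', '\'', 1, 4},  /* inside single quotes */
    {5, '"',  '"',  1, 5},  /* inside double quotes */
};

/* Hypothetical helper: copy the non-tag text of `in` to `out`, which
 * must have room for strlen (in) + 1 bytes. Returns bytes written. */
static size_t
strip_tags (const char *in, char *out)
{
    int state = 0;
    size_t n = 0;

    assert (in != NULL && out != NULL);
    for (; *in != '\0'; in++) {
        int current = state, next;

        /* Keep trying rules on the same character until we land on a
         * row whose index equals its own label, i.e. a real state. */
        do {
            const scanner_state_t *row = &html_states[current];
            if (*in >= row->a && *in <= row->b)
                next = row->next_if_match;
            else
                next = row->next_if_not_match;
            current = next;
        } while (next != html_states[next].state);

        /* Only state 0 (outside any tag) emits characters. */
        if (state < 1)
            out[n++] = *in;
        state = next;
    }
    out[n] = '\0';
    return n;
}
```

Calling strip_tags on `<input value="a>swordfish">hunter2` leaves
`<hunter2` in the output buffer. In this sketch the opening '<' is
copied through, because the keep/drop decision looks at the state
before the transition; the '>' inside the quoted attribute no longer
ends the tag.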


[1]: id:[hidden email]
[2]: id:[hidden email]
[3]:
diff --git a/lib/index.cc b/lib/index.cc
index 03223f7d..324e6e79 100644
--- a/lib/index.cc
+++ b/lib/index.cc
@@ -122,23 +122,25 @@ do_filter (const scanner_state_t states[],
     register const char *inptr = inbuf;
     const char *inend = inbuf + inlen;
     char *outptr;
-    int next;
+    int next, current;
     (void) prespace;
 
 
     g_mime_filter_set_size (gmime_filter, inlen, FALSE);
     outptr = gmime_filter->outbuf;
 
+    current = filter->state;
     while (inptr < inend) {
- if (*inptr >= states[filter->state].a &&
-    *inptr <= states[filter->state].b)
- {
-    next = states[filter->state].next_if_match;
- }
- else
- {
-    next = states[filter->state].next_if_not_match;
- }
+ /* do "fake transitions" until we fire a rule, or run out of rules */
+ do {
+    if (*inptr >= states[current].a && *inptr <= states[current].b)  {
+ next = states[current].next_if_match;
+    } else  {
+ next = states[current].next_if_not_match;
+    }
+
+    current = next;
+ } while (next != states[next].state);
 
  if (filter->state < first_skipping_state)
     *outptr++ = *inptr;
@@ -209,7 +211,11 @@ filter_filter_html (GMimeFilter *gmime_filter, char *inbuf, size_t inlen, size_t
 {
     static const scanner_state_t states[] = {
  {0,  '<',  '<',  1,  0},
+ {1,  '\'', '\'', 4,  2},  /* scanning for quote or > */
+ {1,  '"',  '"',  5,  3},
  {1,  '>',  '>',  0,  1},
+ {4,  '\'', '\'', 1,  4},  /* inside single quotes */
+ {5,  '"', '"',   1,  5},  /* inside double quotes */
     };
     do_filter(states, 1,
       gmime_filter, inbuf, inlen, prespace, outbuf, outlen, outprespace);
diff --git a/test/T680-html-indexing.sh b/test/T680-html-indexing.sh
index ee69209c..74f33708 100755
--- a/test/T680-html-indexing.sh
+++ b/test/T680-html-indexing.sh
@@ -8,4 +8,15 @@ test_begin_subtest 'embedded images should not be indexed'
 notmuch search kwpza7svrgjzqwi8fhb2msggwtxtwgqcxp4wbqr4wjddstqmeqa7 > OUTPUT
 test_expect_equal_file /dev/null OUTPUT
 
+test_begin_subtest 'ignore > in attribute text'
+notmuch search swordfish | notmuch_search_sanitize > OUTPUT
+test_expect_equal_file /dev/null OUTPUT
+
+test_begin_subtest 'non tag text should be indexed'
+notmuch search hunter2 | notmuch_search_sanitize > OUTPUT
+cat <<EOF > EXPECTED
+thread:XXX   2009-11-17 [1/1] David Bremner; test html attachment (inbox unread)
+EOF
+test_expect_equal_file EXPECTED OUTPUT
+
 test_done
diff --git a/test/corpora/html/attribute-text b/test/corpora/html/attribute-text
new file mode 100644
index 00000000..6dae8194
--- /dev/null
+++ b/test/corpora/html/attribute-text
@@ -0,0 +1,15 @@
+From: David Bremner <[hidden email]>
+To: David Bremner <[hidden email]>
+Subject: test html attachment
+Date: Tue, 17 Nov 2009 21:28:38 +0600
+Message-ID: <[hidden email]>
+MIME-Version: 1.0
+Content-Type: text/html
+Content-Disposition: inline; filename=test.html
+
+<html>
+  <body>
+    <input value="a>swordfish">
+  </body>
+  hunter2
+</html>

_______________________________________________
notmuch mailing list
[hidden email]
https://notmuchmail.org/mailman/listinfo/notmuch
David Bremner-2

[PATCH 1/7] test: add known broken test for indexing html

'quite' on IRC reported that notmuch new was grinding to a halt during
initial indexing, and we eventually narrowed the problem down to some
html parts with large embedded images. These cause the number of terms
added to the Xapian database to explode (the first 400 messages
generated 4.6M unique terms), and of course the resulting terms are
not much use for searching.

The second test is a sanity check for any "improved" indexing of HTML.
---
 test/T680-html-indexing.sh       | 19 +++++++++++
 test/corpora/README              |  3 ++
 test/corpora/html/attribute-text | 15 +++++++++
 test/corpora/html/embedded-image | 69 ++++++++++++++++++++++++++++++++++++++++
 4 files changed, 106 insertions(+)
 create mode 100755 test/T680-html-indexing.sh
 create mode 100644 test/corpora/html/attribute-text
 create mode 100644 test/corpora/html/embedded-image

diff --git a/test/T680-html-indexing.sh b/test/T680-html-indexing.sh
new file mode 100755
index 00000000..5e9cc4cb
--- /dev/null
+++ b/test/T680-html-indexing.sh
@@ -0,0 +1,19 @@
+#!/usr/bin/env bash
+test_description="indexing of html parts"
+. ./test-lib.sh || exit 1
+
+add_email_corpus html
+
+test_begin_subtest 'embedded images should not be indexed'
+test_subtest_known_broken
+notmuch search kwpza7svrgjzqwi8fhb2msggwtxtwgqcxp4wbqr4wjddstqmeqa7 > OUTPUT
+test_expect_equal_file /dev/null OUTPUT
+
+test_begin_subtest 'non tag text should be indexed'
+notmuch search hunter2 | notmuch_search_sanitize > OUTPUT
+cat <<EOF > EXPECTED
+thread:XXX   2009-11-17 [1/1] David Bremner; test html attachment (inbox unread)
+EOF
+test_expect_equal_file EXPECTED OUTPUT
+
+test_done
diff --git a/test/corpora/README b/test/corpora/README
index 77c48e6e..c9a35fed 100644
--- a/test/corpora/README
+++ b/test/corpora/README
@@ -9,3 +9,6 @@ default
 broken
   The broken corpus contains messages that are broken and/or RFC
   non-compliant, ensuring we deal with them in a sane way.
+
+html
+  The html corpus contains html parts
diff --git a/test/corpora/html/attribute-text b/test/corpora/html/attribute-text
new file mode 100644
index 00000000..6dae8194
--- /dev/null
+++ b/test/corpora/html/attribute-text
@@ -0,0 +1,15 @@
+From: David Bremner <[hidden email]>
+To: David Bremner <[hidden email]>
+Subject: test html attachment
+Date: Tue, 17 Nov 2009 21:28:38 +0600
+Message-ID: <[hidden email]>
+MIME-Version: 1.0
+Content-Type: text/html
+Content-Disposition: inline; filename=test.html
+
+<html>
+  <body>
+    <input value="a>swordfish">
+  </body>
+  hunter2
+</html>
diff --git a/test/corpora/html/embedded-image b/test/corpora/html/embedded-image
new file mode 100644
index 00000000..40851530
--- /dev/null
+++ b/test/corpora/html/embedded-image
@@ -0,0 +1,69 @@
+From: =?utf-8?b?bWFsbW9ib3Jn?= <[hidden email]>
+To: =?utf-8?b?Ym9lbmRlLm1hbG1vYm9yZw==?= <[hidden email]>
+Date: Tue, 19 Jul 2016 11:54:24 +0200
+X-Feed2Imap-Version: 1.2.5
+Message-Id: <[hidden email]>
+Subject: =?utf-8?b?VGFjayBhbGxhIHRyYWZpa2FudGVyIG9jaCBmb3Rnw6RuZ2FyZSE=?=
+Content-Type: multipart/alternative; boundary="=-1468922508-176605-12427-9500-21-="
+MIME-Version: 1.0
+
+
+--=-1468922508-176605-12427-9500-21-=
+Content-Type: text/plain; charset=utf-8; format=flowed
+Content-Transfer-Encoding: 8bit
+
+<http://malmoborg.se/2016/07/tack-alla-trafikanter-och-fotgangare/>
+
+Malmö 2016-07-09
+
+I skrivande stund är vi i färd med att avetablera vår entreprenad på
+Tigern 3, Regementsgatan 6 i Malmö. Fastigheten har genomgått ett större
+dräneringsarbete som i sin tur har inneburit vissa
+trafikbegränsningar på Regementsgatan samt Davidshallsgatan under några
+veckors tid. Fastighetsägaren är mycket nöjd med vår arbetsinsats och vi
+kan glatt meddela att båda vägfilerna kommer att öppnas inom kort. Nu
+kommer den vackra fastigheten att klara sig torrskodd under många år
+framöver [A]
+

+
+[A] http://malmoborg.se/wp-includes/images/smilies/icon_smile.gif
+--
+Feed: Förvaltnings AB Malmöborg
+<http://malmoborg.se>
+Item: Tack alla trafikanter och fotgängare!
+<http://malmoborg.se/2016/07/tack-alla-trafikanter-och-fotgangare/>
+Date: 2016-07-19 11:54:24 +0200
+Author: malmoborg
+Filed under: Nyheter
+
+--=-1468922508-176605-12427-9500-21-=
+Content-Type: text/html; charset=utf-8
+Content-Transfer-Encoding: 8bit
+
+<table border="1" width="100%" cellpadding="0" cellspacing="0" borderspacing="0"><tr><td>
+<table width="100%" bgcolor="#EDEDED" cellpadding="4" cellspacing="2">
+<tr><td align="right"><b>Feed:</b></td>
+<td width="100%"><a href="http://malmoborg.se">
+<b>Förvaltnings AB Malmöborg</b>
+</a>
+</td></tr><tr><td align="right"><b>Item:</b></td>
+<td width="100%"><a href="http://malmoborg.se/2016/07/tack-alla-trafikanter-och-fotgangare/"><b>Tack alla trafikanter och fotgängare!</b>
+</a>
+</td></tr></table></td></tr></table>
+
+<p>Malmö 2016-07-09</p>
+<p>I skrivande stund är vi i färd med att avetablera vår entreprenad på Tigern 3, Regementsgatan 6 i Malmö. Fastigheten har genomgått ett större dräneringsarbete som i sin tur har inneburit vissa trafikbegränsningar på Regementsgatan samt Davidshallsgatan under några veckors tid. Fastighetsägaren är mycket nöjd med vår arbetsinsats och vi kan glatt meddela att båda vägfilerna kommer att öppnas inom kort. Nu kommer den vackra fastigheten att klara sig torrskodd under många år framöver <img src="data:image/gif;base64,R0lGODlhDwAPALMOAP/qAEVFRQAAAP/OAP/JAP+0AP6dAP/+k//9E///////
+xzMzM///6//lAAAAAAAAACH5BAEAAA4ALAAAAAAPAA8AAARb0EkZap3YVabO
+GRcWcAgCnIMRTEEnCCfwpqt2mHEOagoOnz+CKnADxoKFyiHHBBCSAdOiCVg8
+KwPZa7sVrgJZQWI8FhB2msGgwTXTWGqCXP4WBQr4wjDDstQmEQA7
+" alt=":-)" class="wp-smiley" /> </p>
+<p>&nbsp;</p>
+<hr width="100%"/>
+<table width="100%" cellpadding="0" cellspacing="0">
+<tr><td align="right"><font color="#ababab">Date:</font>&nbsp;&nbsp;</td><td><font color="#ababab">2016-07-19 11:54:24 +0200</font></td></tr>
+<tr><td align="right"><font color="#ababab">Author:</font>&nbsp;&nbsp;</td><td><font color="#ababab">malmoborg</font></td></tr>
+<tr><td align="right"><font color="#ababab">Filed under:</font>&nbsp;&nbsp;</td><td><font color="#ababab">Nyheter</font></td></tr>
+</table>
+
+--=-1468922508-176605-12427-9500-21-=--
--
2.11.0

David Bremner-2

[PATCH 2/7] lib: add content type argument to uuencode filter.

In reply to this post by David Bremner-2
The idea is to support more general types of filtering, based on
content type.
---
 lib/index.cc | 13 ++++++++-----
 1 file changed, 8 insertions(+), 5 deletions(-)

diff --git a/lib/index.cc b/lib/index.cc
index 8c145540..1c04cc3d 100644
--- a/lib/index.cc
+++ b/lib/index.cc
@@ -56,6 +56,7 @@ typedef struct _NotmuchFilterDiscardUuencodeClass NotmuchFilterDiscardUuencodeCl
  **/
 struct _NotmuchFilterDiscardUuencode {
     GMimeFilter parent_object;
+    GMimeContentType *content_type;
     int state;
 };
 
@@ -63,7 +64,7 @@ struct _NotmuchFilterDiscardUuencodeClass {
     GMimeFilterClass parent_class;
 };
 
-static GMimeFilter *notmuch_filter_discard_uuencode_new (void);
+static GMimeFilter *notmuch_filter_discard_uuencode_new (GMimeContentType *content);
 
 static void notmuch_filter_discard_uuencode_finalize (GObject *object);
 
@@ -102,8 +103,9 @@ notmuch_filter_discard_uuencode_finalize (GObject *object)
 static GMimeFilter *
 filter_copy (GMimeFilter *gmime_filter)
 {
-    (void) gmime_filter;
-    return notmuch_filter_discard_uuencode_new ();
+    NotmuchFilterDiscardUuencode *filter = (NotmuchFilterDiscardUuencode *) gmime_filter;
+
+    return notmuch_filter_discard_uuencode_new (filter->content_type);
 }
 
 static void
@@ -196,7 +198,7 @@ filter_reset (GMimeFilter *gmime_filter)
  * Returns: a new #NotmuchFilterDiscardUuencode filter.
  **/
 static GMimeFilter *
-notmuch_filter_discard_uuencode_new (void)
+notmuch_filter_discard_uuencode_new (GMimeContentType *content_type)
 {
     static GType type = 0;
     NotmuchFilterDiscardUuencode *filter;
@@ -220,6 +222,7 @@ notmuch_filter_discard_uuencode_new (void)
 
     filter = (NotmuchFilterDiscardUuencode *) g_object_newv (type, 0, NULL);
     filter->state = 0;
+    filter->content_type = content_type;
 
     return (GMimeFilter *) filter;
 }
@@ -396,7 +399,7 @@ _index_mime_part (notmuch_message_t *message,
     g_mime_stream_mem_set_owner (GMIME_STREAM_MEM (stream), FALSE);
 
     filter = g_mime_stream_filter_new (stream);
-    discard_uuencode_filter = notmuch_filter_discard_uuencode_new ();
+    discard_uuencode_filter = notmuch_filter_discard_uuencode_new (content_type);
 
     g_mime_stream_filter_add (GMIME_STREAM_FILTER (filter),
       discard_uuencode_filter);
--
2.11.0

David Bremner-2

[PATCH 3/7] lib/index: Add another layer of indirection in filtering

In reply to this post by David Bremner-2
We could add a second gmime filter subclass, but prefer to avoid
duplicating the boilerplate.
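
The shape of this indirection can be sketched without GMime. The
following is an illustration under simplifying assumptions, not the
patch itself: only the names filter_fun and real_filter come from the
patch, the function signature is reduced to plain buffers, and
text_filter, copy_all, drop_digits and run_filter are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for the patch's filter_fun, which takes GMime
 * buffer arguments instead. */
typedef size_t (*filter_fun) (const char *in, size_t len, char *out);

typedef struct {
    filter_fun real_filter;  /* chosen once, when the filter is created */
} text_filter;

/* Two interchangeable filter bodies (both hypothetical). */
static size_t
copy_all (const char *in, size_t len, char *out)
{
    for (size_t i = 0; i < len; i++)
        out[i] = in[i];
    return len;
}

static size_t
drop_digits (const char *in, size_t len, char *out)
{
    size_t n = 0;

    for (size_t i = 0; i < len; i++)
        if (in[i] < '0' || in[i] > '9')
            out[n++] = in[i];
    return n;
}

/* The single registered entry point only forwards to whichever body
 * was stored in the object, like filter_filter in the patch. */
static size_t
run_filter (const text_filter *f, const char *in, size_t len, char *out)
{
    assert (f != NULL && f->real_filter != NULL);
    return f->real_filter (in, len, out);
}
```

With one object type and one registered callback, adding a new kind
of filtering means writing another function with the same signature
and storing it in real_filter; the class boilerplate is not repeated.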
---
 lib/index.cc | 14 ++++++++++++--
 1 file changed, 12 insertions(+), 2 deletions(-)

diff --git a/lib/index.cc b/lib/index.cc
index 1c04cc3d..74a750b9 100644
--- a/lib/index.cc
+++ b/lib/index.cc
@@ -29,6 +29,8 @@
 typedef struct _NotmuchFilterDiscardUuencode NotmuchFilterDiscardUuencode;
 typedef struct _NotmuchFilterDiscardUuencodeClass NotmuchFilterDiscardUuencodeClass;
 
+typedef void (*filter_fun) (GMimeFilter *filter, char *in, size_t len, size_t prespace,
+    char **out, size_t *outlen, size_t *outprespace);
 /**
  * NotmuchFilterDiscardUuencode:
  *
@@ -57,6 +59,7 @@ typedef struct _NotmuchFilterDiscardUuencodeClass NotmuchFilterDiscardUuencodeCl
 struct _NotmuchFilterDiscardUuencode {
     GMimeFilter parent_object;
     GMimeContentType *content_type;
+    filter_fun real_filter;
     int state;
 };
 
@@ -110,7 +113,14 @@ filter_copy (GMimeFilter *gmime_filter)
 
 static void
 filter_filter (GMimeFilter *gmime_filter, char *inbuf, size_t inlen, size_t prespace,
-       char **outbuf, size_t *outlen, size_t *outprespace)
+       char **outbuf, size_t *outlen, size_t *outprespace) {
+    NotmuchFilterDiscardUuencode *filter = (NotmuchFilterDiscardUuencode *) gmime_filter;
+    (*filter->real_filter)(gmime_filter, inbuf, inlen, prespace, outbuf, outlen, outprespace);
+}
+
+static void
+filter_filter_uuencode (GMimeFilter *gmime_filter, char *inbuf, size_t inlen, size_t prespace,
+ char **outbuf, size_t *outlen, size_t *outprespace)
 {
     NotmuchFilterDiscardUuencode *filter = (NotmuchFilterDiscardUuencode *) gmime_filter;
     register const char *inptr = inbuf;
@@ -223,7 +233,7 @@ notmuch_filter_discard_uuencode_new (GMimeContentType *content_type)
     filter = (NotmuchFilterDiscardUuencode *) g_object_newv (type, 0, NULL);
     filter->state = 0;
     filter->content_type = content_type;
-
+    filter->real_filter = filter_filter_uuencode;
     return (GMimeFilter *) filter;
 }
 
--
2.11.0

David Bremner-2

[PATCH 4/7] lib/index: separate state table definition from scanner.

In reply to this post by David Bremner-2
We want to reuse the scanner definition with a different table.
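
As a rough standalone sketch of what the split buys: the runner takes
the table and the first skipping state as parameters, so another
table can be passed to the same loop. This assumes NUL-terminated
strings rather than GMime buffers; scan and uuencode_states are
hypothetical names, while the struct and the table rows are the ones
in the patch.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    int state;
    int a;
    int b;
    int next_if_match;
    int next_if_not_match;
} scanner_state_t;

/* The uuencode table from the patch: states 11 and 12 discard data. */
static const scanner_state_t uuencode_states[] = {
    {0,  'b',  'b',  1,  0},
    {1,  'e',  'e',  2,  0},
    {2,  'g',  'g',  3,  0},
    {3,  'i',  'i',  4,  0},
    {4,  'n',  'n',  5,  0},
    {5,  ' ',  ' ',  6,  0},
    {6,  '0',  '7',  7,  0},
    {7,  '0',  '7',  8,  0},
    {8,  '0',  '7',  9,  0},
    {9,  ' ',  ' ',  10, 0},
    {10, '\n', '\n', 11, 10},
    {11, 'M',  'M',  12, 0},
    {12, ' ',  '`',  12, 11}
};

/* Hypothetical table-agnostic runner: copy `in` to `out`, dropping
 * every character read while in a state >= first_skipping_state.
 * `out` must have room for strlen (in) + 1 bytes. */
static size_t
scan (const scanner_state_t states[], int first_skipping_state,
      const char *in, char *out)
{
    int state = 0;
    size_t n = 0;

    assert (states != NULL && in != NULL && out != NULL);
    for (; *in != '\0'; in++) {
        int next;

        if (*in >= states[state].a && *in <= states[state].b)
            next = states[state].next_if_match;
        else
            next = states[state].next_if_not_match;

        /* The keep/drop test uses the state we were in before this
         * character, not the one we are moving to. */
        if (state < first_skipping_state)
            out[n++] = *in;
        state = next;
    }
    out[n] = '\0';
    return n;
}
```

Running scan (uuencode_states, 11, ...) over the text
"begin 644 f\nMABC\nend\n" keeps the "begin" line, drops the "MABC"
line, and keeps "nd\n". In this sketch the 'e' of "end" is swallowed
as well, since it is read while still in state 11.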
---
 lib/index.cc | 81 +++++++++++++++++++++++++++++++++++-------------------------
 1 file changed, 47 insertions(+), 34 deletions(-)

diff --git a/lib/index.cc b/lib/index.cc
index 74a750b9..02b35b81 100644
--- a/lib/index.cc
+++ b/lib/index.cc
@@ -31,6 +31,15 @@ typedef struct _NotmuchFilterDiscardUuencodeClass NotmuchFilterDiscardUuencodeCl
 
 typedef void (*filter_fun) (GMimeFilter *filter, char *in, size_t len, size_t prespace,
     char **out, size_t *outlen, size_t *outprespace);
+
+typedef struct {
+    int state;
+    int a;
+    int b;
+    int next_if_match;
+    int next_if_not_match;
+} scanner_state_t;
+
 /**
  * NotmuchFilterDiscardUuencode:
  *
@@ -119,46 +128,18 @@ filter_filter (GMimeFilter *gmime_filter, char *inbuf, size_t inlen, size_t pres
 }
 
 static void
-filter_filter_uuencode (GMimeFilter *gmime_filter, char *inbuf, size_t inlen, size_t prespace,
- char **outbuf, size_t *outlen, size_t *outprespace)
+do_filter (const scanner_state_t states[],
+   int first_skipping_state,
+   GMimeFilter *gmime_filter, char *inbuf, size_t inlen, size_t prespace,
+   char **outbuf, size_t *outlen, size_t *outprespace)
 {
     NotmuchFilterDiscardUuencode *filter = (NotmuchFilterDiscardUuencode *) gmime_filter;
     register const char *inptr = inbuf;
     const char *inend = inbuf + inlen;
     char *outptr;
-
+    int next;
     (void) prespace;
 
-    /* Simple, linear state-transition diagram for our filter.
-     *
-     * If the character being processed is within the range of [a, b]
-     * for the current state then we transition next_if_match
-     * state. If not, we transition to the next_if_not_match state.
-     *
-     * The final two states are special in that they are the states in
-     * which we discard data. */
-    static const struct {
- int state;
- int a;
- int b;
- int next_if_match;
- int next_if_not_match;
-    } states[] = {
- {0,  'b',  'b',  1,  0},
- {1,  'e',  'e',  2,  0},
- {2,  'g',  'g',  3,  0},
- {3,  'i',  'i',  4,  0},
- {4,  'n',  'n',  5,  0},
- {5,  ' ',  ' ',  6,  0},
- {6,  '0',  '7',  7,  0},
- {7,  '0',  '7',  8,  0},
- {8,  '0',  '7',  9,  0},
- {9,  ' ',  ' ',  10, 0},
- {10, '\n', '\n', 11, 10},
- {11, 'M',  'M',  12, 0},
- {12, ' ',  '`',  12, 11}
-    };
-    int next;
 
     g_mime_filter_set_size (gmime_filter, inlen, FALSE);
     outptr = gmime_filter->outbuf;
@@ -174,7 +155,7 @@ filter_filter_uuencode (GMimeFilter *gmime_filter, char *inbuf, size_t inlen, si
     next = states[filter->state].next_if_not_match;
  }
 
- if (filter->state < 11)
+ if (filter->state < first_skipping_state)
     *outptr++ = *inptr;
 
  filter->state = next;
@@ -187,6 +168,38 @@ filter_filter_uuencode (GMimeFilter *gmime_filter, char *inbuf, size_t inlen, si
 }
 
 static void
+filter_filter_uuencode (GMimeFilter *gmime_filter, char *inbuf, size_t inlen, size_t prespace,
+ char **outbuf, size_t *outlen, size_t *outprespace)
+{
+    /* Simple, linear state-transition diagram for our filter.
+     *
+     * If the character being processed is within the range of [a, b]
+     * for the current state then we transition next_if_match
+     * state. If not, we transition to the next_if_not_match state.
+     *
+     * The final two states are special in that they are the states in
+     * which we discard data. */
+    static const scanner_state_t states[] = {
+ {0,  'b',  'b',  1,  0},
+ {1,  'e',  'e',  2,  0},
+ {2,  'g',  'g',  3,  0},
+ {3,  'i',  'i',  4,  0},
+ {4,  'n',  'n',  5,  0},
+ {5,  ' ',  ' ',  6,  0},
+ {6,  '0',  '7',  7,  0},
+ {7,  '0',  '7',  8,  0},
+ {8,  '0',  '7',  9,  0},
+ {9,  ' ',  ' ',  10, 0},
+ {10, '\n', '\n', 11, 10},
+ {11, 'M',  'M',  12, 0},
+ {12, ' ',  '`',  12, 11}
+    };
+
+    do_filter(states, 11,
+      gmime_filter, inbuf, inlen, prespace, outbuf, outlen, outprespace);
+}
+
+static void
 filter_complete (GMimeFilter *filter, char *inbuf, size_t inlen, size_t prespace,
  char **outbuf, size_t *outlen, size_t *outprespace)
 {
--
2.11.0

David Bremner-2

[PATCH 5/7] lib/index: generalize filter name

In reply to this post by David Bremner-2
We can't very well call it uuencode if it is going to filter other
things as well.
---
 lib/index.cc | 92 +++++++++++++++++++++++++++++++-----------------------------
 1 file changed, 48 insertions(+), 44 deletions(-)

diff --git a/lib/index.cc b/lib/index.cc
index 02b35b81..3bb1ac1c 100644
--- a/lib/index.cc
+++ b/lib/index.cc
@@ -26,8 +26,8 @@
 
 /* Oh, how I wish that gobject didn't require so much noisy boilerplate!
  * (Though I have at least eliminated some of the stock set...) */
-typedef struct _NotmuchFilterDiscardUuencode NotmuchFilterDiscardUuencode;
-typedef struct _NotmuchFilterDiscardUuencodeClass NotmuchFilterDiscardUuencodeClass;
+typedef struct _NotmuchFilterDiscardNonTerms NotmuchFilterDiscardNonTerms;
+typedef struct _NotmuchFilterDiscardNonTermsClass NotmuchFilterDiscardNonTermsClass;
 
 typedef void (*filter_fun) (GMimeFilter *filter, char *in, size_t len, size_t prespace,
     char **out, size_t *outlen, size_t *outprespace);
@@ -41,44 +41,29 @@ typedef struct {
 } scanner_state_t;
 
 /**
- * NotmuchFilterDiscardUuencode:
+ * NotmuchFilterDiscardNonTerms:
  *
  * @parent_object: parent #GMimeFilter
  * @encode: encoding vs decoding
  * @state: State of the parser
  *
- * A filter to discard uuencoded portions of an email.
- *
- * A uuencoded portion is identified as beginning with a line
- * matching:
- *
- * begin [0-7][0-7][0-7] .*
- *
- * After that detection, and beginning with the following line,
- * characters will be discarded as long as the first character of each
- * line begins with M and subsequent characters on the line are within
- * the range of ASCII characters from ' ' to '`'.
- *
- * This is not a perfect UUencode filter. It's possible to have a
- * message that will legitimately match that pattern, (so that some
- * legitimate content is discarded). And for most UUencoded files, the
- * final line of encoded data (the line not starting with M) will be
- * indexed.
+ * A filter to discard non terms portions of an email, i.e. stuff not
+ * worth indexing.
  **/
-struct _NotmuchFilterDiscardUuencode {
+struct _NotmuchFilterDiscardNonTerms {
     GMimeFilter parent_object;
     GMimeContentType *content_type;
     filter_fun real_filter;
     int state;
 };
 
-struct _NotmuchFilterDiscardUuencodeClass {
+struct _NotmuchFilterDiscardNonTermsClass {
     GMimeFilterClass parent_class;
 };
 
-static GMimeFilter *notmuch_filter_discard_uuencode_new (GMimeContentType *content);
+static GMimeFilter *notmuch_filter_discard_non_terms_new (GMimeContentType *content);
 
-static void notmuch_filter_discard_uuencode_finalize (GObject *object);
+static void notmuch_filter_discard_non_terms_finalize (GObject *object);
 
 static GMimeFilter *filter_copy (GMimeFilter *filter);
 static void filter_filter (GMimeFilter *filter, char *in, size_t len, size_t prespace,
@@ -91,14 +76,14 @@ static void filter_reset (GMimeFilter *filter);
 static GMimeFilterClass *parent_class = NULL;
 
 static void
-notmuch_filter_discard_uuencode_class_init (NotmuchFilterDiscardUuencodeClass *klass)
+notmuch_filter_discard_non_terms_class_init (NotmuchFilterDiscardNonTermsClass *klass)
 {
     GObjectClass *object_class = G_OBJECT_CLASS (klass);
     GMimeFilterClass *filter_class = GMIME_FILTER_CLASS (klass);
 
     parent_class = (GMimeFilterClass *) g_type_class_ref (GMIME_TYPE_FILTER);
 
-    object_class->finalize = notmuch_filter_discard_uuencode_finalize;
+    object_class->finalize = notmuch_filter_discard_non_terms_finalize;
 
     filter_class->copy = filter_copy;
     filter_class->filter = filter_filter;
@@ -107,7 +92,7 @@ notmuch_filter_discard_uuencode_class_init (NotmuchFilterDiscardUuencodeClass *k
 }
 
 static void
-notmuch_filter_discard_uuencode_finalize (GObject *object)
+notmuch_filter_discard_non_terms_finalize (GObject *object)
 {
     G_OBJECT_CLASS (parent_class)->finalize (object);
 }
@@ -115,15 +100,15 @@ notmuch_filter_discard_uuencode_finalize (GObject *object)
 static GMimeFilter *
 filter_copy (GMimeFilter *gmime_filter)
 {
-    NotmuchFilterDiscardUuencode *filter = (NotmuchFilterDiscardUuencode *) gmime_filter;
+    NotmuchFilterDiscardNonTerms *filter = (NotmuchFilterDiscardNonTerms *) gmime_filter;
 
-    return notmuch_filter_discard_uuencode_new (filter->content_type);
+    return notmuch_filter_discard_non_terms_new (filter->content_type);
 }
 
 static void
 filter_filter (GMimeFilter *gmime_filter, char *inbuf, size_t inlen, size_t prespace,
        char **outbuf, size_t *outlen, size_t *outprespace) {
-    NotmuchFilterDiscardUuencode *filter = (NotmuchFilterDiscardUuencode *) gmime_filter;
+    NotmuchFilterDiscardNonTerms *filter = (NotmuchFilterDiscardNonTerms *) gmime_filter;
     (*filter->real_filter)(gmime_filter, inbuf, inlen, prespace, outbuf, outlen, outprespace);
 }
 
@@ -133,7 +118,7 @@ do_filter (const scanner_state_t states[],
    GMimeFilter *gmime_filter, char *inbuf, size_t inlen, size_t prespace,
    char **outbuf, size_t *outlen, size_t *outprespace)
 {
-    NotmuchFilterDiscardUuencode *filter = (NotmuchFilterDiscardUuencode *) gmime_filter;
+    NotmuchFilterDiscardNonTerms *filter = (NotmuchFilterDiscardNonTerms *) gmime_filter;
     register const char *inptr = inbuf;
     const char *inend = inbuf + inlen;
     char *outptr;
@@ -167,6 +152,25 @@ do_filter (const scanner_state_t states[],
     *outbuf = gmime_filter->outbuf;
 }
 
+/*
+ *
+ * A uuencoded portion is identified as beginning with a line
+ * matching:
+ *
+ * begin [0-7][0-7][0-7] .*
+ *
+ * After that detection, and beginning with the following line,
+ * characters will be discarded as long as the first character of each
+ * line begins with M and subsequent characters on the line are within
+ * the range of ASCII characters from ' ' to '`'.
+ *
+ * This is not a perfect UUencode filter. It's possible to have a
+ * message that will legitimately match that pattern, (so that some
+ * legitimate content is discarded). And for most UUencoded files, the
+ * final line of encoded data (the line not starting with M) will be
+ * indexed.
+ */
+
 static void
 filter_filter_uuencode (GMimeFilter *gmime_filter, char *inbuf, size_t inlen, size_t prespace,
  char **outbuf, size_t *outlen, size_t *outprespace)
@@ -210,7 +214,7 @@ filter_complete (GMimeFilter *filter, char *inbuf, size_t inlen, size_t prespace
 static void
 filter_reset (GMimeFilter *gmime_filter)
 {
-    NotmuchFilterDiscardUuencode *filter = (NotmuchFilterDiscardUuencode *) gmime_filter;
+    NotmuchFilterDiscardNonTerms *filter = (NotmuchFilterDiscardNonTerms *) gmime_filter;
 
     filter->state = 0;
 }
@@ -218,32 +222,32 @@ filter_reset (GMimeFilter *gmime_filter)
 /**
  * notmuch_filter_discard_uuencode_new:
  *
- * Returns: a new #NotmuchFilterDiscardUuencode filter.
+ * Returns: a new #NotmuchFilterDiscardNonTerms filter.
  **/
 static GMimeFilter *
-notmuch_filter_discard_uuencode_new (GMimeContentType *content_type)
+notmuch_filter_discard_non_terms_new (GMimeContentType *content_type)
 {
     static GType type = 0;
-    NotmuchFilterDiscardUuencode *filter;
+    NotmuchFilterDiscardNonTerms *filter;
 
     if (!type) {
  static const GTypeInfo info = {
-    sizeof (NotmuchFilterDiscardUuencodeClass),
+    sizeof (NotmuchFilterDiscardNonTermsClass),
     NULL, /* base_class_init */
     NULL, /* base_class_finalize */
-    (GClassInitFunc) notmuch_filter_discard_uuencode_class_init,
+    (GClassInitFunc) notmuch_filter_discard_non_terms_class_init,
     NULL, /* class_finalize */
     NULL, /* class_data */
-    sizeof (NotmuchFilterDiscardUuencode),
+    sizeof (NotmuchFilterDiscardNonTerms),
     0,    /* n_preallocs */
     NULL, /* instance_init */
     NULL  /* value_table */
  };
 
- type = g_type_register_static (GMIME_TYPE_FILTER, "NotmuchFilterDiscardUuencode", &info, (GTypeFlags) 0);
+ type = g_type_register_static (GMIME_TYPE_FILTER, "NotmuchFilterDiscardNonTerms", &info, (GTypeFlags) 0);
     }
 
-    filter = (NotmuchFilterDiscardUuencode *) g_object_newv (type, 0, NULL);
+    filter = (NotmuchFilterDiscardNonTerms *) g_object_newv (type, 0, NULL);
     filter->state = 0;
     filter->content_type = content_type;
     filter->real_filter = filter_filter_uuencode;
@@ -332,7 +336,7 @@ _index_mime_part (notmuch_message_t *message,
   GMimeObject *part)
 {
     GMimeStream *stream, *filter;
-    GMimeFilter *discard_uuencode_filter;
+    GMimeFilter *discard_non_terms_filter;
     GMimeDataWrapper *wrapper;
     GByteArray *byte_array;
     GMimeContentDisposition *disposition;
@@ -422,10 +426,10 @@ _index_mime_part (notmuch_message_t *message,
     g_mime_stream_mem_set_owner (GMIME_STREAM_MEM (stream), FALSE);
 
     filter = g_mime_stream_filter_new (stream);
-    discard_uuencode_filter = notmuch_filter_discard_uuencode_new (content_type);
+    discard_non_terms_filter = notmuch_filter_discard_non_terms_new (content_type);
 
     g_mime_stream_filter_add (GMIME_STREAM_FILTER (filter),
-      discard_uuencode_filter);
+      discard_non_terms_filter);
 
     charset = g_mime_object_get_content_type_parameter (part, "charset");
     if (charset) {
@@ -447,7 +451,7 @@ _index_mime_part (notmuch_message_t *message,
 
     g_object_unref (stream);
     g_object_unref (filter);
-    g_object_unref (discard_uuencode_filter);
+    g_object_unref (discard_non_terms_filter);
 
     g_byte_array_append (byte_array, (guint8 *) "\0", 1);
     body = (char *) g_byte_array_free (byte_array, FALSE);
--
2.11.0

David Bremner-2

[PATCH 6/7] lib/index.cc: generalize filter state machine

In reply to this post by David Bremner-2
To match things more complicated than fixed strings, we need states
with multiple out arrows.
---
 lib/index.cc | 22 ++++++++++++----------
 1 file changed, 12 insertions(+), 10 deletions(-)

diff --git a/lib/index.cc b/lib/index.cc
index 3bb1ac1c..fd66762c 100644
--- a/lib/index.cc
+++ b/lib/index.cc
@@ -122,23 +122,25 @@ do_filter (const scanner_state_t states[],
     register const char *inptr = inbuf;
     const char *inend = inbuf + inlen;
     char *outptr;
-    int next;
+    int next, current;
     (void) prespace;
 
 
     g_mime_filter_set_size (gmime_filter, inlen, FALSE);
     outptr = gmime_filter->outbuf;
 
+    current = filter->state;
     while (inptr < inend) {
- if (*inptr >= states[filter->state].a &&
-    *inptr <= states[filter->state].b)
- {
-    next = states[filter->state].next_if_match;
- }
- else
- {
-    next = states[filter->state].next_if_not_match;
- }
+ /* do "fake transitions" until we fire a rule, or run out of rules */
+ do {
+    if (*inptr >= states[current].a && *inptr <= states[current].b)  {
+ next = states[current].next_if_match;
+    } else  {
+ next = states[current].next_if_not_match;
+    }
+
+    current = next;
+ } while (next != states[next].state);
 
  if (filter->state < first_skipping_state)
     *outptr++ = *inptr;
--
2.11.0

David Bremner-2

[PATCH 7/7] lib/index: add simple html filter

In reply to this post by David Bremner-2
Just drop all tags.
---
 lib/index.cc               | 21 ++++++++++++++++++++-
 test/T680-html-indexing.sh |  5 ++++-
 2 files changed, 24 insertions(+), 2 deletions(-)

diff --git a/lib/index.cc b/lib/index.cc
index fd66762c..324e6e79 100644
--- a/lib/index.cc
+++ b/lib/index.cc
@@ -206,6 +206,22 @@ filter_filter_uuencode (GMimeFilter *gmime_filter, char *inbuf, size_t inlen, si
 }
 
 static void
+filter_filter_html (GMimeFilter *gmime_filter, char *inbuf, size_t inlen, size_t prespace,
+    char **outbuf, size_t *outlen, size_t *outprespace)
+{
+    static const scanner_state_t states[] = {
+ {0,  '<',  '<',  1,  0},
+ {1,  '\'', '\'', 4,  2},  /* scanning for quote or > */
+ {1,  '"',  '"',  5,  3},
+ {1,  '>',  '>',  0,  1},
+ {4,  '\'', '\'', 1,  4},  /* inside single quotes */
+ {5,  '"', '"',   1,  5},  /* inside double quotes */
+    };
+    do_filter(states, 1,
+      gmime_filter, inbuf, inlen, prespace, outbuf, outlen, outprespace);
+}
+
+static void
 filter_complete (GMimeFilter *filter, char *inbuf, size_t inlen, size_t prespace,
  char **outbuf, size_t *outlen, size_t *outprespace)
 {
@@ -252,7 +268,10 @@ notmuch_filter_discard_non_terms_new (GMimeContentType *content_type)
     filter = (NotmuchFilterDiscardNonTerms *) g_object_newv (type, 0, NULL);
     filter->state = 0;
     filter->content_type = content_type;
-    filter->real_filter = filter_filter_uuencode;
+    if (g_mime_content_type_is_type (content_type, "text", "html"))
+ filter->real_filter = filter_filter_html;
+    else
+ filter->real_filter = filter_filter_uuencode;
     return (GMimeFilter *) filter;
 }
 
diff --git a/test/T680-html-indexing.sh b/test/T680-html-indexing.sh
index 5e9cc4cb..74f33708 100755
--- a/test/T680-html-indexing.sh
+++ b/test/T680-html-indexing.sh
@@ -5,10 +5,13 @@ test_description="indexing of html parts"
 add_email_corpus html
 
 test_begin_subtest 'embedded images should not be indexed'
-test_subtest_known_broken
 notmuch search kwpza7svrgjzqwi8fhb2msggwtxtwgqcxp4wbqr4wjddstqmeqa7 > OUTPUT
 test_expect_equal_file /dev/null OUTPUT
 
+test_begin_subtest 'ignore > in attribute text'
+notmuch search swordfish | notmuch_search_sanitize > OUTPUT
+test_expect_equal_file /dev/null OUTPUT
+
 test_begin_subtest 'non tag text should be indexed'
 notmuch search hunter2 | notmuch_search_sanitize > OUTPUT
 cat <<EOF > EXPECTED
--
2.11.0
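
To illustrate what the new 'ignore > in attribute text' subtest guards against, here is a small standalone sketch that runs the table above over a whole buffer. It is only an approximation of do_filter: strip_tags is a hypothetical helper, the sample input is made up (the real test corpus is not shown in this thread), the struct field order is inferred from the row initializers, and the sketch assumes the state is stored after the emit check, as the "filter->state < first_skipping_state" test in the previous patch suggests.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Field order inferred from the row initializers in the patch (assumption). */
typedef struct {
    int state;
    int a;
    int b;
    int next_if_match;
    int next_if_not_match;
} scanner_state_t;

static const scanner_state_t html_states[] = {
    {0, '<',  '<',  1, 0},
    {1, '\'', '\'', 4, 2},  /* scanning for quote or > */
    {1, '"',  '"',  5, 3},
    {1, '>',  '>',  0, 1},
    {4, '\'', '\'', 1, 4},  /* inside single quotes */
    {5, '"',  '"',  1, 5},  /* inside double quotes */
};

/* Hypothetical helper: copy every character that is read while the scanner
 * is in a non-skipping state. 'out' must have room for inlen + 1 bytes.
 * Returns the number of characters written, not counting the final NUL. */
static size_t
strip_tags (const char *in, size_t inlen, char *out)
{
    const int first_skipping_state = 1;  /* same value the patch passes */
    int state = 0;
    size_t outlen = 0;

    assert (in != NULL && out != NULL);
    for (size_t i = 0; i < inlen; i++) {
        unsigned char c = (unsigned char) in[i];
        int current = state, next;

        /* Follow continuation rows until a real transition fires. */
        do {
            if (c >= html_states[current].a && c <= html_states[current].b)
                next = html_states[current].next_if_match;
            else
                next = html_states[current].next_if_not_match;
            current = next;
        } while (next != html_states[next].state);

        /* The check uses the state *before* this character was consumed. */
        if (state < first_skipping_state)
            out[outlen++] = (char) c;
        state = next;
    }
    out[outlen] = '\0';
    return outlen;
}
```

For the made-up input <a title="x > swordfish">hunter2</a>, the '>' inside the double-quoted attribute does not end the tag, so "swordfish" is dropped and "hunter2" survives. Because the emit check looks at the state before the transition, the '<' that opens each tag is still copied in this sketch, giving "<hunter2<".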

Daniel Lublin (quite)

Re: Drop HTML tags when indexing

In reply to this post by David Bremner-2
This patch is good. notmuch now gets through my whole archive of 175k mails,
memory usage peaking at 430M.

David Bremner-2

Re: Drop HTML tags when indexing

In reply to this post by David Bremner-2
David Bremner <[hidden email]> writes:

> Steven Allen pointed out [2] that the previous scanner [1] was a
> little too simplistic. This version handles (or claims to) quoted
> strings in attributes, which can apparently contain '>' and '<'
> characters. This required generalizing the state machine runner a bit
> [3] to handle states with out-degree more than two.

For what it is worth, this series shrank my index by about the same
amount as skipping html messages entirely: about 15% of my messages
have html parts, and this series made the index about 15% smaller.

d

David Bremner-2

Re: [PATCH 1/7] test: add known broken test for indexing html

In reply to this post by David Bremner-2
David Bremner <[hidden email]> writes:

> 'quite' on IRC reported that notmuch new was grinding to a halt during
> initial indexing, and we eventually narrowed the problem down to some
> html parts with large embedded images. These cause the number of terms
> added to the Xapian database to explode (the first 400 messages
> generated 4.6M unique terms), and of course the resulting terms are
> not much use for searching.
>
> The second test is a sanity check for any "improved" indexing of HTML.

Pushed the first patch in the series to master.

d