[David Bremner] Re: RFC: drop html tags

David Bremner-2

[David Bremner] Re: RFC: drop html tags

Steven Allen <[hidden email]> writes:

> David Bremner <[hidden email]> writes:
>> Although HTML itself is not regular (probably not anything sane in the
>> latest incarnations), well-formed tags should be, as far as I know.
>> Here is a simple fix to the problem of giant embedded images in HTML:
>> drop all tags.  Unbalanced < > could force an HTML part not to be
>> indexed.
>
> What about attribute values?
>
>     <input value="a<b">
>
> Contrary to a lot of misinformation on the web, I'm pretty sure this is
> perfectly legal in HTML (not XML).
>
> Docs: https://www.w3.org/TR/html5/syntax.html#attributes-0
>
> In the JavaScript regex format, I believe the correct way to parse this is:
>
>     /<("[^"]*"|'[^']*'|[^"'>]*)*>/g
>
> Basically, while inside a tag, ignore everything between double and single quotes.

Thanks for the reality check. It should be possible to handle quotes. In
my limited understanding of that regex, we can do a bit better by
forcing pairs of quotes to match, since <chaos attribute="'"> is
probably legal.
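
To make the quote handling concrete, here is a minimal sketch comparing a
naive tag pattern with the one quoted above. The names dropTags, NAIVE_TAG
and QUOTE_AWARE_TAG are made up for illustration, and the naive pattern is
only an assumed baseline, not the pattern from any actual patch. The sample
puts a > inside the attribute value, since that is the character that ends
a naive match too early.

```javascript
// Assumed baseline: a tag ends at the first '>' wherever it occurs.
const NAIVE_TAG = /<[^>]*>/g;
// Pattern quoted in this thread: quoted attribute values are consumed whole,
// so a '>' inside quotes does not terminate the tag.
const QUOTE_AWARE_TAG = /<("[^"]*"|'[^']*'|[^"'>]*)*>/g;

// Hypothetical helper: remove everything the given pattern treats as a tag.
function dropTags(html, pattern) {
  return html.replace(pattern, "");
}

const sample = 'before <input value="a>b"> after';
console.log(JSON.stringify(dropTags(sample, NAIVE_TAG)));       // "before b\"> after"
console.log(JSON.stringify(dropTags(sample, QUOTE_AWARE_TAG))); // "before  after"
```

With the naive pattern, the tail of the attribute value leaks into the text
that would be indexed; the quote-aware pattern removes the whole tag.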

Cheers,

d

_______________________________________________
notmuch mailing list
[hidden email]
https://notmuchmail.org/mailman/listinfo/notmuch
David Bremner-2

Re: [David Bremner] Re: RFC: drop html tags

David Bremner <[hidden email]> writes:

> From: David Bremner <[hidden email]>
> Subject: Re: RFC: drop html tags
> To: Steven Allen <[hidden email]>
> Date: Tue, 21 Mar 2017 14:03:10 -0300
>
> Steven Allen <[hidden email]> writes:
>
>> In the JavaScript regex format, I believe the correct way to parse this is:
>>
>>     /<("[^"]*"|'[^']*'|[^"'>]*)*>/g
>>
>> Basically, while inside a tag, ignore everything between double and single quotes.
>
> Thanks for the reality check. It should be possible to handle quotes. In
> my limited understanding of that regex, we can do a bit better by
> forcing pairs of quotes to match, since <chaos attribute="'"> is
> probably legal.

Actually, I'm wrong. My eyes just glaze over when faced with any
non-trivial regex, I guess.
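
A quick check of why no extra pairing is needed, as a sketch: the
double-quoted alternative "[^"]*" accepts a single quote in its body, and
the single-quoted alternative accepts a double quote, so each quote is
already matched against its own kind. The helper name findTags and the
sample string are made up for illustration.

```javascript
// Pattern quoted earlier in this thread.
const TAG = /<("[^"]*"|'[^']*'|[^"'>]*)*>/g;

// Hypothetical helper: list every substring the pattern treats as a tag.
function findTags(html) {
  return html.match(TAG) || [];
}

// A single quote inside double quotes, and the reverse.
const tags = findTags(`I <chaos attribute="'"> am <b title='"'>fine</b>`);
console.log(tags);
```

Each of the three tags in the sample is matched whole, including the two
whose attribute value is a lone quote of the other kind.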

d