Is tagsoup/scalpel the best option for HTML parsing right now? I'm running into performance issues with it and wondering if there is a better option already or if I should try to dig into the mess that is HTML parsing myself and try to make some improvements.
I've only used html-conduit for some simple things, which uses blaze as a backend. It advertises itself as blazingly fast :wink:
In general, HTML parsing is a messy business. If you limit yourself to well-formed, strict HTML you can write something that performs very well.
But if you need to parse a broad range of HTML documents, parsers have to lean on plenty of heuristics to clean up the data, and those can be slow.
It's all trade-offs.
Yeah I'm using sanitizeBalance to clean up some untrusted HTML that can definitely be malformed, but it uses tagsoup under the hood which has performance issues :frown:
I don't know that there are any better algorithms than what they use for parsing malformed HTML. That's the trade-off you need to deal with in that area. :shrug:
But if, after parsing the malformed HTML, you need to revisit the data again, you could at least serialize out the fixed HTML so that later batch jobs will be much faster... :thinking:
Yeah I'm doing it all in real-time right now, but I'll probably do that in the future as an optimization.
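For what it's worth, here is a minimal sketch of that "serialize the cleaned HTML once" idea, in case it helps later. Everything in it is an assumption for illustration: `sanitizeCached`, the cache directory, and the caller-supplied `key` are hypothetical names, and the sanitizer is passed in as a plain `String -> String` function rather than calling `sanitizeBalance` directly (which works on `Text`, so you would need to adapt it). It only uses `directory` and `filepath`, which ship with GHC.

```haskell
import Control.Exception (evaluate)
import System.Directory (createDirectoryIfMissing, doesFileExist)
import System.FilePath ((</>))

-- | Return the cleaned HTML for a document, running the (slow) sanitizer
-- only when no cached copy exists on disk yet.
--
-- Assumptions: `sanitize` stands in for the real cleaner, `cacheDir` is a
-- hypothetical directory, and `key` is a caller-chosen, filesystem-safe
-- identifier for the document (for example a database id).
sanitizeCached
  :: (String -> String)  -- ^ sanitizer to run on a cache miss
  -> FilePath            -- ^ cache directory
  -> String              -- ^ cache key for this document
  -> String              -- ^ raw, possibly malformed HTML
  -> IO String
sanitizeCached sanitize cacheDir key rawHtml = do
  createDirectoryIfMissing True cacheDir
  let path = cacheDir </> key
  hit <- doesFileExist path
  if hit
    then do
      -- Force the whole file so the lazy read handle is closed promptly.
      cached <- readFile path
      _ <- evaluate (length cached)
      pure cached
    else do
      let clean = sanitize rawHtml
      writeFile path clean
      pure clean
```

The first call for a given key pays the full parsing cost and writes the result; later calls with the same key just read the file back. Note that the key has to change whenever the raw HTML changes, otherwise a stale copy is returned, and this sketch does nothing about concurrent writers.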