HTML Parsing - Haskell

Welcome to the Functional Programming Zulip Chat Archive.

jkeuhlen

Is tagsoup/scalpel the best option for HTML parsing right now? I'm running into performance issues with it and wondering if there is a better option already or if I should try to dig into the mess that is HTML parsing myself and try to make some improvements.

Torsten Schmits

I've only used html-conduit for some simple things, which uses blaze as a backend. It advertises itself as blazingly fast :wink:
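For readers who have not used it, here is a minimal sketch of parsing with html-conduit, assuming the html-conduit and xml-conduit packages are installed. `titles` is a hypothetical helper name, not something from the thread, and nothing here is a performance claim.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import qualified Data.ByteString.Lazy.Char8 as L
import qualified Data.Text as T
import Text.HTML.DOM (parseLBS)
import Text.XML.Cursor (content, element, fromDocument, ($//), (&/))

-- | Hypothetical helper: collect the text inside every <title> element.
-- parseLBS is lenient and always yields a Document, even for messy input.
titles :: L.ByteString -> [T.Text]
titles html = fromDocument (parseLBS html) $// element "title" &/ content
```

In GHCi, `titles "<html><head><title>Hi</title></head></html>"` walks all descendants of the root and returns the text nodes found directly under each `title` element.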

James King

In general, HTML parsing is a messy business. If you limit yourself to well-formed, strict HTML you can write things that are optimally performant.

James King

But if you need to be able to parse a broad range of HTML documents, parsers rely on plenty of heuristics to clean up the data, which can be slow.

jkeuhlen

Yeah I'm using sanitizeBalance to clean up some untrusted HTML that can definitely be malformed, but it uses tagsoup under the hood which has performance issues :frown:
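For context, `sanitizeBalance` is exported by `Text.HTML.SanitizeXSS` in the xss-sanitize package and has type `Text -> Text`. A minimal sketch of calling it follows; `cleanUntrusted` and `wasModified` are hypothetical names added for illustration.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Data.Text (Text)
import Text.HTML.SanitizeXSS (sanitizeBalance)

-- | Hypothetical wrapper: strip unsafe markup and balance unclosed tags.
cleanUntrusted :: Text -> Text
cleanUntrusted = sanitizeBalance

-- | Hypothetical check: did sanitizing change the input at all?
-- Useful for logging how often untrusted input actually needed fixing.
wasModified :: Text -> Bool
wasModified raw = cleanUntrusted raw /= raw
```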

James King

I don't know that there are any better algorithms than what they use for parsing malformed HTML. That's the trade-off you need to deal with in that area. :shrug:

James King

But if you need to revisit the data after parsing the malformed HTML, you could at least serialize out the fixed HTML so that later batch jobs will be much faster... :thinking:
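The idea of serializing the cleaned HTML so later jobs skip the slow step could be sketched roughly as follows. This is only an illustration under assumptions: `cleanOnce` and the cache path are hypothetical, the cleaner is passed in as a plain function (for example `sanitizeBalance` from xss-sanitize), and there is no cache invalidation or locking.

```haskell
import qualified Data.Text as T
import qualified Data.Text.IO as TIO
import System.Directory (doesFileExist)

-- | Hypothetical helper: run a slow cleaner at most once per cache file.
-- If the cache file exists, its contents are returned and the cleaner is
-- not run; otherwise the cleaned output is written out and returned.
cleanOnce :: (T.Text -> T.Text) -> FilePath -> T.Text -> IO T.Text
cleanOnce cleaner cachePath raw = do
  cached <- doesFileExist cachePath
  if cached
    then TIO.readFile cachePath
    else do
      let cleaned = cleaner raw
      TIO.writeFile cachePath cleaned
      pure cleaned
```

A caller would pick one cache path per document, for example derived from a document ID, so that a later batch job calling `cleanOnce` with the same path reads the stored result instead of re-parsing.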

jkeuhlen

Yeah I'm doing it all in real-time right now, but I'll probably do that in the future as an optimization.