Parsing pandoc documents - Haskell

Welcome to the Functional Programming Zulip Chat Archive.

Jan van Brügge

Hi, I want to extract information out of a document parsed by pandoc and I am questioning my methods. At the moment I am using partial pattern matches in the Maybe monad to extract info, but this is quite tedious. For example, I know the first thing will be a level-1 header with the content My last n months, and I want to extract the n as an Int.

At the moment, I am doing:

parseDoc :: [Block] -> Maybe MyData
parseDoc doc = do
  (Header 1 x1 : xs1) <- pure doc
  n <- readMaybe . takeWhile isDigit . unpack =<< stripPrefix "My last " $ stringify x1

Is there a better way?
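
For reference, a minimal self-contained sketch of the partial-pattern-match approach described above. It assumes a simplified stand-in `Block` type with plain `String` titles (the real pandoc `Header` carries an attribute and a list of inlines), and the name `parseMonths` is hypothetical. Note that in the snippet as posted, `=<<` binds tighter than `$`, so this sketch applies `stripPrefix` to the title first and binds the result afterwards.

```haskell
import Data.Char (isDigit)
import Data.List (stripPrefix)
import Text.Read (readMaybe)

-- Simplified stand-in for pandoc's Block (an assumption for this sketch).
data Block = Header Int String | Para String
  deriving (Show)

-- Extract the n from a leading level-1 header "My last n months".
parseMonths :: [Block] -> Maybe Int
parseMonths doc = do
  -- A failed pattern match in a Maybe do-block yields Nothing via MonadFail.
  (Header 1 title : _rest) <- pure doc
  rest <- stripPrefix "My last " title
  -- readMaybe fails on an empty digit run, so non-numeric titles give Nothing.
  readMaybe (takeWhile isDigit rest)
```

In GHCi, `parseMonths [Header 1 "My last 6 months"]` should give `Just 6`, while a document that starts with anything else gives `Nothing`.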

Sandy Maguire

you can use everything (++) $ mkQ [] $ \case Str something -> try to extract your stuff here

Sandy Maguire

or you know, whatever pattern you want to find

Jan van Brügge

thanks, I will try that

Jan van Brügge

Is there a good tutorial about this somewhere? The site that is linked everywhere in the docs is apparently down

Sandy Maguire

it's been a while since i read it

Sandy Maguire

at a high level; a Data a constraint says that you have some sort of run-time representation of a

Sandy Maguire

run-time generic representation*

Sandy Maguire

(as opposed to GHC.Generics which is at compile time)

Sandy Maguire

everything lets you run a monoidal query over some Data a => a

Sandy Maguire

you do that with mkQ, which takes a default value for when nothing matches, plus a lambda specialized at whatever type you want

Sandy Maguire

everything will search through the value for every subterm of the type your lambda takes, run your function on each one, and accumulate the results

Sandy Maguire

for example, i wrote this just today, which finds all of the UnboundVar nodes inside of an Exp:

unboundVars :: Exp -> [Name]
unboundVars = everything (++) $
  mkQ [] $ \case
    UnboundVarE n -> [n]
    _ -> []
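
The same idea applied to a document-shaped tree, as a sketch that runs with only `base`. The `Inline` and `Block` types here are simplified stand-ins, not pandoc's, and `everything` and `mkQ` are defined locally in the way I understand syb to define them (the real ones live in `Data.Generics`); `allStrs` is a hypothetical name.

```haskell
{-# LANGUAGE DeriveDataTypeable #-}
{-# LANGUAGE RankNTypes #-}

import Data.Data (Data, Typeable, cast, gmapQ)

-- Simplified stand-ins for pandoc's types (assumptions for this sketch).
data Inline = Str String | Space
  deriving (Show, Data)

data Block = Header Int [Inline] | Para [Inline]
  deriving (Show, Data)

-- Lift a query on one specific type to a query on any type,
-- returning the default when the types do not match.
mkQ :: (Typeable a, Typeable b) => r -> (b -> r) -> a -> r
mkQ def f x = maybe def f (cast x)

-- Run the query on a node and on all of its subterms, top-down and
-- left to right, combining the results.
everything :: Data a => (r -> r -> r) -> (forall d. Data d => d -> r) -> a -> r
everything combine q x = foldl combine (q x) (gmapQ (everything combine q) x)

-- Collect the text of every Str node, however deeply it is nested.
allStrs :: Data a => a -> [String]
allStrs = everything (++) (mkQ [] fromInline)
  where
    fromInline :: Inline -> [String]
    fromInline (Str s) = [s]
    fromInline _ = []
```

Because the query is generic, `allStrs` accepts a single `Block`, a list of blocks, or any other type with a `Data` instance.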
Sandy Maguire

you can also use extQ to glue multiple mkQs together, eg if you want to simultaneously target different types in your structure
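
A sketch of what gluing two queries with extQ can look like, again using only `base`. The types are simplified stand-ins rather than pandoc's, the three combinators are local definitions written the way I understand syb's to behave, and `describe` is a hypothetical name.

```haskell
{-# LANGUAGE DeriveDataTypeable #-}
{-# LANGUAGE RankNTypes #-}

import Data.Data (Data, Typeable, cast, gmapQ)

-- Simplified stand-ins for pandoc's types (assumptions for this sketch).
data Inline = Str String | Space
  deriving (Show, Data)

data Block = Header Int [Inline] | Para [Inline]
  deriving (Show, Data)

mkQ :: (Typeable a, Typeable b) => r -> (b -> r) -> a -> r
mkQ def f x = maybe def f (cast x)

-- Extend a generic query with a case for one more specific type.
extQ :: (Typeable a, Typeable b) => (a -> r) -> (b -> r) -> a -> r
extQ generic specific x = maybe (generic x) specific (cast x)

everything :: Data a => (r -> r -> r) -> (forall d. Data d => d -> r) -> a -> r
everything combine q x = foldl combine (q x) (gmapQ (everything combine q) x)

-- Target Inline and Block nodes in a single traversal.
describe :: Data a => a -> [String]
describe = everything (++) (mkQ [] inlineQ `extQ` blockQ)
  where
    inlineQ :: Inline -> [String]
    inlineQ (Str s) = ["str:" ++ s]
    inlineQ _ = []

    blockQ :: Block -> [String]
    blockQ (Header n _) = ["header:" ++ show n]
    blockQ _ = []
```

Each node is tested against both cases, so a header contributes its own entry before the entries of the inlines it contains.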

Jan van Brügge

I will try, thanks Sandy!

Jan van Brügge

SYB was not really a solution to my problem, because the data is not nested very deeply. I finally found something I am reasonably happy with, this disgusting beauty:

newtype ParseM s a = MkParseM (State s (Maybe a))

instance Functor (ParseM s) where
  fmap f (MkParseM x) = MkParseM (fmap (fmap f) x)

instance Applicative (ParseM s) where
  pure x = MkParseM $ pure $ Just x
  (MkParseM x) <*> (MkParseM y) = MkParseM $ x >>= \f -> y >>= \v -> pure $ f <*> v

instance Monad (ParseM s) where
  (MkParseM x) >>= f = MkParseM $
    x >>= \case
      Just y -> case (f y) of MkParseM r -> r
      Nothing -> pure Nothing

instance MonadFail (ParseM s) where
  fail _ = MkParseM (pure Nothing)

embed :: Maybe a -> ParseM s a
embed = MkParseM . pure

head' :: ParseM [Block] Block
head' = MkParseM $ do
  doc <- get
  case doc of
    x : xs -> put xs *> pure (Just x)
    _ -> pure Nothing

runParser :: ParseM s a -> s -> Maybe a
runParser (MkParseM x) = evalState x

parseRetro :: [Block] -> Maybe Int
parseRetro = runParser $ do
  Header 1 _ (stringify -> h1) <- head'
  n <- embed $ readMaybe . takeWhile isDigit . unpack =<< stripPrefix "My past " h1
  let past = MkPastMonths n
  pure $ n
TheMatten

@Jan van Brügge

parseRetro :: [Block] -> Maybe Int
parseRetro (Header 1 _ h1' : _)
  | (stringify -> stripPrefix "My past " -> Just h1) <- h1'
  = fmap MkPastMonths $ readMaybe $ takeWhile isDigit $ unpack h1
parseRetro _ = Nothing
Jan van Brügge

@TheMatten this is just the first thing I need to parse, and threading through the remainder of the list would get old very quickly

Jan van Brügge

with this State wrapper, I can just continue with partial pattern matches until I've exhausted the list
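
To illustrate the point about continuing with partial pattern matches, here is a sketch of a closely related formulation: a parser as a function `s -> Maybe (a, s)`. This is a swapped-in shape, not the ParseM definition above, and it differs in that the state is discarded when a step fails. The `Block` type is a simplified stand-in, and `next`, `embed`, and `parseIntro` are hypothetical names.

```haskell
{-# LANGUAGE LambdaCase #-}

import Data.Char (isDigit)
import Data.List (stripPrefix)
import Text.Read (readMaybe)

-- Simplified stand-in for pandoc's Block (an assumption for this sketch).
data Block = Header Int String | Para String
  deriving (Show, Eq)

-- A parser consumes some state and may fail.
newtype Parser s a = Parser { runParser :: s -> Maybe (a, s) }

instance Functor (Parser s) where
  fmap f (Parser p) = Parser $ \s -> fmap (\(a, s') -> (f a, s')) (p s)

instance Applicative (Parser s) where
  pure x = Parser $ \s -> Just (x, s)
  Parser pf <*> Parser pa = Parser $ \s -> do
    (f, s') <- pf s
    (a, s'') <- pa s'
    pure (f a, s'')

instance Monad (Parser s) where
  Parser p >>= f = Parser $ \s -> do
    (a, s') <- p s
    runParser (f a) s'

-- A failed pattern match in a do-block makes the whole parser fail.
instance MonadFail (Parser s) where
  fail _ = Parser (const Nothing)

-- Pop the next element, failing on an empty list.
next :: Parser [b] b
next = Parser $ \case
  x : xs -> Just (x, xs)
  [] -> Nothing

-- Lift a Maybe into the parser without touching the state.
embed :: Maybe a -> Parser s a
embed m = Parser $ \s -> fmap (\a -> (a, s)) m

-- Two successive partial matches; the remaining list is threaded implicitly.
parseIntro :: [Block] -> Maybe (Int, String)
parseIntro doc = fmap fst . flip runParser doc $ do
  Header 1 title <- next
  n <- embed (stripPrefix "My past " title >>= readMaybe . takeWhile isDigit)
  Para summary <- next
  pure (n, summary)
```

Each `<- next` line consumes one block, so adding another expected block is one more line rather than another layer of manual list threading.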

Jan van Brügge

For example, I think this code reads rather nicely:

parseRetro :: [Block] -> Maybe [Project]
parseRetro = runParser $ do
  Header 1 _ (stringify -> h1) <- head'
  (n :: Int) <- embed $ readMaybe . takeWhile isDigit . unpack =<< stripPrefix "My past " h1
  projects <-
    takeWhileM
      ( \case
          Header 2 _ (stringify -> ("Project" `isPrefixOf`) -> True) -> Just parseProject
          _ -> Nothing
      )
  -- let past = MkPastMonths n projects
  pure projects
Jan van Brügge

With takeWhileM being:

takeWhileM :: (a -> Maybe (Parser [a] b)) -> Parser [a] [b]
takeWhileM f = get' >>= \case
  [] -> pure []
  (x : xs) -> case f x of
    Nothing -> pure []
    Just p -> put' xs *> p >>= \b -> (:) b <$> takeWhileM f
Sridhar Ratnakumar

I only skimmed this topic, but can't this be solved using Text.Pandoc.Walk (see query and walkM)?

example: https://github.com/srid/rib/blob/e41eae3/src/Rib/Parser/Pandoc.hs#L102-L109

Jan van Brügge

The problem is not finding the stuff, but rather pulling the information out of it. AFAICT query can't help me there

Sridhar Ratnakumar

That example actually pulls the image URL using query

Asad Saeeduddin

@Jan van Brügge I can't exactly reconcile your snippet:

parseDoc :: [Block] -> Maybe MyData
parseDoc doc = do
  (Header 1 x1 : xs1) <- pure doc
  n <- readMaybe . takeWhile isDigit . unpack =<< stripPrefix "My last " $ stringify x1

with the partial maybe matching you're referring to

Asad Saeeduddin

@Jan van Brügge is something like this what you're looking for:

  let
    focus :: Prism' (Maybe (a + String)) String
    focus = _Just . right' . predicate ((< 7) . length)

  print $ re focus `whittle`
    [ Just $ Right $ "Hello!"
    , Nothing
    , Just $ Left $ -1
    , Just $ Right $ "Whoops!"
    ]
  -- > ["Hello!"]
Asad Saeeduddin
whittle :: Filterable f => Coprism s t a b -> f b -> f t
whittle p b = runJoker $ p $ Joker $ b

There's a trivial Filterable instance for any Alternative + Monad (although whether it's lawful depends on the specific monad and alternative instances)
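
A sketch of what that trivial instance amounts to, written as a standalone function rather than a class instance so that it needs only `base`. The name `mapMaybeAM` is hypothetical, and whether the result behaves lawfully still depends on the specific monad, as noted above.

```haskell
import Control.Applicative (Alternative, empty)

-- Keep the elements for which the function returns Just, using only
-- the Monad and Alternative structure of the container.
mapMaybeAM :: (Alternative f, Monad f) => (a -> Maybe b) -> f a -> f b
mapMaybeAM f fa = fa >>= maybe empty pure . f

-- The length predicate from the example above, expressed as a Maybe-valued function.
shortOnly :: String -> Maybe String
shortOnly s
  | length s < 7 = Just s
  | otherwise = Nothing
```

For lists, `mapMaybeAM shortOnly ["Hello!", "Whoops!"]` should give `["Hello!"]`, matching the output shown for whittle.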