Parsing pandoc documents - Haskell

Welcome to the Functional Programming Zulip Chat Archive.

Jan van Brügge

2020-03-06 10:55:27

Hi, I want to extract information out of a document parsed by pandoc and I am questioning my methods. At the moment I am using partial pattern matches in the Maybe monad to extract info, but this is quite tedious. For example, I know the first thing will be a level-1 header with the content My last n months, and I want to extract the n as an Int.

At the moment, I am doing:

parseDoc :: [Block] -> Maybe MyData
parseDoc doc = do
  (Header 1 x1 : xs1) <- pure doc
  n <- readMaybe . takeWhile isDigit . unpack =<< stripPrefix "My last " (stringify x1)

Is there a better way?
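
For reference, a minimal sketch of what "partial pattern matches in the Maybe monad" means here. It assumes a toy Block type whose header carries an already-stringified title (pandoc's real Header has an Attr and a list of Inline instead), and monthsFromDoc is a hypothetical name. A refutable pattern on the left of <- calls fail when it does not match, and in Maybe that is Nothing.

```haskell
import Data.Char (isDigit)
import Data.List (stripPrefix)
import Text.Read (readMaybe)

-- Hypothetical stand-in for pandoc's Block, just enough for the sketch.
data Block
  = Header Int String -- level and already-stringified title
  | Para String
  deriving (Show, Eq)

-- Each step can fail; the first failure makes the whole block Nothing.
monthsFromDoc :: [Block] -> Maybe Int
monthsFromDoc doc = do
  -- Refutable pattern: anything but a leading level-1 header yields Nothing.
  (Header 1 title : _rest) <- pure doc
  -- Drop the fixed prefix, then keep the leading run of digits.
  digits <- takeWhile isDigit <$> stripPrefix "My last " title
  -- readMaybe "" is Nothing, so a missing number also fails.
  readMaybe digits
```

With this, a document starting with Header 1 "My last 6 months" gives Just 6, and any other shape gives Nothing.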

2020-03-06 10:57:28

SYB

2020-03-06 10:58:21

you can use everywhere (++) $ mkQ [] $ \case Str something -> try to extract your stuff here

2020-03-06 10:59:51

or you know, whatever pattern you want to find

Jan van Brügge

2020-03-06 11:02:57

thanks, I will try that

Jan van Brügge

2020-03-06 11:16:28

[image]

2020-03-06 11:16:50

oops sorry

2020-03-06 11:16:52

you want everything

2020-03-06 11:16:55

not everywhere

2020-03-06 11:17:03

everywhere is a cata

Jan van Brügge

2020-03-06 11:17:21

Is there a good tutorial about this somewhere? The site that is linked everywhere in the docs is apparently down

2020-03-06 11:18:32

this is the paper: https://www.microsoft.com/en-us/research/wp-content/uploads/2003/01/hmap.pdf

2020-03-06 11:18:35

it's been a while since i read it

2020-03-06 11:19:28

at a high level, a Data a constraint says that you have some sort of run-time representation of a

2020-03-06 11:19:38

run-time generic representation*

2020-03-06 11:19:48

(as opposed to GHC.Generics which is at compile time)

2020-03-06 11:20:13

everything lets you run a monoidal query over some Data a => a

2020-03-06 11:20:44

you do that with mkQ, which takes a default value for when nothing matches, plus a lambda specialized at whatever type you want

2020-03-06 11:21:03

everything will search through the value for nodes of the type your lambda takes, run your function on each of them, and accumulate the results

2020-03-06 11:21:42

for example, i wrote this just today, which finds all of the UnboundVar nodes inside of an Exp:

unboundVars :: Exp -> [Name]
unboundVars = everything (++) $
  mkQ [] $ \case
    UnboundVarE n -> [n]
    _ -> []

2020-03-06 11:22:39

you can also use extQ to glue multiple mkQs together, eg if you want to simultaneously target different types in your structure
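
A self-contained sketch of the everything / mkQ / extQ combination described above. To keep it runnable with only base, it defines local versions of the three combinators on top of Data.Data; they are meant to mirror what syb exports from Data.Generics, but treat them as assumptions of this sketch. The Block and Inline types are toy stand-ins for pandoc's, and allStrs and labels are hypothetical names.

```haskell
{-# LANGUAGE DeriveDataTypeable #-}
{-# LANGUAGE RankNTypes #-}

import Data.Data (Data, gmapQ)
import Data.Typeable (Typeable, cast)

-- Toy stand-ins for pandoc's types (assumption, much smaller than the real ones).
data Inline = Str String | Emph [Inline] | Link [Inline] String
  deriving (Show, Data)

data Block = Header Int [Inline] | Para [Inline] | BlockQuote [Block]
  deriving (Show, Data)

-- Local version of mkQ: run f if the node has f's argument type, else the default.
mkQ :: (Typeable a, Typeable b) => r -> (b -> r) -> a -> r
mkQ def f x = maybe def f (cast x)

-- Local version of extQ: try g on the node first, fall back to the query f.
extQ :: (Typeable a, Typeable b) => (a -> r) -> (b -> r) -> a -> r
extQ f g x = maybe (f x) g (cast x)

-- Local version of everything: query this node, then all children, top-down.
everything :: Data a => (r -> r -> r) -> (forall b. Data b => b -> r) -> a -> r
everything combine q x = foldl combine (q x) (gmapQ (everything combine q) x)

-- Collect the text of every Str node, at any depth.
allStrs :: [Block] -> [String]
allStrs = everything (++) (mkQ [] strOnly)
  where
    strOnly :: Inline -> [String]
    strOnly (Str s) = [s]
    strOnly _ = []

-- Target two different types in one pass by gluing queries with extQ.
labels :: [Block] -> [String]
labels = everything (++) (mkQ [] fromInline `extQ` fromBlock)
  where
    fromInline :: Inline -> [String]
    fromInline (Link _ url) = ["link:" ++ url]
    fromInline _ = []
    fromBlock :: Block -> [String]
    fromBlock (Header n _) = ["h" ++ show n]
    fromBlock _ = []
```

The traversal visits every node of the structure, including ones nested inside BlockQuote or Emph, which is the main thing it buys over matching on the top-level list.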

2020-03-06 11:23:34

maybe that helps?

Jan van Brügge

2020-03-06 11:24:32

I will try, thanks Sandy!

Jan van Brügge

2020-03-06 17:34:31

SYB was not really a solution to my problem, because the data is not nested very deep. I finally found something I am reasonably happy with, this disgusting beauty:

newtype ParseM s a = MkParseM (State s (Maybe a))

instance Functor (ParseM s) where
  fmap f (MkParseM x) = MkParseM (fmap (fmap f) x)

instance Applicative (ParseM s) where
  pure x = MkParseM $ pure $ Just x
  (MkParseM x) <*> (MkParseM y) = MkParseM $ x >>= \f -> y >>= \v -> pure $ f <*> v

instance Monad (ParseM s) where
  (MkParseM x) >>= f = MkParseM $
    x >>= \case
      Just y -> case (f y) of MkParseM r -> r
      Nothing -> pure Nothing

instance MonadFail (ParseM s) where
  fail _ = MkParseM (pure Nothing)

embed :: Maybe a -> ParseM s a
embed = MkParseM . pure

head' :: ParseM [Block] Block
head' = MkParseM $ do
  doc <- get
  case doc of
    x : xs -> put xs *> pure (Just x)
    _ -> pure Nothing

runParser :: ParseM s a -> s -> Maybe a
runParser (MkParseM x) = evalState x

parseRetro :: [Block] -> Maybe Int
parseRetro = runParser $ do
  Header 1 _ (stringify -> h1) <- head'
  n <- embed $ readMaybe . takeWhile isDigit . unpack =<< stripPrefix "My past " h1
  let past = MkPastMonths n
  pure $ n
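
A possible shortcut, sketched under assumptions: State s (Maybe a) is the same shape as MaybeT (State s) a from the transformers package, which already provides Functor, Applicative, Monad, and MonadFail instances. The sketch below uses a toy Block type with a pre-stringified header title instead of pandoc's, and String instead of Text. One difference to be aware of: MaybeT's <*> stops after a failed first action, whereas the hand-written Applicative instance above still runs the second action's state effects.

```haskell
import Control.Monad.Trans.Class (lift)
import Control.Monad.Trans.Maybe (MaybeT (..))
import Control.Monad.Trans.State (State, evalState, get, put)
import Data.Char (isDigit)
import Data.List (stripPrefix)
import Text.Read (readMaybe)

-- Toy stand-in for pandoc's Block (assumption).
data Block = Header Int String | Para String
  deriving (Show, Eq)

-- Same representation as the newtype: a state computation returning Maybe.
type ParseM s = MaybeT (State s)

-- Lift a plain Maybe into the parser without touching the state.
embed :: Maybe a -> ParseM s a
embed = MaybeT . pure

-- Pop the first element of the remaining input, failing on empty input.
head' :: ParseM [b] b
head' = do
  xs <- lift get
  case xs of
    x : rest -> lift (put rest) >> pure x
    [] -> embed Nothing

runParser :: ParseM s a -> s -> Maybe a
runParser p = evalState (runMaybeT p)

parseRetro :: [Block] -> Maybe Int
parseRetro = runParser $ do
  -- Refutable pattern: MaybeT's MonadFail instance turns a mismatch into Nothing.
  Header 1 h1 <- head'
  embed $ readMaybe . takeWhile isDigit =<< stripPrefix "My past " h1
```

Further head' calls in the same do block keep consuming the list, so the remainder never has to be threaded by hand.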

2020-03-06 17:56:06

@Jan van Brügge

parseRetro :: [Block] -> Maybe Int
parseRetro (Header 1 _ h1' : _)
  | (stringify -> stripPrefix "My past " -> Just h1) <- h1'
  = readMaybe $ takeWhile isDigit $ unpack h1
parseRetro _ = Nothing

Jan van Brügge

2020-03-06 18:05:09

@TheMatten this is just the first thing I need to parse, and threading through the remainder of the list would get old very quickly

Jan van Brügge

2020-03-06 18:05:58

with this State wrapper, I can just continue with partial pattern matches until I've exhausted the list

Jan van Brügge

2020-03-06 19:13:42

For example, I think this code reads rather nicely:

parseRetro :: [Block] -> Maybe [Project]
parseRetro = runParser $ do
  Header 1 _ (stringify -> h1) <- head'
  (n :: Int) <- embed $ readMaybe . takeWhile isDigit . unpack =<< stripPrefix "My past " h1
  projects <-
    takeWhileM
      ( \case
          Header 2 _ (stringify -> ("Project" `isPrefixOf`) -> True) -> Just parseProject
          _ -> Nothing
      )
  -- let past = MkPastMonths n projects
  pure projects

Jan van Brügge

2020-03-06 19:14:52

With takeWhileM being:

takeWhileM :: (a -> Maybe (Parser [a] b)) -> Parser [a] [b]
takeWhileM f = get' >>= \case
  [] -> pure []
  (x : xs) -> case f x of
    Nothing -> pure []
    Just p -> put' xs *> p >>= \b -> (:) b <$> takeWhileM f
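
The snippet relies on Parser, get' and put', which are not shown in the thread. A runnable sketch under the assumption that Parser s is MaybeT (State s) from transformers and that get' and put' are the lifted state operations; next and sections are hypothetical helpers added only to exercise takeWhileM on a toy input of strings.

```haskell
{-# LANGUAGE LambdaCase #-}

import Control.Monad.Trans.Class (lift)
import Control.Monad.Trans.Maybe (MaybeT (..))
import Control.Monad.Trans.State (State, evalState, get, put)

-- Assumed definitions for the names the snippet uses but does not define.
type Parser s = MaybeT (State s)

get' :: Parser s s
get' = lift get

put' :: s -> Parser s ()
put' = lift . put

runParser :: Parser s a -> s -> Maybe a
runParser p = evalState (runMaybeT p)

-- While f accepts the next element, consume it and run the parser f returns.
-- The first rejected element is left in the input.
takeWhileM :: (a -> Maybe (Parser [a] b)) -> Parser [a] [b]
takeWhileM f = get' >>= \case
  [] -> pure []
  (x : xs) -> case f x of
    Nothing -> pure []
    Just p -> put' xs *> p >>= \b -> (:) b <$> takeWhileM f

-- Hypothetical helper: consume one element, failing on empty input.
next :: Parser [a] a
next = get' >>= \case
  [] -> MaybeT (pure Nothing)
  (x : xs) -> put' xs *> pure x

-- Toy input: every "# title" line is followed by exactly one body line.
sections :: [String] -> Maybe [(String, String)]
sections = runParser $ takeWhileM $ \line -> case line of
  '#' : ' ' : title -> Just ((,) title <$> next)
  _ -> Nothing
```

Here sections ["# a", "body a", "# b", "body b", "tail"] collects both sections and stops at "tail", while a header with no body line makes the whole parse Nothing.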

Sridhar Ratnakumar

2020-03-06 19:15:35

I only skimmed this topic, but couldn't this be solved using Text.Pandoc.Walk (see query and walkM)?

example: https://github.com/srid/rib/blob/e41eae3/src/Rib/Parser/Pandoc.hs#L102-L109


Jan van Brügge

2020-03-06 19:16:56

The problem is not finding the stuff, but rather pulling the information out of it. AFAICT query can't help me there

Sridhar Ratnakumar

2020-03-06 19:17:41

That example actually pulls the image URL using query

Asad Saeeduddin

2020-03-06 22:56:51

@Jan van Brügge I can't exactly reconcile your snippet:

parseDoc :: [Block] -> Maybe MyData
parseDoc doc = do
  (Header 1 x1 : xs1) <- pure doc
  n <- readMaybe . takeWhile isDigit . unpack =<< stripPrefix "My last " (stringify x1)

with the partial maybe matching you're referring to

Asad Saeeduddin

2020-03-06 23:43:03

@Jan van Brügge is something like this what you're looking for:

  let
    focus :: Prism' (Maybe (a + String)) String
    focus = _Just . right' . predicate ((< 7) . length)

  print $ re focus `whittle`
    [ Just $ Right $ "Hello!"
    , Nothing
    , Just $ Left $ -1
    , Just $ Right $ "Whoops!"
    ]
  -- > ["Hello!"]

Asad Saeeduddin

2020-03-06 23:44:46

whittle :: Filterable f => Coprism s t a b -> f b -> f t
whittle p b = runJoker $ p $ Joker $ b

There's a trivial Filterable instance for any Alternative + Monad (although whether it's lawful depends on the specific monad and alternative instances)
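
A sketch of that "trivial" instance, written as standalone functions rather than an actual Filterable instance so it needs only base. The names mapMaybeAM, catMaybesAM and parseInts are hypothetical; the idea is to bind each element and replace rejected ones with empty.

```haskell
import Control.Applicative (Alternative, empty)
import Text.Read (readMaybe)

-- mapMaybe for any Alternative + Monad: rejected elements become empty.
-- Whether this behaves lawfully depends on the particular instances.
mapMaybeAM :: (Alternative f, Monad f) => (a -> Maybe b) -> f a -> f b
mapMaybeAM f fa = fa >>= maybe empty pure . f

-- catMaybes falls out as the special case f = id.
catMaybesAM :: (Alternative f, Monad f) => f (Maybe a) -> f a
catMaybesAM = mapMaybeAM id

-- Example at the list instance: keep only the strings that parse as Int.
parseInts :: [String] -> [Int]
parseInts = mapMaybeAM readMaybe
```

For lists this agrees with Data.Maybe.mapMaybe; for Maybe it collapses a rejected value to Nothing.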