Megaparsec basics

2020-09-26 11:15:05

I'm messing with megaparsec, trying to create a simple language. Right now I'm trying to lex

x = 1
y = 2

into

[ Name "x", Equals, Lit 1, NewLine, Name "y", Equals, Lit 2, NewLine]

for later parsing.

The problem that I'm running into, is that I'm confused about how to use L.space. I'll need to give L.space a parser which does not skip newlines, so that I can emit NewLine tokens. The docs recommend a setup where you define a lexeme that automatically skips trailing whitespace. I currently have

type Lexer = Parsec Void Text

-- SNIP

sc :: Lexer ()
sc = L.space space1
                   (L.skipLineComment "--")
                   (L.skipBlockComment "{-" "-}")

lexeme :: Lexer a -> Lexer a
lexeme = L.lexeme sc

symbol :: Text -> Lexer Text
symbol = L.symbol sc

So, I think I'll need to modify sc to not eat newlines by passing something other than space1 as the first argument to L.space.

The thing I can't figure out is how to deal with the fact that often, you will expect to find a newline immediately after a preceding lexeme, with no intervening whitespace. Since the argument to L.space can't accept empty input, sc can't accept empty input. How do I deal with this? Do I have to throw out the whole model of always consuming whitespace after a lexeme? Is making newline a lexeme the wrong approach? If so how do I handle parsing the separate bindings later without newline information?

bradrn

2020-09-26 12:38:39

As you have noted, your problem here is that space1 (and hence sc) also consumes newlines, as they are considered just another form of whitespace. Luckily, that’s easy enough to fix: you just need to change space1 to hspace1 (another predefined parser which doesn’t accept newlines). To parse a newline, you can just add Newline <$ eol at the appropriate place in your lexer.

(And also, by the way, the usual approach with megaparsec is to combine lexing and parsing into one step. megaparsec still has support for two stages of lexing+parsing, but it’s usually easier to use one step instead.)

James Sully

2020-09-26 12:56:31

That's actually kind of what I had. Couple of things:

I was trying to use hspace1, but I'm getting variable not found, even though I'm on megaparsec 8.0.0, and importing Text.Megaparsec.Char unqualified:

> stack ls dependencies|grep mega                                                                                                                1s 511ms
megaparsec 8.0.0

Not sure what's going on there...?

As a workaround, I tried defining

isHSpace :: Char -> Bool
isHSpace x = isSpace x && x /= '\n' && x /= '\r'

hspace1 = void $ some (satisfy isHSpace)

But I was getting "unexpected newline".

Let me change it back, and replicate the error...

James Sully

2020-09-26 13:00:15

Oh wait, it works this time. Not sure what I did differently. Thanks!

James Sully

2020-09-26 13:00:49

Still very confused that I can't access the predefined hspace1 from megaparsec

Georgi Lyubenov // googleson78

2020-09-26 13:02:33

James Sully said:

Still very confused that I can't access the predefined hspace1 from megaparsec

hspace1 is defined in 9.0.0, I can't find it in 8.0.0 in hackage - https://hackage.haskell.org/package/megaparsec-8.0.0

James Sully

2020-09-26 13:04:38

Ah, I didn't realize I was looking at the 9.0.0 docs. I think there's a typo, the 9.0.0 docs say it existed since 8.0.0:

hspace1 :: (MonadParsec e s m, Token s ~ Char) => m ()Source#

Like space1, but does not accept newlines and carriage returns.

Since: 8.0.0

Thanks!

@bradrn thanks for the advice re parsing and lexing simultaneously, I'll play around with that

bradrn

2020-09-26 13:20:14

You’re welcome for the advice! (And sorry for recommending a function from 9.0.0; I hadn’t realised hspace1 was new).

For parsing and lexing simultaneously, the basic idea is to define your parsers directly in terms of your lexers, rather than indirectly in terms of your lexeme data type. For instance, instead of defining parseEquality = Equality <$> satisfy isName <* single Equals <*> satisfy isLiteral, say, you might instead define parseEquality = Equality <$> parseName <* symbol "=" <*> parseLiteral, with parseName :: Parser String rather than parseName :: Parser Lexeme. (Admittedly I’m not completely sure how that would work with significant newlines — I’d say that manually adding (eol <|> eof) parsers at the appropriate places might work, so something like parseEquality = Equality <$> satisfy isName <* single Equals <*> satisfy isLiteral <* (eol <|> eof), but maybe just try it and see what works.

James Sully

2020-09-26 14:04:07

Here's what I have at the moment, seems to be working ok:

bradrn

2020-09-26 14:04:43

I can’t see the code…

James Sully

2020-09-26 14:05:02

module Lib where

import qualified Data.Text as T
import           Data.Text (Text)
import qualified Data.Map as M
import           Data.Map (Map)
import           Text.Megaparsec
import           Text.Megaparsec.Char
import qualified Text.Megaparsec.Char.Lexer as L
import Data.Void
import           Data.Char (isSpace)
import           Control.Monad (void)


type Parser = Parsec Void Text

type Args = [Text]

type Ctx = Map Text Expr

data Expr = Lit Int
          -- | Lam Args Expr
          | Var Text
          | Plus Expr Expr
  deriving (Eq, Show)

data Binding = Binding Text Expr
  deriving (Eq, Show)

data RuntimeErr = NameNotDefined Text
                | MainNotFound
  deriving (Eq, Show)

eval :: Ctx -> Expr -> Either RuntimeErr Int
eval ctx = \case
  Lit i  -> Right i
  Var n -> case M.lookup n ctx of
    Just expr -> eval' expr
    Nothing   -> Left $ NameNotDefined n
  Plus e1 e2  -> (+) <$> eval' e1 <*> eval' e2
  where
    eval' = eval ctx

executeProg :: [Binding] -> Either RuntimeErr Int
executeProg bs = case M.lookup "main" ctx of
    Just mainExpr -> eval ctx mainExpr
    Nothing       -> Left MainNotFound
  where ctx = M.fromList $ map (\(Binding name expr) -> (name, expr)) bs

expr = do
  choice
    [ Lit <$> integer
    , Var <$> name
    , Plus <$> expr <*> expr
    ]

data LangToken = Name Text
               | TokLit Int
               | Equals
               | RightArrow
               | NewLine

               deriving (Show, Eq)

isHSpace :: Char -> Bool
isHSpace x = isSpace x && x /= '\n' && x /= '\r'

hspace1 = void $ some (satisfy isHSpace)


-- space consumer
sc :: Parser ()
sc = L.space hspace1
             empty
             empty


-- vertical space consumer
vsc :: Parser ()
vsc = L.space space1
              (L.skipLineComment "--")
              (L.skipBlockComment "{-" "-}")


lexeme :: Parser a -> Parser a
lexeme = L.lexeme sc

symbol :: Text -> Parser Text
symbol = L.symbol sc

rightArrow :: Parser LangToken
rightArrow = RightArrow <$ symbol "->"

integer :: Parser Int
integer = lexeme $ L.decimal

equals :: Parser LangToken
equals = Equals <$ symbol "="

name :: Parser Text
name = T.pack <$> lexeme
  ((:) <$> letterChar <*> many alphaNumChar <?> "name")

binding :: Parser Binding
binding = do
  n <- name
  equals
  i <- integer --todo
  eol
  sc
  pure $ Binding n (Lit i)

nl :: Parser LangToken
nl = NewLine <$ eol

-- langToken :: Parser LangToken
-- langToken = choice
--   [ name
--   , integer
--   , equals
--   , rightArrow
--   , nl
--   ]

program :: Parser [Binding]
program = do
  bindings <- many $ vsc >> binding
  eof
  pure bindings