The Structure of an MCAP File

Last week we started learning about ROS2 and the MCAP data format. Using Megaparsec, we wrote some parsers for a few primitive data types as they exist in MCAP. In this article we’re going to write a parser that will show us the basic structure of our MCAP file. Next time, we’ll use this structure to actually read information on the topics we wrote to our bag!

In our Solve.hs course, you’ll learn all kinds of parsing techniques, from basic approaches to regular expressions to all the ins and outs of Megaparsec! Check it out today!

MCAP File Structure

To understand the information in an MCAP file, we need to understand the file’s structure. Here’s how it’s defined in the official specification.

<Magic><Header><Data section>[<Summary section>][<Summary Offset section>]<Footer><Magic>

We won’t deal with all the complexity here. The first part to understand is the Magic bytes that appear at the beginning and end of the file. This is literally just a hardcoded sequence that contains the letters MCAP and the major version number, which in our case is still 0 (ASCII code 48, hex code 0x30). So we can write a simple parser for this sequence:

import Control.Monad (forM_, void)
import qualified Data.ByteString.Lazy as BS
import Text.Megaparsec

-- Magic bytes: 0x89, the ASCII letters 'M' 'C' 'A' 'P', the version
-- character '0' (48), and a carriage return / newline pair.
parseMagic :: Parser ()
parseMagic = void $ chunk $ BS.pack [137, 77, 67, 65, 80, 48, 13, 10]

There are several other sections: the header, the data section, the summary section, the summary offset section, and the footer. But all of them consist of Records, and all records have the same basic structure for us to parse. So for our next step, let’s understand the different types of records we can have in an MCAP file.

Record Types

There are 15 types of records, so let’s list them out:

  1. Header
  2. Footer
  3. Schema
  4. Channel
  5. Message
  6. Chunk
  7. Message Index
  8. Chunk Index
  9. Attachment
  10. Attachment Index
  11. Statistics
  12. Metadata
  13. Metadata Index
  14. Summary Offset
  15. Data End

Each section of the file only permits certain kinds of records. For example, each file has only a single header and footer record, at the start and end of the file, respectively. Our topic data generally lives in Message records, which are often nested in Chunk records inside the “data” section of the file. The Index record types exist to help readers navigate the bag file quickly.

Again, we’ll skip the details of most of these record types, and we won’t validate that each record appears in the right section of the file. Our concrete goal for this article is to write a parser that prints out the type of each record in the bag, so we can understand the structure better.

To facilitate this, let’s write a data type to capture these possibilities:

data RecordType =
  FileHeader |
  FileFooter |
  Schema |
  Channel |
  Message |
  Chunk |
  MessageIndex |
  ChunkIndex |
  Attachment |
  AttachmentIndex |
  Statistics |
  Metadata |
  MetadataIndex |
  SummaryOffset |
  DataEnd
  deriving (Show, Eq, Enum)
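
One note on the derived Enum instance: the spec assigns these record types the opcodes 0x01 through 0x0F, while toEnum starts counting at 0. So converting an opcode to a constructor means subtracting 1, as this small hypothetical helper illustrates (it’s not part of our final code; our parser below inlines the same logic):

import Data.Word (Word8)

-- Hypothetical helper: map a spec opcode (1-15) to our enum.
recordTypeFromOpcode :: Word8 -> Maybe RecordType
recordTypeFromOpcode op
  | op >= 1 && op <= 15 = Just (toEnum (fromIntegral op - 1))
  | otherwise = Nothing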

Parsing Records

As mentioned above, all records have the same structure to parse. From the spec, we see:

<record type><record content length><record>

The record type is a single byte, whose value corresponds to the numbered list above. The content length is a little-endian 64-bit integer. Then the record content will always have a number of bytes equal to the content length.

Knowing this, we can start to write a general implementation for a record parser. Let’s start with parsing the record type:

parseRecordType :: Parser (Word64, RecordType)
parseRecordType = do
  v <- anySingle
  guard' ("Invalid record type: " <> show v) (v > 0 && v <= 15)
  -- Opcodes run from 1 to 15, but toEnum starts counting at 0.
  let typ = toEnum (fromIntegral v - 1)
  return (1, typ)

We use guard' from the previous article to reject any byte values that don’t correspond to a valid enum case. In case you don’t have that definition handy, here’s a minimal sketch of what such a helper might look like (fail aborts the parse with our message):
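
guard' :: String -> Bool -> Parser ()
guard' _ True = return ()
guard' msg False = fail msg

Now let’s define a Record type to hold the result. We’ll expand this type as we go through this series: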

data Record = Record RecordType
  deriving (Show, Eq)

Since the high-level structure of a record is so simple, we can write a parser for all records, as long as we ignore the actual data and just focus on the record type.

parseSingleRecord :: Parser (Word64, Record)
parseSingleRecord = do
  (typLen, typ) <- parseRecordType
  (rclLen, recordContentLength) <- parseUint64LE
  -- Skip the record content entirely for now.
  void $ count (fromIntegral recordContentLength) anySingle
  return (typLen + rclLen + recordContentLength, Record typ)

We get the type, then the content length, and finally just count out the bytes based on the content length.

A Basic Executable

Even with the small amount of code we’ve written so far, we can already print out the basic structure of a file! Here’s all we need to pull this together. First, we write a parser for the whole file. It starts and ends with our “magic” parser, and in between it just parses a bunch of records.

parseMcapFile :: Parser [(Word64, Record)]
parseMcapFile = do
  parseMagic
  recs <- many (try parseSingleRecord)
  parseMagic
  return recs

The try wrapper lets many stop cleanly when we reach the closing magic bytes, since the first magic byte (137) is not a valid record type. Now our executable program reads a file and runs our parser!

printRec :: Record -> IO ()
printRec (Record typ) = print typ

printRecordTypesFromFile :: FilePath -> IO ()
printRecordTypesFromFile fp = do
  input <- BS.readFile fp
  result <- runParserT parseMcapFile fp input
  case result of
    Left e -> print e
    Right recs -> forM_ recs $ \(_, rec) -> printRec rec
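
To run it, we just need a main that points at our bag file (the path here is hypothetical):

main :: IO ()
main = printRecordTypesFromFile "data/simple.mcap"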

On a very simple file I created, with two instances of the Simple message from last week, the record type structure looks like this:

FileHeader
Metadata
Chunk
MessageIndex
Metadata
DataEnd
Schema
Channel
Statistics
ChunkIndex
MetadataIndex
MetadataIndex
SummaryOffset
SummaryOffset
SummaryOffset
SummaryOffset
SummaryOffset
FileFooter

However, this still doesn’t quite tell the whole story. There are no Message records in this list, because they’re actually nested inside the single Chunk record. So before we wrap up this article, let’s understand a little bit about parsing record data. This will help us fill in the remaining structure.

Parsing Record Data

Each record has different rules dictating what kind of data it has. For example, the Footer record is defined like this:

Bytes | Name | Type
8 | summary_start | uint64
8 | summary_offset_start | uint64
4 | summary_crc | uint32

It’s just 3 primitive fields, so it’s easy for us to write a parser for it:

parseFooterData :: Parser (Word64, (Word64, Word64, Word32))
parseFooterData = do
  -- Assume we’ve already parsed record type and content length
  (n1, summaryStart) <- parseUint64LE
  (n2, summaryOffsetStart) <- parseUint64LE
  (n3, summaryCrc) <- parseUint32LE
  return (n1 + n2 + n3, (summaryStart, summaryOffsetStart, summaryCrc))

But we’re mainly interested in the Chunk record type, which has a somewhat more complicated structure:

Bytes | Name | Type
8 | message_start_time | Timestamp
8 | message_end_time | Timestamp
8 | uncompressed_size | uint64
4 | uncompressed_crc | uint32
4 + N | compression | String
8 + N | records | uint64 length-prefixed Bytes

So we have to parse several fields, followed by what is effectively an array of records. Since we’re parsing sub-records, this is recursive in theory. In practice though, a chunk can never contain another chunk, so the recursion is bounded.

Several of these fields are concerned with compression, because individual chunks can be compressed rather than the whole file (the spec’s well-known compression options are lz4 and zstd, with an empty string meaning no compression). But for our experiment, we will assume no compression.

Writing a Chunk Parser

Before we get started with the parsing code, let’s expand our Record type to give chunks a special case. This isn’t the best, most general way to write this type, but it’s sufficient for our purposes right now.

data Record =
  Record RecordType |
  ChunkRecord [Record]
  deriving (Show, Eq)

We’ll also extend our printRec function so that, for a chunk record, it will print out the sub-record types of the chunk in an indented fashion.

printRec :: Record -> IO ()
printRec (ChunkRecord recs) = do
  putStrLn "Chunk"
  forM_ recs $ \rec -> do
    putStr "  "
    printRec rec
printRec (Record typ) = print typ

Now let’s write the base of our chunk parsing function. We’ll parse the primitive fields, add up the “lengths” to get our total data size, and put some validation on the lack of compression. We want the “compression algorithm” string to be empty, and for the “compressed” and “uncompressed” lengths to match:

parseChunk :: Parser (Word64, Record)
parseChunk = do
  (n1, startTime) <- parseTimestamp
  (n2, endTime) <- parseTimestamp
  (n3, uncompressedLen) <- parseUint64LE
  (n4, uncompressedCrc) <- parseUint32LE
  (n5, compressionAlg) <- parseString
  (n6, compressedLen) <- parseUint64LE
  let totalLen = n1 + n2 + n3 + n4 + n5 + n6 + compressedLen
  guard' ("Compression not supported yet!") (BS.null compressionAlg && compressedLen == uncompressedLen)
  ...

What follows is very similar to the array parsing from last time. The only reason we can’t just call parseArray with parseSingleRecord as our element parser is that arrays use a 32-bit integer for their byte length, while a chunk record uses a 64-bit integer. This could be fixed with better polymorphism.
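
As a quick illustration, here’s one sketch of what that polymorphism might look like. Note that parseLengthPrefixed is a hypothetical helper, not something we defined in the earlier articles; it takes the length-prefix parser as an argument so 32-bit arrays and 64-bit chunk bodies could share one loop:

-- Hypothetical: abstract the element loop over a byte-length prefix.
parseLengthPrefixed
  :: Parser (Word64, Word64) -- parser for the byte-length prefix
  -> Parser (Word64, a)      -- parser for a single element
  -> Parser (Word64, [a])
parseLengthPrefixed parseLen parseElem = do
  (prefixLen, numBytes) <- parseLen
  let go 0 acc = return (reverse acc)
      go remaining acc = do
        (elemLen, el) <- parseElem
        guard' "Element overran length prefix" (elemLen <= remaining)
        go (remaining - elemLen) (el : acc)
  elems <- go numBytes []
  return (prefixLen + numBytes, elems)

For now though, we’ll just write that loop by hand inside parseChunk: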

parseChunk :: Parser (Word64, Record)
parseChunk = do
  (n1, startTime) <- parseTimestamp
  (n2, endTime) <- parseTimestamp
  (n3, uncompressedLen) <- parseUint64LE
  (n4, uncompressedCrc) <- parseUint32LE
  (n5, compressionAlg) <- parseString
  (n6, compressedLen) <- parseUint64LE
  let totalLen = n1 + n2 + n3 + n4 + n5 + n6 + compressedLen
  guard' "Compression not supported yet!" (BS.null compressionAlg && compressedLen == uncompressedLen)
  recs <- f uncompressedLen []
  return (totalLen, ChunkRecord recs)
  where
    -- Parse sub-records until we've consumed exactly the expected bytes.
    f 0 prev = return $ reverse prev
    f remaining prev = do
      (recLen, rec) <- parseSingleRecord
      guard' ("Chunk record is too long: " <> show recLen <> " " <> show remaining) (recLen <= remaining)
      f (remaining - recLen) (rec : prev)

We also have to change parseSingleRecord though. We want a special case when the record type is Chunk. In this case we’ll call the function we just wrote. Otherwise we stick with what we had before:

parseSingleRecord :: Parser (Word64, Record)
parseSingleRecord = do
  (typLen, typ) <- parseRecordType
  (rclLen, recordContentLength) <- parseUint64LE
  let totalLen = typLen + rclLen + recordContentLength
  record <- if typ == Chunk
    then do
      (parsedChunkLength, rec) <- parseChunk
      guard' ("Parsed chunk length does not match: " <> show parsedChunkLength <> " " <> show recordContentLength) (parsedChunkLength == recordContentLength)
      return rec
    else do
      -- Non-chunk records: still skip the content.
      void $ count (fromIntegral recordContentLength) anySingle
      return $ Record typ
  return (totalLen, record)

And now we can see the sub-chunk record types in our print-out!

FileHeader
Metadata
Chunk
  Schema
  Channel
  Message
  Message
MessageIndex
Metadata
DataEnd
Schema
Channel
Statistics
ChunkIndex
MetadataIndex
MetadataIndex
SummaryOffset
SummaryOffset
SummaryOffset
SummaryOffset
SummaryOffset
FileFooter

Conclusion

We can see in the readout above that our chunk consists of a Schema record, a Channel record, and 2 Message records. In the next article, we’ll learn how to parse those so that we can finally extract our messages!

In the meantime, if you want more details on Megaparsec you can take a look at our course Solve.hs! In Module 4 you’ll learn all kinds of tricks for parsing in Haskell!
