Untangling Haskell's Strings

May 15

Haskell is not without its faults. One of the most universally acknowledged annoyances, even for pros, is keeping track of the different string types. There are, in total, five different types representing strings in Haskell. Remember Haskell is strongly typed. So if we want to represent strings in different ways, we have to have different types for them. This motivates the need for these five types, all with slightly different use cases. It's not so bad when you're using any one of them. But when you constantly have to convert back and forth between them, it can be a major hassle. In this article we'll go over these five different types. We'll examine their different use cases, and observe how to convert between them.

Strings

The String type is the most basic form of representing strings in Haskell. It is a simple type synonym for a list of unicode characters (the Char type). So whenever you see [Char] in your compile errors, know this refers to the basic String type. By default, when you enter in a string literal in your Haskell code, the compiler infers it as a String.

myFirstString :: String
myFirstString = Hello

The list representation of strings gives us some useful behavior. Since a string is actually a list, we can use all kinds of familiar functions from Data.List.

>> let a = "Hello"
>> map Data.Char.toUpper a
"HELLO"
>> 'x' : a
"xHello"
>> a ++ " Person!"
"Hello Person!"
>> Data.List.sort "dbca"
"abcd"

The main drawback of the vanilla string type is its inefficiency. This comes as a consequence of immutability. For instance suppose we have:

myFirstString :: String
myFirstString = "Hello"

myModifiedString :: String
myModifiedString = map toLower (sort (myFirstString ++ " Person!"))

This will allocate a total of 4 strings. The first is myFirstString. The second is the " Person!" literal. We can then append these without making another string. But then the sorted version will be a third allocation, and the fourth will be the lowercased version. This constant allocation can make our code non-performant and inappropriate for heavy duty operations.

Text

The Text family of string types solves this dilemma. There are two Text types: strict and lazy. Most often, you will use the strict form. However, the lazy form is useful in certain circumstances where you know you won't need the full string.

The main advantage Text has is that its functions are subject to "fusion". This means the compiler can actually prevent the issue of multiple allocations we saw in the last example. For instance, if we look at this:

import qualified Data.Char as C
import qualified Data.Text as T

optimizedTextVersion :: T.Text
optimizedTextVersion = T.cons 'c' (T.map C.toLower (T.append (T.Text "Hello ") (T.Text " Person!")))

This will only actually allocate a single Text object at runtime. This will make it it substantially more efficient than the String version. So for industrial use of heavy text processing, you are much better off using the Text type than the String type.

ByteString

The third family of types fall into the ByteString category. As with Text, there are strict and lazy variants of bytestrings. Lazy bytestrings are a bit more common than lazy text values though. Bytestrings are the lowest level representation of the characters. It is the closest you can get to the real machine level interpretation of them. At their core, bytestrings are a list of Word8 objects. A Word8 is simply an 8-bit number representing a unicode character.

Most networking libraries will use bytestrings, as they make the most sense for serialization. When you send information across platforms, you can't be sure about the encoding on the other end. If you store information in a database, you will often want to use bytestrings as well. Like Text types, they are generally far more efficient than strings.

Conversion

So with all these types floating around, the real problem is converting between them. It can be enormously frustrating when you want to write some basic code but you have a different string type. You'll have to look up the conversion if you don't remember, and this can be annoying. Our first example will be with String and Text. This is quite straightforward. The Data.Text package exports these two functions, which do exactly what you want:

pack :: String -> Text
unpack :: Text -> String

There are equivalents in Data.Text.Lazy. We'll find similar functions for going between ByteStrings and Strings. They exist in the Data.ByteString.Char8 package:

pack :: String -> ByteString
unpack :: ByteString -> String

Note these only work with strict ByteStrings. To convert between strict and lazy, you'll want functions from the .Lazy version of a text type. For instance, Data.Text.Lazy exports:

-- (Lazy to strict)
toStrict :: Data.Text.Lazy.Text -> Data.Text.Text

-- (Strict to lazy)
fromStrict :: Data.Text.Text -> Data.Text.Lazy.Text

There are equivalents in Data.ByteString.Lazy. The final conversion we'll go over is between Text and ByteString. You could use String as an intermediate type with the functions above. But this makes certain assumptions about the encoding and is subject to failure. Going from Text to ByteString is straightforward, assuming you know your data format. The following functions exist in Data.Text.Encoding:

encodeUtf8 :: Text -> ByteString

-- LE = Little Endian format, BE = Big Endian

encodeUtf16LE :: Text -> ByteString
encodeUtf16BE :: Text -> ByteString
encodeUtf32LE :: Text -> ByteString
encodeUtf32BE :: Text -> ByteString

In general, you'll use UTF8 encoded text and thus encodeUtf8. Decoding is a little more complicated. Simple functions exist in this same library:

decodeUtf8 :: ByteString -> Text
decodeUtf16LE :: ByteString -> Text
decodeUtf16BE :: ByteString -> Text
decodeUtf32LE :: ByteString -> Text
decodeUtf32BE :: ByteString -> Text

But these can throw errors if your bytestring does not match the format. Run-time exceptions are bad, so for UTF8, we have this function:

decodeUtf8' :: ByteString -> Either UnicodeException Text

Which let's us wrap this in an Either and handle possible errors. For the other formats, we have to rely on functions like:

decodeUtf16LEWith :: OnDecodeError -> ByteString -> Text

Where OnDecodeError is a specific type of handler. These functions can be particularly cumbersome and difficult to deal with. Luckily, you'll most often be using UTF8.

OverloadedStrings

So we haven't touched too much on language extensions yet in my articles. But here's our first real example of one. It's intended to show you language extensions aren't particularly scary! As we saw earlier, Haskell will in general interpret string literals in your code as the String type. This means you are unable to have the following code:

-- Fails
myText :: Text
myText = "Hello"

myBytestring :: ByteString
myBytestring = "Hello"

The compiler expects both of these values to be of String type, and not the types you gave. So this will normally throw compiler errors. However, with the OverloadedStrings extension, you can fix this! Extensions use tags like {-# LANGUAGE … #-}. They are generally added at the top of your source file.

{-# LANGUAGE OverloadedStrings #-}
-- This works!
myText :: Text
myText = "Hello"

myBytestring :: ByteString
myBytestring = "Hello"

In fact, for any type you make, you can create an instance of the IsString typeclass. This will allow you to use string literals to represent it.

{-# LANGUAGE OverloadedStrings #-}

import qualified Data.String (IsString(..))

data MyType = MyType String

instance IsString MyType where
  fromString s = MyType s

myTypeAsString :: MyType
myTypeAsString = "Hello"

You can also enable this extension within GHCI. You need to use the command :set -XOverloadedStrings.

Summary

Haskell uses 5 different types for representing strings. Two of these are lazy versions. The String type is a type synonym for a list of characters, and is generally inefficient. Text represents strings somewhat differently, and can fuse operations together for efficiency. ByteString is a low level representation most suited to serialization. There are a lot of ways to convert between types, and it's hard to keep them straight. Finally, the OverloadedStrings compiler can make your life easier. It allows you to use a string literal refer to any of your different string types.

If you haven't had a chance to get started with Haskell yet, you should get our Getting Started Checklist. It will guide you through installation and some basics!

Be sure to check back next week! Now that we understand strings, we'll divide into another potentially thorny system of different types. We'll investigate the different numeric types and the simple conversions we can run between them.

Beginners

James Bowen