Untangling Haskell's Strings
Haskell is not without its faults. One of the most universally acknowledged annoyances, even for pros, is keeping track of the different string types. There are, in total, five different types representing strings in Haskell. Remember Haskell is strongly typed. So if we want to represent strings in different ways, we have to have different types for them. This motivates the need for these five types, all with slightly different use cases. It's not so bad when you're using any one of them. But when you constantly have to convert back and forth between them, it can be a major hassle. In this article we'll go over these five different types. We'll examine their different use cases, and observe how to convert between them.
Strings
The String type is the most basic form of representing strings in Haskell. It is a simple type synonym for a list of Unicode characters (the Char type). So whenever you see [Char] in your compile errors, know this refers to the basic String type. By default, when you enter a string literal in your Haskell code, the compiler infers it as a String.
myFirstString :: String
myFirstString = "Hello"
The list representation of strings gives us some useful behavior. Since a string is actually a list, we can use all kinds of familiar functions from Data.List.
>> let a = "Hello"
>> map Data.Char.toUpper a
"HELLO"
>> 'x' : a
"xHello"
>> a ++ " Person!"
"Hello Person!"
>> Data.List.sort "dbca"
"abcd"
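The main drawback of the vanilla string type is its inefficiency. A String is a linked list with one heap cell per character, and because values are immutable, every transformation builds a new list instead of updating in place. For instance suppose we have: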
import Data.Char (toLower)
import Data.List (sort)
myFirstString :: String
myFirstString = "Hello"
myModifiedString :: String
myModifiedString = map toLower (sort (myFirstString ++ " Person!"))
Conceptually, this builds several strings. The first is myFirstString. The second is the " Person!" literal. Appending them copies the cells of myFirstString and shares the second list. Sorting then produces another list, and lowercasing produces yet another. This constant allocation can make our code slow and unsuitable for heavy-duty text processing.
Text
The Text family of string types solves this dilemma. There are two Text types: strict and lazy. Most often, you will use the strict form. However, the lazy form is useful in certain circumstances where you know you won't need the full string.
The main advantage Text has is that its functions are subject to "fusion". This means the compiler can actually prevent the issue of multiple allocations we saw in the last example. For instance, if we look at this:
import qualified Data.Char as C
import qualified Data.Text as T
optimizedTextVersion :: T.Text
optimizedTextVersion = T.cons 'c' (T.map C.toLower (T.append (T.pack "Hello") (T.pack " Person!")))
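When fusion applies, this will only allocate a single Text object at runtime, which makes it substantially more efficient than the String version. Whether it applies depends on compiling with optimizations and on the version of the text library. Even without fusion, Text stores its characters in a packed array rather than a linked list. So for industrial use of heavy text processing, you are much better off using the Text type than the String type.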
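To make the lazy case concrete, here is a minimal sketch, assuming the text package is installed. The names endless and firstFive are made up for illustration. It builds an infinite lazy Text and takes only a prefix, which would never finish with the strict type.

```haskell
import qualified Data.Text.Lazy as TL

-- An infinite lazy Text: "ab" repeated forever.
-- (endless is a hypothetical name used only in this sketch.)
endless :: TL.Text
endless = TL.cycle (TL.pack "ab")

-- Only the first five characters are ever computed.
firstFive :: TL.Text
firstFive = TL.take 5 endless
```

Loading this in GHCi and evaluating firstFive should return right away, since the rest of endless is never forced.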
ByteString
The third family of types falls into the ByteString category. As with Text, there are strict and lazy variants of bytestrings. Lazy bytestrings are a bit more common than lazy text values though. Bytestrings are the lowest level representation of string data. It is the closest you can get to the real machine level interpretation of it. At their core, bytestrings are packed arrays of Word8 values. A Word8 is simply an 8-bit unsigned number: a raw byte, not a Unicode character. A single character may take several bytes, depending on the encoding.
Most networking libraries will use bytestrings, as they make the most sense for serialization. When you send information across platforms, you can't be sure about the encoding on the other end. If you store information in a database, you will often want to use bytestrings as well. Like Text types, they are generally far more efficient than strings.
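A small sketch can show the difference between characters and bytes. It assumes the text and bytestring packages, and borrows encodeUtf8 from Data.Text.Encoding, which comes up in the Conversion section. The names oneChar and itsBytes are invented for this example.

```haskell
import qualified Data.ByteString as BS
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE
import Data.Word (Word8)

-- A Text holding the single character U+00E9 (e with an acute accent).
-- (oneChar and itsBytes are hypothetical names for this sketch.)
oneChar :: T.Text
oneChar = T.pack "\233"

-- The same character encoded as UTF-8 takes two bytes: 0xC3 and 0xA9.
itsBytes :: [Word8]
itsBytes = BS.unpack (TE.encodeUtf8 oneChar)
```

Here T.length oneChar is 1, while itsBytes has two elements. The ByteString only knows about the bytes; what they mean as characters depends on the encoding you assume.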
Conversion
So with all these types floating around, the real problem is converting between them. It can be enormously frustrating when you want to write some basic code but you have a different string type than the one a function expects. You'll have to look up the conversion if you don't remember, and this can be annoying. Our first example will be with String and Text. This is quite straightforward. The Data.Text module exports these two functions, which do exactly what you want:
pack :: String -> Text
unpack :: Text -> String
There are equivalents in Data.Text.Lazy. We'll find similar functions for going between ByteString and String. They exist in the Data.ByteString.Char8 module:
pack :: String -> ByteString
unpack :: ByteString -> String
Note these only work with strict ByteStrings; Data.ByteString.Lazy.Char8 exports lazy equivalents. To convert between strict and lazy, you'll want functions from the .Lazy version of a text type. For instance, Data.Text.Lazy exports:
-- (Lazy to strict)
toStrict :: Data.Text.Lazy.Text -> Data.Text.Text
-- (Strict to lazy)
fromStrict :: Data.Text.Text -> Data.Text.Lazy.Text
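As a rough sketch of how these pieces chain together, assuming the text package, the helpers below go from String to lazy Text and back. The names toLazy, fromLazy and roundTrip are invented here and are not part of any library.

```haskell
import qualified Data.Text as T
import qualified Data.Text.Lazy as TL

-- String -> strict Text -> lazy Text
-- (toLazy, fromLazy and roundTrip are hypothetical helper names.)
toLazy :: String -> TL.Text
toLazy = TL.fromStrict . T.pack

-- lazy Text -> strict Text -> String
fromLazy :: TL.Text -> String
fromLazy = T.unpack . TL.toStrict

-- Going there and back again returns ordinary text unchanged.
roundTrip :: String -> String
roundTrip = fromLazy . toLazy
```

For ordinary text such as "Hello", roundTrip gives back the string you started with.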
There are equivalents in Data.ByteString.Lazy. The final conversion we'll go over is between Text and ByteString. You could use String as an intermediate type with the functions above. But the Char8 functions treat every character as a single byte and truncate any code point above 255, so non-Latin-1 text is silently corrupted. Going from Text to ByteString is straightforward, assuming you know your data format. The following functions exist in Data.Text.Encoding:
encodeUtf8 :: Text -> ByteString
-- LE = Little Endian format, BE = Big Endian
encodeUtf16LE :: Text -> ByteString
encodeUtf16BE :: Text -> ByteString
encodeUtf32LE :: Text -> ByteString
encodeUtf32BE :: Text -> ByteString
In general, you'll use UTF-8 encoded text and thus encodeUtf8. Decoding is a little more complicated. Simple functions exist in this same module:
decodeUtf8 :: ByteString -> Text
decodeUtf16LE :: ByteString -> Text
decodeUtf16BE :: ByteString -> Text
decodeUtf32LE :: ByteString -> Text
decodeUtf32BE :: ByteString -> Text
But these can throw errors if your bytestring does not match the format. Run-time exceptions are bad, so for UTF8, we have this function:
decodeUtf8' :: ByteString -> Either UnicodeException Text
This lets us receive the result in an Either and handle possible errors. For the other formats, we have to rely on functions like:
decodeUtf16LEWith :: OnDecodeError -> ByteString -> Text
Here OnDecodeError is a handler function that decides what to do with invalid input. These functions can be particularly cumbersome and difficult to deal with. Luckily, you'll most often be using UTF-8.
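Here is a minimal sketch of both decoding styles for UTF-8, assuming the text and bytestring packages. The helper names safeDecode and lossyDecode are made up for illustration. The second one uses decodeUtf8With, the UTF-8 sibling of decodeUtf16LEWith, together with the lenientDecode handler from Data.Text.Encoding.Error.

```haskell
import qualified Data.ByteString as BS
import qualified Data.Text as T
import Data.Text.Encoding (decodeUtf8', decodeUtf8With)
import Data.Text.Encoding.Error (lenientDecode)

-- Report bad input as a Left value instead of throwing an exception.
-- (safeDecode and lossyDecode are hypothetical helper names.)
safeDecode :: BS.ByteString -> Either String T.Text
safeDecode bytes =
  case decodeUtf8' bytes of
    Left err   -> Left ("invalid UTF-8: " ++ show err)
    Right text -> Right text

-- Substitute the replacement character U+FFFD for invalid input
-- rather than failing.
lossyDecode :: BS.ByteString -> T.Text
lossyDecode = decodeUtf8With lenientDecode
```

Which one you want depends on the data. safeDecode suits input you need to validate, while lossyDecode suits logs or display text where a few replacement characters are acceptable.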
OverloadedStrings
We haven't touched much on language extensions yet in these articles, but here's our first real example of one. It's intended to show you language extensions aren't particularly scary! As we saw earlier, Haskell will in general interpret string literals in your code as the String type. This means you are unable to have the following code:
-- Fails
myText :: Text
myText = "Hello"
myBytestring :: ByteString
myBytestring = "Hello"
The compiler treats both literals as String values, which do not match the types you gave. So this will normally cause compile errors. However, with the OverloadedStrings extension, you can fix this! Extensions are enabled with pragmas like {-# LANGUAGE … #-}. They are generally added at the top of your source file.
{-# LANGUAGE OverloadedStrings #-}
-- This works!
myText :: Text
myText = "Hello"
myBytestring :: ByteString
myBytestring = "Hello"
In fact, for any type you make, you can create an instance of the IsString typeclass. This will allow you to use string literals to represent it.
{-# LANGUAGE OverloadedStrings #-}
-- Import unqualified so IsString and fromString are in scope by name
import Data.String (IsString(..))
data MyType = MyType String
instance IsString MyType where
  fromString s = MyType s
myTypeAsString :: MyType
myTypeAsString = "Hello"
You can also enable this extension within GHCi. You need to use the command :set -XOverloadedStrings.
Summary
Haskell uses 5 different types for representing strings. Two of these are lazy versions. The String type is a type synonym for a list of characters, and is generally inefficient. Text represents strings somewhat differently, and can fuse operations together for efficiency. ByteString is a low level representation most suited to serialization. There are a lot of ways to convert between types, and it's hard to keep them straight. Finally, the OverloadedStrings extension can make your life easier. It allows you to use a string literal to refer to any of your different string types.
If you haven't had a chance to get started with Haskell yet, you should get our Getting Started Checklist. It will guide you through installation and some basics!
Be sure to check back next week! Now that we understand strings, we'll dive into another potentially thorny system of different types. We'll investigate the different numeric types and the simple conversions we can run between them.