Sunday, December 27, 2009

Generating Text in Factor

Some months back, I came across a few blog posts about generating text using algorithms. A simple algorithm was implemented in Clojure and Haskell.

Basically, the idea is to:

  1. Read in a text document.
  2. Count the frequency of word pairs in the document.
  3. Pick a starting word.
  4. Generate random text, using the word pair probabilities.

I wanted to see what a simple Factor implementation would look like, and thought I would share one below.

First, create a list of words from some lines of text. In Factor, it is easy to write a processing word that can then be used from standard input, files, or other input streams.

: split-words ( -- sequence )
    V{ } clone [
        "\r\n\t.,\" " split [ over push ] each
    ] each-line harvest ;

Next, create a sequence of all word pairs (including the pair linking the tail to the head of the list) from the input sequence.

: word-pairs ( sequence -- sequence' )
    dup 1 head-slice append
    dup 1 tail-slice zip ;

Next, we need a map of word pairs, for random sampling by frequency of occurrence. I have chosen to use words in the text as keys, and have the values be a sequence of all words that ever followed it in the text.

: word-map ( sequence -- assoc )
    [ H{ } clone ] dip word-pairs [
        [ second ] [ first ] bi pick push-at
    ] each ;

This makes sampling words with the proper probability quite simple:

: next-word ( word assoc -- word' )
    at random ;

We can now generate paragraphs of random text with the "feel" of the original document.

: wordgen-from ( start n -- string )
    [ [ 1vector ] keep split-words word-map ] dip
    [ [ next-word dup pick push ] keep ] times
    2drop " " join ;

And for convenience, as in the original blog posts, we can start generating from a common English word.

: wordgen ( n -- string )
    "the" swap wordgen-from ;

Putting all of this together, we can make 250 words from Dracula:

( scratchpad ) "/tmp/345.txt" ascii [ 250 wordgen ] with-file-reader .
"the bank notes of electronic works based on the same time to you are fading and it came I too much Without taking a quick he had struck by our eyes and she was not asleep twice when she sank down again There were striving to think yet when the Huns This to get to her all at her on it really Lucy's father Goodbye my own time when she look at the facts before them away and his diary which is out Look! Look! Look! The bright as ever since then we had said Dr Van Helsing will be a chair with you said Yes! And it was a pistol shot up to write for a woman can love and added Friend John and pain nor a long enough You were so seeing it high tide altogether I awoke She is confined within a room lit by the old sweet that the temptation of lion-like disdain His cries are great love must go there would tear my eyes the colour of him and quiet; but still Renfield sitting on one ready Madam Mina Harker had retired early Lucy and there ran downstairs then we get out his fair accuracy when I found that my dear do our visit to oppose him can look had a measure of the other since I secured and seemed to tell upon your all our furs and the face I put in his eyes never be This afternoon she was that such horrors when the"

Not quite Bram Stoker, but just enough to taste the flavor. More complex algorithms exist and can be used for fun and profit. Similar ideas can even be applied to other areas, such as music.

2 comments:

Slava Pestov said...

You can get rid of the 'over' and 'pick' usages by using fry syntax to insert values into the quotation you pass to combinators, rather than having to manually reach for it on each iteration.

: word-map ( sequence -- assoc )
word-pairs H{ } clone [ '[ [ second ] [ first ] bi _ push-at ] each ] keep ;

Also, 'make' can simplify split-words further,

: split-words ( -- seq ) [ [ "\r\n\t.,\" " split % ] each-line ] { } make harvest ;

Have fun.

mrjbq7 said...

Thanks!

It's always great to get your feedback. I'm enjoying using Factor, and continuing to build small and large programs is helping give me a sense of Factor style (I hope).