CS 480 Python Assignment 3

Assigned: 8 April 2009   Due: 24 April 2009

In this assignment you will experiment with and extend some existing Python code to do stylometry. Stylometry is the statistical analysis of a written text to try to identify characteristics of the author's writing style and (perhaps) to identify the authorship of anonymous or disputed works.
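
As a rough illustration of what "characteristics of the author's writing style" can mean in practice, here is a minimal sketch that measures a few simple properties of a text. It is not taken from mytext.py. The function name style_features, the particular measurements, and the crude rules for splitting words and sentences are all assumptions made for this example.

```python
import re


def style_features(text):
    """Return a few simple style measurements for a string of English text.

    Hypothetical helper for illustration only; it is not part of mytext.py.
    """
    # Words: runs of letters, allowing internal apostrophes (e.g. "don't").
    words = re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())
    if not words:
        raise ValueError("text contains no words")
    # Sentences: split after '.', '!' or '?' followed by whitespace.
    # This is crude; abbreviations such as "Mr. Darcy" are split incorrectly.
    sentences = [s for s in re.split(r"[.!?]+\s+", text.strip()) if s]
    return {
        "words": len(words),
        "avg_word_length": sum(len(w) for w in words) / len(words),
        "avg_sentence_length": len(words) / len(sentences),
        # Distinct words divided by total words: a rough vocabulary measure.
        "type_token_ratio": len(set(words)) / len(words),
    }


if __name__ == "__main__":
    sample = "The cat sat. The dog ran away!"
    for name, value in style_features(sample).items():
        print(name, round(value, 3))
```

Measurements like the type/token ratio depend heavily on the length of the text, so they are only comparable between samples of similar size.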

You will start with my code in the file mytext.py, which is based on the text.py file from the textbook authors. This code in turn requires the files agents.py, logic.py, probability.py, search.py and utils.py. I suggest that you download all of these files to a new directory. Into this same directory you can also download the text files that you will explore: bleakhouse.txt, greatexpectations.txt and littledorrit.txt by Charles Dickens, and persuasion.txt, pride.txt (aka Pride and Prejudice) and sense.txt (aka Sense and Sensibility) by Jane Austen. You might also want flatland.txt, which you need to run my existing code, and unknown.txt, a text whose authorship you will try to identify.

After downloading all of these files and "compiling" mytext.py, look at the function mycode() toward the bottom of that file. It contains my reasonably well-documented code that runs various stylometric analyses of the texts Flatland, Sense and Sensibility, and Pride and Prejudice. You can read it as my attempt to show that Pride is more likely to have been written by the author of Sense than by the author of Flatland. We will discuss this code more in class.

Your assignment is to do stylometric analyses to compare the works of Dickens and Austen, then try to identify the author of the unknown text. You should write up a short paper that describes (in English, perhaps laced with code) the tests that you ran and the conclusion that you reached. I hope you will be able to extend the tests that I used and come up with more of your own. You should present your data in readable, easily interpretable ways. You are welcome to limit your experiments to a subset of the data; for example, you don't have to analyze all six known works carefully.
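
One possible kind of test, sketched below under stated assumptions, compares how often each text uses a handful of very common "function words" and reports which known author's profile is closest to the unknown text. This is not the method used in mytext.py. The word list FUNCTION_WORDS, the function names, and the sum-of-absolute-differences distance are all hypothetical choices made for this example; picking better ones is part of the experiment.

```python
import re
from collections import Counter

# Hypothetical word list chosen for illustration; not taken from mytext.py.
FUNCTION_WORDS = ["the", "and", "of", "to", "a", "in", "that", "it", "was", "her"]


def profile(text, vocabulary=FUNCTION_WORDS):
    """Return the rate per 1000 words of each vocabulary word in text."""
    # Crude tokenizer: runs of letters only, so "don't" becomes "don" and "t".
    words = re.findall(r"[a-z]+", text.lower())
    if not words:
        raise ValueError("text contains no words")
    counts = Counter(words)
    return [1000.0 * counts[w] / len(words) for w in vocabulary]


def distance(p, q):
    """Sum of absolute differences between two profiles (smaller = more alike)."""
    return sum(abs(a - b) for a, b in zip(p, q))


def closest_author(unknown_text, known_texts):
    """known_texts maps an author label to a string of text by that author.

    Returns the label with the smallest distance, plus all the distances.
    """
    target = profile(unknown_text)
    scores = {author: distance(target, profile(text))
              for author, text in known_texts.items()}
    return min(scores, key=scores.get), scores


if __name__ == "__main__":
    # Toy strings stand in for whole novels here.
    known = {"A": "the the the cat and the dog", "B": "of of of to to a cat"}
    best, scores = closest_author("the bird and the the tree", known)
    print(best)
    for author in sorted(scores):
        print(author, round(scores[author], 1))
```

To try this on the real data you would read each file into a string (for example, the contents of pride.txt) and pass those strings in place of the toy examples. A single smallest distance is weak evidence on its own, so it is worth checking that works by the same author are also closer to each other than to the other author's works.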

You should hand in a paper copy of your 3 to 5 page report that discusses your results and conclusions. You should also upload the Python code you used to produce your results (presumably just the changed version of mytext.py).