Programming from A to Z
- Spring 2008, Tuesdays 3:30 - 6:00 pm
- Daniel Shiffman, daniel.shiffman@nyu.edu
- Office Hours Signup
Important Sites
Syllabus
Week 1 -- Jan 22 -- The beginning
Week 2 -- Jan 29 -- Regular Expressions
Week 3 -- Feb 5 -- The Concordance
- Binary Search Tree
- Updated File I/O
- Concordance
- Reading: Algorithmic Efficiency -- Beating a Dead Horse Faster
- Reading: What is Text Analysis from the Text Analysis Developers Alliance Wiki
- Additional notes: flickr 1, flickr 2
- Assignment: Choose one of the following exercises that expands the concordance functionality (or develop your own expansion). How might you use this basic form of analysis in a project you are considering working on in relation to this course? Run the concordance on a few texts. Post your thoughts and results to the wiki.
- Instead of merely printing the concordance information to the screen, write it out in a report file. Create a class part of the a2z package (modeled after A2ZFileReader) that manages writing a String to an output file.
- Expand the information the concordance holds so that it keeps track of word positions (i.e. not only how many times do the words appear in the source text, but where do they appear each time.)
- Take the concordance results into a graphics environment (such as Processing) and visualize. Check the CVS project "a2z_week3" for an example of Processing in Eclipse running the Concordance. Also, visit this tutorial of using Processing in Eclipse.
- Port the concordance to another environment, such a perl or php.
- Advanced! Re-write the tree so that it holds "generic" data, i.e. any object that implements the Comparable interface.
- Super advanced! Research balanced binary trees. Rewrite the Tree class with a function that re-balances the data.
Week 4 -- Feb 12 -- Bayesian Text Analysis
- Note: The code in the CVS project is more up-to-date than on the tutorial handout page!
- Hash Tables
- Bayes' Rule
- Spam Filtering using Bayesian Analysis
- A Plan for Spam and Better Bayesian Filtering by Paul Graham. This essay is also available in the book Hackers and Painters.
- An Intuitive Explanation of Bayesian Reasoning -- quite long, but truly amazing if you're interested in delving deeper into the math behind Bayesian analysis.
- Introduction to Bayesisan Filtering (suggested)
- Suggested Reading from the Nora Project
- For a Perl implementation: Naive Bayesian Text Classification (sorry, you have to register).
- Assignment: Use bayesian text classification towards some end. If you have trouble working with the code, simple run the example with different training texts (Shakespeare vs. Chekov, New York Times vs. New York Post, etc.) and evaluate its performance. What could be improved? Feel free to write a proposal in lieu of code. Here are some suggested code expansions:
- Rename variables and functions to make the code more generic (is a text category A or category B, for example). In addition, remove the spam filtering bias from the example to make it more generic. These are (a) doubling the count of "good" words and (b) using a spam probability of 0.4 for unknown words (this should arguably be 0.5 for other cases).
- Implement some of the improvements suggested in Graham's better bayesian filtering. These include improving the tokenization, using word counts to revise probabilities for words that only appear in spam or non-spam e-mails (0.99 and 0.01). Also, my example uses the total number of words in the training set to calculate the likelihood of a word appearing in spam (or non-spam). Change this to the of total number of e-mails (not you will have to be more clever about loading the e-mail files themselves).
- Expand the example to classify text among more than 2 categories. We will discuss solutions in class (or come see me if this interests you).
- Use the example to auto-generate keywords about a text. For example, what makes your e-mails unique compared to the world at large. Can you "auto-tag" text using bayesian classification. Revisit The Nora Project for some inspiration.
- Picking up on the previous two assignments, can you use bayesian classification to find similarities between people. For example, if you look at all the e-mails on the ITP student list, can you determine who is similar? Consider using properties in addition to word count, such as time of e-mails, length of e-mails, etc. (Thanks to Chris Kairalla who implemented something like this last year.)
Week 5 -- Feb 19 -- Spiders
- URL grabbing
- Linked Lists
- Being Polite
- Finding new URLs
- A Crawler Class
- Reading: Guidelines for Robot Writers -- a must, our own crawler is very impolite!
- Reading: The Anatomy of a Large-Scale Hypertextual Web Search Engine -- Google circa 1998!
- Reading: SPHINX: a framework for creating personal, site-specific Web crawlers, note to self: implement an example that uses a custom classifier (if someone is interested, remind me!)
- Assignment: Write a short proposal for a “midterm” project. Document with pseudo-code, sketches, links to related material, inspiration, etc. and post to the wiki. Develop a plan for implementing one (very small) piece of the project (or a scaled down demo) — remember, this is just a two week assignment, no need to overshoot the implementation! The proposal is due next week with a prototype / demo due the following week. We’ll look at everyone’s proposal next week and the project itself the week after.
Week 6 -- Feb 26 -- Getting Data
- HTML
- XML
- APIs
- Reading (suggested): Visualizing Data by Ben Fry -- chapters 5,9,10
- Assignment: Complete midterm project documentation, post link to wiki and be prepared to present next week. Remember, this is just a one week programming assignment.
Week 7 -- Mar 4 -- Midterm Workshop
Week 8 -- Mar 11 -- Generative Text
Week 9 -- Mar 25 -- WordNet
- WordNet
- RiTa WordNet
- Reading: WordNet: a lexical database for English, George A. Miller (Note you must visit this site on the NYU Network or via the NYU proxy).
- Reading: WordNet on Wikipedia
- Reading: WordNet bibilography (optional, you may find some papers here helpful for specific projects / ideas)
- Guest: Daniel Howe
- Homework: Final Project Proposals. Include in your proposal a title, brief description, and links to things you want to show / talk about -- work that inspired you, reference pages, sketches you've made, sample programs/code, previous projects, etc.
Week 10 -- Apr 1 -- Threads, Drawing Text, and Final Project Proposals
Week 11 -- Apr 8 -- final project proposals
Week 12 -- Apr 15 -- individual meetings / workshop help
Week 13 -- Apr 22 -- final project presentations
Week 14 -- Apr 29 -- final project presentations
Homework
Students are required to complete a programming exercise each week. Documentation for each assignment should be posted to the linked wiki page.
Requirements: (no incompletes)
- 50% homeworks
- 30% final project
- 20% class participation, attendance