LIN373 Machine Learning Toolbox for Text Analysis

Syllabus

Course: LIN 373, Machine Learning Toolbox for Text Analysis, 40135
Semester: Spring 2019
Webpage: http://jessyli.com/courses/lin373
Canvas: https://utexas.instructure.com/courses/1243553
Meeting: Tuesday/Thursday 11:00am-12:30pm, SZB 380
Prerequisites: Prior knowledge of elementary probability is preferred. Programming experience is not required; the course includes an introduction to programming in Python.
Flag: Quantitatve

Contact information

Instructor: Jessy Li; email: jessy@austin.utexas.edu
- Office hours: W 11:00-1:00, F 11:00-12:00, RLP 4.728
TA: Laura Manor; email: manor@utexas.edu
- Office hours: M 9-10:30, Tu 9-10:30, RLP 4.400, desk E4

Course materials

We will primarily be using Joel Grus: Data Science from Scratch: First Principles with Python (2015) as our text.
Additional readings are provided via hyperlinks on the Schedule page.
Exercises, code, etc. will be posted on Canvas.
For your Python distribution, I suggest the Anaconda distribution of Python 3.

Course overview and objectives

Technology that automatically analyzes text has made amazing strides, and lets us do things like automatically translate from Chinese to English, summarize what people on Twitter think about some current political topic, or find clues on who the author is of some classic piece of literature. Machine learning plays a central role in this technology: software that can learn from experience. This course provides an overview of basic statistical methods for machine learning, with an emphasis on applications that have to do with text. This is a very hands-on course in which we are going to be using the Python programming language.

We will start with foundations including basic probability/statistics and python programming. The bulk of the course focuses on machine learning methods and applying them to analyze data, much of which textual. The later portion of the course will shift to surveying several tasks in natural language processing and to class projects, which will be a major component of the course. These projects will allow you to pursue your own interests (and conduct new research in so doing!).

Topics of this course include:

Python programming for data analysis (including relevant libraries for modeling + data munging)
Supervised and unsupervised learning
Text processing and workflows for text analysis
Applications

Acknowledgement: I thank Byron Wallace for sharing his syllabus, materials, and experiences from his course Applied Data Mining.

Quantitative Reasoning Flag

This course carries the Quantitative Reasoning flag. Quantitative Reasoning courses are designed to equip you with skills that are necessary for understanding the types of quantitative arguments you will regularly encounter in your adult and professional life. You should therefore expect a substantial portion of your grade to come from your use of quantitative skills to analyze real-world problems.

Course requirements and grading policy

Four homeworks, 60%
- These will involve programming, and you will be required to turn in your code.
Class project, 40%
- This will be an open-ended project about a predictive model you would like to build, or a task you would like to accomplish (e.g., predicting the pragmatic function of swear words, summarizing news documents, etc.). Work with the instructor to find a topic that (importantly!) you are interested in or excited about.
- Course projects should be done by teams of 2 students. Project groups consisting of 1 or 3 students are possible only with prior approval of the instructor.
- Milestones:
  - An initial presentation and a 1-2 page written proposal of your project;
  - Final project presentation;
  - Project writeup (3-4 pages) and source code.
The course will use plus-minus grading, using the following scale:

Grade	Percentage
A	>= 93%
A-	>= 90%
B+	>= 87%
B	>= 83%
B-	>= 80%
C+	>= 77%
C	>= 73%
C-	>= 70%
D+	>= 67%
D	>= 63%
D-	>= 60%

Attendance is not required, and it is not used as part of determining the grade.

Extension policy

Extensions will be considered on a case-by-case basis, but in most cases they will not be granted. If an extension has not been agreed on beforehand, then for assignments, by default, 5 points (out of 100) will be deducted for lateness, plus an additional 1 point for every 24-hour period beyond 2 that the assignment is late. The maximum extension penalty is 40 points if handed in before the last day of class. Resubmissions of assignments are allowed; extension penalty applies for post-deadline resubmissions.

Note that there are always some points to be had, even if you turn in your assignment late. So if you would like to know if you should still turn in the assignment even though it is late, the answer is always yes.

Academic dishonesty policy

You are encouraged to discuss assignments with classmates. But all written work must be your own. Students caught cheating will automatically fail the course. If in doubt, ask the instructor.

Notice about students with disabilities

The University of Texas at Austin provides upon request appropriate academic accommodations for qualified students with disabilities. Please contact the Division of Diversity and Community Engagement, Services for Students with Disabilities, 512-471-6259.

Notice about missed work due to religious holy days

A student who misses an examination, work assignment, or other project due to the observance of a religious holy day will be given an opportunity to complete the work missed within a reasonable time after the absence, provided that he or she has properly notified the instructor. It is the policy of the University of Texas at Austin that the student must notify the instructor at least fourteen days prior to the classes scheduled on dates he or she will be absent to observe a religious holy day. For religious holy days that fall within the first two weeks of the semester, the notice should be given on the first day of the semester. The student will not be penalized for these excused absences, but the instructor may appropriately respond if the student fails to complete satisfactorily the missed assignment or examination within a reasonable time after the excused absence.

Emergency evacuation policy

Occupants of buildings on The University of Texas at Austin campus are required to evacuate buildings when a fire alarm is activated. Alarm activation or announcement requires exiting and assembling outside. Familiarize yourself with all exit doors of each classroom and building you may occupy. Remember that the nearest exit door may not be the one you used when entering the building. Students requiring assistance in evacuation shall inform their instructor in writing during the first week of class. In the event of an evacuation, follow the instruction of faculty or class instructors. Do not re-enter a building unless given instructions by the following: Austin Fire Department, The University of Texas at Austin Police Department, or Fire Prevention Services office. Information regarding emergency evacuation routes and emergency procedures can be found at http://www.utexas.edu/emergency.

Schedule

Schedule is tentative and subject to change.

Week 1
- 1/22 Introduction (code & data)
- 1/24 Python session (code)
  - Readings: Think Python chapter 1, chapter 2 and chapter 3 up to section 3.4. Warning: This online book uses Python 2, not Python 3. The main difference that you will see is that it uses print() without the (). This doesn’t work in Python 3!
  - Install the Anaconda Python Distribution; make sure to use the Python 3.7 version. Note: if you choose not to use the Anaconda Distribution, make sure you have Jupyter Notebook, the Numpy, Scipy and ScikitLearn packages installed.
Week 2
- 1/29 Basic probability and stats review
  - Readings: Grus, Chapters 5 & 6; Eisenstein, Appendix A
- 1/31 Python session (functions) (code)
  - Readings: Think Python chapter 3, chapter 6
Week 3
- 2/5 Basic probability and stats review, continued: distributions, MLE, MAP (code & data)
- 2/7 Python session (conditions, loops) (code)
  - Readings: Think Python chapter 5, Learning with Python chapter 6
Week 4
- 2/12 Decision Trees, basic linear algebra review (code)
  - Readings
    - Grus, Chapters 8, 11, 16
    - Domingos, A Few Useful Things to Know about Machine Learning
  - Linear Algebra Review and Reference from Stanford CS229
  - Random Forests
- 2/14 Python session (recursion, dictionaries) (code)
- 2/15 Homework 1 due midnight
Week 5
- 2/19 Basic linear algebra review, linear regression, logistic regression (code)
- 2/21 Python session (list comprehension, file handling, basic object-oriented programming) (code)
  - Readings: Learning with Python chapter 14, chapter 15, chapter 17
Week 6
- 2/26 Logistic Regression, Naive Bayes (code)
  - Readings:
    - Grus, Chapters 13, 18.1
    - Please also read Chapters 14 (linear regression) and 17 (decision trees), both of which we have covered.
- 2/28 Naive Bayes, OOP, text processing tools (code)
- 3/1 Homework 2 due midnight
Week 7
- 3/5 Perceptron, evaluation
- 3/7 Evaluation/Demo/Unsupervised learning (code)
  - Readings: Grus, Chapter 19
Week 8
- 3/12 Unsupervised learning
- 3/14 Unsupervised learning 2 (code)
- 3/15 Homework 3 due midnight
Week 9
- Spring break
Week 10
- 3/26 Project proposal presentations
- 3/28 Topic modeling demo (code) / Deep learning: multi-layer perceptrons
  - Note: topic modeling demo needs Gensim and pyLDAvis
  - Readings: Grus, Chapter 18 (Sections 18.2 and onwards)
- 3/29 Project proposal writeup due midnight
Week 11
- 4/2 Language models / word embeddings
- 4/4 Word embeddings
  - Optional reading:
    - Notes about word embeddings from Stanford CS 224D
    - Chris McCormick’s post about the skip-gram model
Week 12
- 4/9 Word embedding demo / Recurrent neural nets
  - Optional reading:
    - Shrey Desai’s blog post on backprop
    - Christopher Olah’s blog post that digs deeper into non-linearity and activation functions
  - Homework 4 due midnight
- 4/11 Recurrent neural nets: applications
Week 13
- 4/16 Keras tutorial
- 4/18 Crowdsourcing
  - Reading: Amazon Mechanical Turk: Gold Mine or Coal Mine?
Week 14
- 4/23 Topics in NLP: vulgar language in social media / Structured Prediction
  - Readings:
    - Chapter 17 of Hal Daume III’s book A Course in Machine Learning
    - Optional: UT research on vulgar language: paper on sentiment; paper on pragmatics.
- 4/25 Structured Prediction / CNNs
  - Optional readings:
Week 15
- 4/30 Dialog systems
- 5/2 Summarization
Week 16: Project presentations
- 5/7
- 5/9
Project writeup due: Tuesday May 14 by 11:59pm

Project

Please refer to the grading policy for a high level overview about the project and requirements.

Topic suggestions

Sentiment analysis: extend our existing investigations of the Sentiment140 dataset or the Movie Reviews dataset to consider more features, fancier models, and/or more detailed analysis.
Pragmatic function of vulgar expressions in social media: here is a collection of tweets containing vulgar words. Build a model to predict the pragmatic function (emphatic, expressing emotions, etc) of each vulgar word.
Congressional floor debates is a dataset containing transcripts of U.S. floor debates in the House of Representatives in 2005. Build a model to predict the vote (“yea” vs “ney”) from the speaker of each segment.
Specificity captures the level of details in a text. Here is a set of tweets marked with their specificity levels. Build a regression model to predict specificity given a tweet.
Running topic models and analyzing topics across a corpus of text, e.g., the Brown corpus (available with NLTK).
Multi-document summarization aims to extract the most important sentences from a set of documents under the same topic. Explore unsupervised ways to create such summaries. Here are several sets of articles, along with human summaries for each set. To evaluate the summaries you generate against human summaries, use ROUGE; python wrappers here or here.
The DailyDialog dataset contains a series of dialogs with labeled topics, dialog acts and emotions. Create models to predict them! Of course, you can also try to generate responses given the inputs and create a chatbot.

Detailed requirements

Project proposal:
- A written proposal (1-2 pages) that describes:
  - What the project is about, and why is this interesting/important?
  - What data/resources will you use?
  - What type of machine learning algorithms or models that you plan to apply?
  - How do you plan to evaluate your system? If applicable, describe quantitative metrics you will use.
  - If you are working in a group: who does what.
- A proposal presentation (3-4 minutes) that summarizes the above.
Final project deliverables:
- A written final report (3-4 pages) that builds on your proposal:
  - What the project is about, and why is this interesting/important?
  - What data/resources will you use?
  - What type of machine learning algorithms or models you applied?
  - Results: Describe as clearly as possible what it is your system can (and cannot) do. You can show examples of things your system is getting correct and of errors it is making. If applicable, measure performance by some performance measure.
  - If you are working in a group: separate section describing who did what.
  - Source code needs to be submitted along with the final report.
- A final presentation (6 minutes) that summarizes the above.