LIN371 Machine Learning for Text Analysis

Syllabus

Course: LIN 371, Machine Learning for Text Analysis, 39030 (formerly LIN 373N)
Semester: Fall 2024
Webpage: http://jessyli.com/courses/lin371
Canvas: https://utexas.instructure.com/courses/1402044
Chatter: We will be using Chatter (accessible from Canvas) for discussions, Q&A, etc.
Meeting: MWF 1-2pm, GDC 2.210
Prerequisites: Basic knowledge of (Python) programming and prior knowledge of elementary probability theory are assumed.
Flags: Quantitative Reasoning, Independent Inquiry

Contact information

Instructor: Jessy Li; email: jessy@austin.utexas.edu
- Office hours: M/W 12-1. Location: RLP 4.728.
TA: Sooji Lee; email: sooji.lee@utexas.edu
- Office hours: T/TH 11-12. Location: Linguistics suite, RLP 4th floor east side cubicles, #E6

Course materials

Readings are provided via hyperlinks under Schedule.
Exercises, code, etc. will be posted on Canvas.
For your Python distribution, I suggest the Anaconda distribution of Python 3.

Course overview and objectives

Human language technology, to which large language models like ChatGPT belong, has made amazing strides. Such technology lets us do things like automatically translate from one language to another, analyze what people on social media think about a current topic, or even write code with a copilot. Machine learning, that is, software that can learn from experience, plays a central role in this technology. This course provides an overview of basic methods for machine learning and natural language processing. This is a very hands-on course in which we will use the Python programming language.

The first half of this course focuses on basic machine learning algorithms and applying them to analyze data, much of which is textual. The later portion of the course will shift to natural language processing and to class projects, which will be a major component of the course grade. These projects will allow you to pursue your own interests (and conduct new research in so doing!).

Topics of this course include:

Python programming for data analysis (including relevant libraries for modeling + data munging)
Text processing and workflows for text analysis
Supervised and unsupervised learning
Natural language processing and applications

Acknowledgements: I thank Byron Wallace for sharing his syllabus, materials, and experiences from his course Applied Data Mining for the initial development of this course.

Flags

This course carries the Quantitative Reasoning flag. Quantitative Reasoning courses are designed to equip you with skills that are necessary for understanding the types of quantitative arguments you will regularly encounter in your adult and professional life. You should therefore expect a substantial portion of your grade to come from your use of quantitative skills to analyze real-world problems.

This course also carries the Independent Inquiry flag. Independent Inquiry courses are designed to engage you in the process of inquiry over the course of a semester, providing you with the opportunity for independent investigation of a question, problem, or project related to your major. You should therefore expect a substantial portion of your grade to come from the independent investigation and presentation of your own work.

Course requirements and grading policy

Four homeworks, 60%
- These will involve programming, and you will be required to turn in your code.
Class project, 40%
- This will be an open-ended project about a predictive model you would like to build, or a task you would like to accomplish (e.g., predicting the emotion of a Reddit post, summarizing news documents, etc.). Work with the instructor to find a topic that (importantly!) you are interested in or excited about.
- Course projects should be done by teams of 2 students. Project groups consisting of 1 or 3 students are possible only with prior approval of the instructor.
- Milestones:
  - An initial presentation and a 2-page written proposal of your project;
  - Final project presentation;
  - Project writeup (4 pages) and source code.
The course will use plus-minus grading, using the following scale:

Grade	Percentage
A	>= 93%
A-	>= 90%
B+	>= 87%
B	>= 83%
B-	>= 80%
C+	>= 77%
C	>= 73%
C-	>= 70%
D+	>= 67%
D	>= 63%
D-	>= 60%

Extension policy

Slip days: You will have a total of 6 free slip days that you can use throughout the semester. You can choose how many days you want to use, and however you want to distribute them; e.g., you can use one slip day per assignment, or 4 days for one assignment and none for others. Slip days cannot be used fractionally: submitting an assignment 1 hour late incurs 1 slip day, 25 hours late incurs 2 slip days, etc.
Extension permissions: Beyond the slip days, extensions may be granted on a case-by-case basis for medical emergencies or other extraordinary or emergency circumstances. You must reach out to the instructor to obtain an extension before the deadline in question.
Late penalty: If you have used up your free slip days and have not obtained an extension, then by default, 10 points (out of 100) will be deducted for lateness, plus an additional 5 points for every 24-hour period beyond 2 that the assignment is late. Resubmissions of assignments are allowed; the late penalty applies to post-deadline resubmissions.
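As an illustration of the slip-day arithmetic only (not an official grading tool), the rounding rule can be sketched in Python as follows. The function names are hypothetical, and the sketch assumes lateness is measured in hours past the deadline; it does not model the late penalty.

```python
import math

TOTAL_SLIP_DAYS = 6  # free slip days per semester, per the policy above


def slip_days_used(hours_late: float) -> int:
    """Slip days consumed by one submission.

    Slip days cannot be used fractionally, so any started
    24-hour period counts as a full day.
    """
    if hours_late <= 0:
        return 0  # submitted on time
    return math.ceil(hours_late / 24)


def remaining_slip_days(hours_late_per_assignment) -> int:
    """Slip days left after a list of (possibly late) submissions."""
    used = sum(slip_days_used(h) for h in hours_late_per_assignment)
    return max(TOTAL_SLIP_DAYS - used, 0)


# Examples from the policy: 1 hour late costs 1 day, 25 hours late costs 2 days.
print(slip_days_used(1))             # 1
print(slip_days_used(25))            # 2
print(remaining_slip_days([1, 25]))  # 3
```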

Academic dishonesty policy

You are encouraged to discuss assignments with classmates, but all coding/written work must be your own. Students caught cheating will automatically fail the course. If in doubt, ask the instructor.

Notice about students with disabilities

The University of Texas at Austin provides upon request appropriate academic accommodations for qualified students with disabilities. Please contact the Division of Diversity and Community Engagement, Services for Students with Disabilities, 512-471-6259.

Notice about missed work due to religious holy days

A student who misses an examination, work assignment, or other project due to the observance of a religious holy day will be given an opportunity to complete the work missed within a reasonable time after the absence, provided that he or she has properly notified the instructor. It is the policy of the University of Texas at Austin that the student must notify the instructor at least fourteen days prior to the classes scheduled on dates he or she will be absent to observe a religious holy day. For religious holy days that fall within the first two weeks of the semester, the notice should be given on the first day of the semester. The student will not be penalized for these excused absences, but the instructor may appropriately respond if the student fails to complete satisfactorily the missed assignment or examination within a reasonable time after the excused absence.

Senate Bill 212 and Title IX Reporting Requirements

Under Senate Bill 212 (SB 212), the professor and TAs for this course are required to report for further investigation any information concerning incidents of sexual harassment, sexual assault, dating violence, and stalking committed by or against a UT student or employee. Federal law and university policy also require reporting incidents of sex- and gender-based discrimination and sexual misconduct (collectively known as Title IX incidents). This means we cannot keep confidential information about any such incidents that you share with us. If you need to talk with someone who can maintain confidentiality, please contact University Health Services (512-471-4955 or 512-475-6877) or the UT Counseling and Mental Health Center (512-471-3515 or 512-471-2255). We strongly urge you to make use of these services for any needed support and to report any Title IX incidents to the Title IX Office.

No materials used in this class, including, but not limited to, lecture hand-outs, videos, assessments (quizzes, exams, papers, projects, homework assignments), in-class materials, review sheets, and additional problem sets, may be shared online or with anyone outside of the class unless you have my explicit, written permission. Unauthorized sharing of materials promotes cheating. It is a violation of the University’s Student Honor Code and an act of academic dishonesty. I am well aware of the sites used for sharing materials, and any materials found online that are associated with you, or any suspected unauthorized sharing of materials, will be reported to Student Conduct and Academic Integrity in the Office of the Dean of Students. These reports can result in sanctions, including failure in the course.

Schedule

The schedule is tentative and subject to change.

Week 1 (8/26-8/30)
- Introduction (slides)
  - A YouTube tutorial of Jupyter notebooks
  - Think Python is a good intro-to-Python book!
  - Here is another one: Starting Out with Python
- Probability review
  - slides
  - Readings:
    - Appendix A of Eisenstein’s NLP book
    - Optional: Sections 2.1-2.3 of Bishop’s book
Week 2 (9/3-9/6)
- Decision tree demo
- Maximum likelihood and MAP
  - slides
  - Readings:
    - Sections 9.1 & 9.2 of Daume III’s A Course in Machine Learning
    - Mitchell, Machine Learning, Chapter 2
- Decision Trees
  - slides
  - Readings:
    - Sections 1.1 & 1.2 of Daume III’s A Course in Machine Learning
    - Sections 3.1-3.4 of Mitchell’s Machine Learning book
Week 3 (9/9-9/13)
- Naive Bayes
  - slides
  - Readings:
    - Sections 4.1-4.5, Jurafsky & Martin
    - Optional: Collins’ handout on the maximum likelihood derivation for Naive Bayes Sections 2-4.
- 9/13: Homework 1 due midnight
Week 4 (9/16-9/20)
- Naive Bayes demo
- Basic language modeling: ngrams
  - slides
  - Readings: Sections 3.1-3.5, Jurafsky & Martin
Week 5 (9/23-9/27)
- Linear regression, logistic regression
  - slides
  - Readings:
    - Kolter, Linear Algebra Review and Reference
    - Sections 5.1-5.7, Jurafsky & Martin
- ML workflow and evaluation paradigms
  - slides
- 9/27: Homework 2 due midnight
Week 6 (9/30-10/4)
- Feature vectorization demo
  - Readings: Chapter 5 of Daume III’s A Course in Machine Learning
- Perceptrons
  - slides
  - Readings: Chapter 4 of Daume III’s A Course in Machine Learning
Week 7 (10/7-10/11)
- Deep learning: multi-layer perceptrons
  - slides
  - Readings: Sections 7.1-7.4, Jurafsky & Martin
- Neural language models
  - slides
  - Readings: Section 7.5, Jurafsky & Martin
- 10/11: Homework 3 due midnight
Week 8 (10/14-10/18)
- Word embeddings and demo
  - Readings: Sun et al., Mitigating Gender Bias in Natural Language Processing: Literature Review
- Principal Component Analysis
  - slides
  - Readings: Section 15.2 of Daume III’s A Course in Machine Learning
- Recurrent neural nets
  - slides
  - Readings: Sections 9.1-9.3, Jurafsky & Martin
Week 9 (10/21-10/25)
- Project proposal presentations
- 10/25: Project proposal due midnight
Week 10 (10/28-11/1)
- Sequence-to-sequence models
  - slides
  - Readings: Section 9.4, Jurafsky & Martin
- Attention, Transformers
  - slides
  - Readings: Chapter 11, Jurafsky & Martin
Week 11 (11/4-11/8)
- Demo: building a neural network
  - code: PyTorch basics
- 11/8: Homework 4 due midnight
Week 12 (11/11-11/15)
- Demo: BERT
  - See Hongli’s completed BERT code here
- (Optional) Hongli’s notebook for BART, a pre-trained seq2seq model
Week 13 (11/18-11/22)
- Large Language Models and Demo
  - Stanford slide deck on LLMs
  - Readings: Chapter 10, Jurafsky & Martin
  - Chat templates from HuggingFace
- Unsupervised learning
  - slides
  - Readings:
    - Blei, Probabilistic Topic Models
    - Section 15.1 of Daume III’s A Course in Machine Learning
Week 14 (11/25-11/29), Thanksgiving break
Week 15 (12/2-12/6)
- Project presentations
Week 16 (12/9)
- Applications of NLP.
Project writeup due: Dec 13, 11:59pm. Note: due to grading constraints, the hard deadline for the report will be Dec 15, 11:59pm.

Project

Please refer to the grading policy for a high-level overview of the project and its requirements.

Topic suggestions

Emotion detection during COVID-19: a Twitter dataset here (paper) and a Reddit dataset here (paper). For the CovidET dataset, you could use large language models to identify what triggers an emotion.
Detect whether a tweet from a politician is mentioning someone within their own party or not; data here and paper here.
Pragmatic function of vulgar expressions in social media: here is a collection of tweets containing vulgar words. Build a model to predict the pragmatic function (emphatic, expressing emotions, etc.) of each vulgar word.
Specificity captures the level of detail in a text. Here is a set of tweets marked with their specificity levels. Build a regression model to predict specificity given a tweet.
The CNN/Daily Mail dataset is often used for summarization: generating highlights from scratch given a news article. To evaluate the summaries you generate against human summaries, use ROUGE; Python wrappers here or here.
The DailyDialog dataset contains a series of dialogs with labeled topics, dialog acts and emotions. Create models to predict them! Of course, you can also try to generate responses given the inputs and create a chatbot.
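For the summarization topic above, the idea behind ROUGE can be illustrated with a minimal sketch of ROUGE-1 (unigram overlap). This is a simplified illustration, not the official ROUGE toolkit: the function name `rouge_1` is hypothetical, and lowercasing plus whitespace tokenization are assumptions made here for brevity. For actual project results, use an established ROUGE package rather than this sketch.

```python
from collections import Counter


def rouge_1(candidate: str, reference: str) -> dict:
    """Simplified ROUGE-1: unigram overlap between a generated summary
    (candidate) and a human summary (reference)."""
    # Assumption: lowercase + whitespace tokenization; real toolkits may differ.
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())

    # Clipped overlap: each word counts at most as often as it appears in both.
    overlap = sum((cand & ref).values())

    precision = overlap / sum(cand.values()) if cand else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    if precision + recall == 0:
        f1 = 0.0
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


scores = rouge_1("the cat sat on the mat", "the cat is on the mat")
print(scores)  # 5 of 6 unigrams overlap, so precision = recall = f1 = 5/6
```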

Detailed requirements

Project proposal:
- A written proposal (2 pages) that describes:
  - What is the project about, and why is it interesting/important?
  - What data/resources will you use?
  - What types of machine learning algorithms or models do you plan to apply?
  - How do you plan to evaluate your system? If applicable, describe quantitative metrics you will use.
  - If you are working in a group: who does what.
- A proposal presentation (5-6 minutes) that summarizes the above.
Final project deliverables:
- A written final report (4 pages) that builds on your proposal:
  - What is the project about, and why is it interesting/important?
  - What data/resources did you use?
  - What types of machine learning algorithms or models did you apply?
  - Results: Describe as clearly as possible what your system can (and cannot) do. You can show examples of things your system gets correct and of errors it makes. If applicable, report performance using a quantitative measure.
  - If you are working in a group: a separate section describing who did what.
  - Source code needs to be submitted along with the final report.
- A final presentation (7-8 minutes) that summarizes the above.