Today, huge amounts of text are available in electronic form. We can probe these text collections to answer questions about language and about the people who use it. For example, we can test whether passive constructions are falling out of favor in English, and we can trace how words change their meaning over time. We can also study a politician’s word choices in political debates to find out more about their personality, or we can see how inaugural addresses have changed over time.
This course provides a hands-on introduction to working with text data. It includes an introduction to programming in Python, with a focus on text processing and data exploration, and a “cookbook” of programming examples that will enable you to analyze texts on your own very quickly. Most of the conclusions that we want to draw from text are “risky conclusions”: they are trends rather than yes-or-no answers. The course therefore also includes an introduction to statistical techniques for data exploration and for making and assessing “risky conclusions”. Finally, the course includes a project in which you can test your text analysis skills on a question of your own choice.
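To give a flavor of what assessing a “risky conclusion” can look like, here is a minimal sketch of a permutation test using only the Python standard library. The function name `permutation_test` and the numbers are made up for illustration; they are not course materials or real measurements. The test asks how often a difference at least as large as the observed one would arise if the group labels were assigned at random.

```python
import random


def permutation_test(group_a, group_b, n_permutations=10_000, seed=0):
    """Two-sided permutation test on the difference between group means.

    Returns the fraction of random relabelings whose mean difference is
    at least as large as the observed one (an approximate p-value).
    """
    rng = random.Random(seed)
    n_a = len(group_a)
    observed = abs(sum(group_a) / n_a - sum(group_b) / len(group_b))
    pooled = list(group_a) + list(group_b)
    n_b = len(pooled) - n_a
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # randomly reassign the group labels
        diff = abs(sum(pooled[:n_a]) / n_a - sum(pooled[n_a:]) / n_b)
        if diff >= observed:
            at_least_as_extreme += 1
    return at_least_as_extreme / n_permutations


# Hypothetical data: passive constructions per 1,000 words
# in six older and six newer texts (invented numbers).
older = [12.1, 10.4, 11.8, 13.0, 12.5, 11.1]
newer = [9.2, 10.0, 8.7, 9.9, 10.8, 9.1]
p_value = permutation_test(older, newer)
print(f"approximate p-value: {p_value:.4f}")
```

A small p-value suggests that the difference is unlikely to be a product of random variation alone. It supports a trend; it does not prove a yes-or-no answer.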
By the end of this course, you will:
know how to use simple word counts to answer many questions about people and about language, and know how to choose the right words for counting
know how to write programs in the Python programming language to access and analyze texts
know how to visualize and graph descriptive statistics about texts
know what hypothesis tests in statistics are, know some types of hypothesis tests, and know how to implement them in practice using Python packages
know what basic regression models in statistics are, know what they are used for, and know how to implement them in practice using Python packages
be familiar with a toolkit of linguistic text preprocessing tools, and know how to use it to normalize and filter words in a text
know what hypothesis testing is, and how to use it to distinguish actual findings from random variations in the data
know how clustering and topic modeling can be used to gain a quick overview of topics and themes that appear in written texts, and know how to apply these techniques in practice using Python packages
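As an illustration of the first learning goal, here is a minimal sketch of word counting with simple normalization and filtering, using only the Python standard library. The stopword list, the helper names `tokenize` and `count_words`, and the sample sentence are assumptions made for this example; the course may use a dedicated preprocessing toolkit instead.

```python
import re
from collections import Counter

# A tiny, hand-picked stopword list (an assumption for this sketch).
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it", "that", "we", "our"}


def tokenize(text):
    """Lowercase the text and split it into runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def count_words(text, stopwords=STOPWORDS):
    """Count the words in a text, skipping the stopwords."""
    return Counter(tok for tok in tokenize(text) if tok not in stopwords)


# Invented sample sentence, not taken from a real speech.
sample = "We the people of the nation renew our promise to the nation and to the people."
counts = count_words(sample)
print(counts.most_common(2))
```

Which words to filter out is itself an analytical choice: removing function words such as “we” and “our” is convenient for finding topics, but those same words can be the interesting ones when studying a speaker's style.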
It is crucial for your success in this class that you attend the lectures, do the in-class exercises, and participate in in-class discussions. The TA will keep an attendance sheet. Please remember to enter your name into the attendance sheet each time you come to class. You may miss three class sessions without penalty. For each missed class session beyond three, your attendance grade will decrease by 4 out of 100 points. Exceptions to this rule (due to medical emergencies, etc.) are at the discretion of your teacher. An important rule of thumb for an extension-related conversation: be communicative, be proactive, and let us know ahead of time.
You are encouraged to discuss assignments with classmates. But all coding/written work must be your own. Students who violate University rules on scholastic dishonesty are subject to disciplinary penalties, including the possibility of failure in the course and/or dismissal from the University. Since such dishonesty harms the individual, all students and the integrity of the University, policies on scholastic dishonesty will be strictly enforced. For further information, please visit the Office of Student Conduct and Academic Integrity website.
The University of Texas at Austin provides upon request appropriate academic accommodations for qualified students with disabilities. Please contact the Division of Diversity and Community Engagement, Services for Students with Disabilities, 512-471-6259.
A student who misses an examination, work assignment, or other project due to the observance of a religious holy day will be given an opportunity to complete the work missed within a reasonable time after the absence, provided that he or she has properly notified the instructor. It is the policy of the University of Texas at Austin that the student must notify the instructor at least fourteen days prior to the classes scheduled on dates he or she will be absent to observe a religious holy day. For religious holy days that fall within the first two weeks of the semester, the notice should be given on the first day of the semester. The student will not be penalized for these excused absences, but the instructor may appropriately respond if the student fails to complete satisfactorily the missed assignment or examination within a reasonable time after the excused absence.
Under Senate Bill 212 (SB 212), the professor and TAs for this course are required to report for further investigation any information concerning incidents of sexual harassment, sexual assault, dating violence, and stalking committed by or against a UT student or employee. Federal law and university policy also require reporting incidents of sex- and gender-based discrimination and sexual misconduct (collectively known as Title IX incidents). This means we cannot keep confidential information about any such incidents that you share with us. If you need to talk with someone who can maintain confidentiality, please contact University Health Services (512-471-4955 or 512-475-6877) or the UT Counseling and Mental Health Center (512-471-3515 or 512-471-2255). We strongly urge you to make use of these services for any needed support and to report any Title IX incidents to the Title IX Office.
No materials used in this class, including, but not limited to, lecture hand-outs, videos, assessments (quizzes, exams, papers, projects, homework assignments), in-class materials, review sheets, and additional problem sets, may be shared online or with anyone outside of the class unless you have my explicit, written permission. Unauthorized sharing of materials promotes cheating. It is a violation of the University’s Student Honor Code and an act of academic dishonesty. I am well aware of the sites used for sharing materials, and any materials found online that are associated with you, or any suspected unauthorized sharing of materials, will be reported to Student Conduct and Academic Integrity in the Office of the Dean of Students. These reports can result in sanctions, including failure in the course.
Occupants of buildings on The University of Texas at Austin campus are required to evacuate buildings when a fire alarm is activated. Alarm activation or announcement requires exiting and assembling outside. Familiarize yourself with all exit doors of each classroom and building you may occupy. Remember that the nearest exit door may not be the one you used when entering the building. Students requiring assistance in evacuation shall inform their instructor in writing during the first week of class. In the event of an evacuation, follow the instruction of faculty or class instructors. Do not re-enter a building unless given instructions by the following: Austin Fire Department, The University of Texas at Austin Police Department, or Fire Prevention Services office. Information regarding emergency evacuation routes and emergency procedures can be found at http://www.utexas.edu/emergency.
If you are worried about someone who is acting differently, you may use the Behavior Concerns Advice Line to discuss by phone your concerns about another individual’s behavior. This service is provided through a partnership among the Office of the Dean of Students, the Counseling and Mental Health Center (CMHC), the Employee Assistance Program (EAP), and The University of Texas Police Department (UTPD). Call 512-232-5050 or visit http://www.utexas.edu/safety/bcal.
All students should be familiar with the University’s official e-mail student notification policy. It is the student’s responsibility to keep the University informed as to changes in his or her e-mail address. Students are expected to check e-mail on a frequent and regular basis in order to stay current with University-related communications, recognizing that certain communications may be time-critical. The complete text of this policy and instructions for updating your e-mail address are available at http://www.utexas.edu/its/policies/emailnotify.html.
The schedule is tentative and subject to change.
Please refer to the grading policy for a high-level overview of the project and its requirements.
Course projects are typically done by teams of 2 students. Projects done by 1 or 3 students are only possible with prior approval of the instructor.
Ideally, you pick a topic of your own that you are curious about. But to give you an idea of possible topics, here are a few pointers. Please discuss your topic with the instructor to make sure that it is both substantial and feasible.
How do people with different political affiliations talk about the same topic? Do they use different words? To study this, you can use word association weights, clustering and topic modeling to identify themes in documents.
Themes in song lyrics for different genres: To study this, you can use word association weights, clustering and topic modeling to identify themes in documents.
Author analysis: analyzing poems to detect who may have written them, and what characteristics they have
Language and ratings: What kinds of words are being used to describe, for example, cheap versus expensive wines?
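Several of these topics come down to comparing word use between two groups of texts. One simple way to do that, sketched below, is a smoothed log ratio of relative word frequencies. This is only one possible association measure and not necessarily the one taught in class; the function names and the tiny “wine review” snippets are invented for illustration.

```python
import math
import re
from collections import Counter


def word_counts(texts):
    """Count lowercase alphabetic tokens across a list of texts."""
    counts = Counter()
    for text in texts:
        counts.update(re.findall(r"[a-z]+", text.lower()))
    return counts


def log_ratios(texts_a, texts_b):
    """Log ratio of each word's relative frequency in group A versus group B.

    Add-one smoothing keeps words that occur in only one group from
    producing a division by zero. Positive scores lean toward group A,
    negative scores toward group B.
    """
    counts_a, counts_b = word_counts(texts_a), word_counts(texts_b)
    vocab = set(counts_a) | set(counts_b)
    total_a = sum(counts_a.values()) + len(vocab)
    total_b = sum(counts_b.values()) + len(vocab)
    return {
        word: math.log((counts_a[word] + 1) / total_a)
        - math.log((counts_b[word] + 1) / total_b)
        for word in vocab
    }


# Invented snippets standing in for reviews of cheap and expensive wines.
cheap = ["fruity and sweet", "sweet and simple"]
expensive = ["complex and elegant", "elegant oak and tannins"]

scores = log_ratios(cheap, expensive)
ranked = sorted(scores, key=scores.get, reverse=True)
print("most typical of cheap:", ranked[0])
print("most typical of expensive:", ranked[-1])
```

With real data you would want far more text per group, and a hypothesis test to check that the differences are more than random variation.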