|Professor:||Jonathan North Washington|
|Email:||jwashin1@swarthno scrapers please...more.edu|
|Lecture Time:||TTh 11:20am-12:35pm|
|Course moodle site:||S18 - LING073.01|
This course is designed to give you an understanding of the main concepts in the field of Computational Linguistics (as distinguished from Natural Language Processing), and impart the skills needed to solve the types of problems encountered in this field. Here's the official description:
This course explores the possibilities for creating computational resources for languages for which vast collections of text don't exist. Students will choose a language lacking in computational resources and develop tools for it. The focus will be on creating nuanced symbolic representations of the language that can be employed by computers, to the benefit of both language researchers who wish to test grammatical models, and language communities which lack the social capital to benefit from corporately developed resources. Topics covered include input methods and spell-checking, morphological analysis and disambiguation, syntactic parsing, building corpora, and rule-based machine translation, with an emphasis on open source technologies.
Prerequisites: LING 001 (or equivalent), or CPSC 021 (or equivalent), or permission of the instructor.
The primary goal of the course is for students to choose a language lacking in computational resources and develop tools for it. Additionally, students will
The general structure of the course will be centred around student projects. At the beginning of the course, each student will choose an under-resourced language to work on (in consultation with the professor), and will spend the semester developing materials for the language as lab assignments. In general, we will spend two days on each topic (Thursday and Tuesday), where the first day (Thursday) will be more focussed on discussing the topic (overview, general issues and solutions, etc.) and the second day (Tuesday) will be mostly dedicated to guided lab work on the problem. The week's lab assignment will be due the following class day (Thursday) before class begins, so Tuesday provides an opportunity to get started on the lab and get assistance from the professor on difficult areas.
This course has a prerequisite of LING 001 (or equivalent), or CPSC 021 (or equivalent), or the permission of the instructor. Any background beyond the introductory level in either computer science or linguistics (or both!) will give students an advantage, but nothing beyond a previous intro to at least one of them is necessary. All required skills will be imparted throughout the course. There will be no real programming required of students, but we will be using command-line tools and several different types of declarative syntax. No previous knowledge of linguistics is required for students with CS background, but a focus of the course will be coming to understand linguistic phenomena by implementing models of them computationally. The challenge of the course is less about learning the computational formalisms or understanding the patterns in the languages from a formal linguistics perspective (though skills in both fields will be strengthened by the course), and more in learning to use the formalisms to implement models of the linguistic patterns computationally. It is expected that some students will grasp these different aspects of the course with different levels of ease, which provides a great opportunity for students to share knowledge and skills with one another.
No textbook is required, but you will need to have access to the following resources:
We will be using Swarthmore GitHub (github.swarthmore.edu) for a number of purposes. Most assignments will be submitted by a script that automatically clones the relevant repo as class begins.
Also, you'll need to be able to access Moodle (moodle.swarthmore.edu). Some materials we use for the course will be available there (readings, etc.), as will your grades, so make sure you can access it as soon as possible. If you have any trouble with it, notify me as soon as you can. Non-Swarthmore TriCo students may not have access to Moodle immediately at the beginning of the semester—let me know if this is the case for you, and I will make sure you have access to resources in some other way.
The course website (listed above, and linked to from the Moodle course) contains the schedule for the semester, which will be updated regularly with links to various resources. It's recommended to check the website at least a couple of times per week. I will make announcements about any major changes.
The course wiki (wikis.swarthmore.edu/ling073) will be where we organise the resources developed in the course. It is there not just for your professor and classmates, but for anyone in the world to access, and so may end up attracting the attention of speakers of the languages or other people interested in them. It will also be a model for future students of this course to look to.
I hold regular office hours (listed above), and can be available at other times by appointment—just send me an e-mail letting me know when you might prefer to meet.
If you are having any trouble with class, such as with understanding a concept or completing an assignment, please don't hesitate to ask me for help. I'm here to help you learn, so I encourage you to take advantage of my availability.
Show up on time and silence cell phones. Food and drinks are generally not allowed in lab, per the policies for the room. However, I don't mind as long as you don't damage the equipment or disturb your classmates. If you need to step out of the class for any reason (bathroom, emergency phone call, etc.), please do so with minimum disruption (i.e., don't ask for permission).
Due to the nature of the course, we will be using computers in almost every class. This brings about the potential for a number of distractions, so please use the computer only for relevant classroom activities. In other words, please refrain from any sort of non-class-related activities, including messaging (e-mail, social media, etc.), homework for other courses, or even catching up on course reading. Even the best multitaskers are still not participating fully when they're engaging in unrelated endeavours. If it's too difficult to avoid the temptation of these other distractions, you may try strategies like disabling the computer's internet connection, using a filter for web usage, or similar.
Note on pronouns: if you'd like to be referred to by a pronoun that you think I might not guess correctly or if you notice me referring to you by some other pronoun than what you'd prefer, please let me know so that I can get it right.
All material covered during course-related activities—including assigned readings, quizzes, and labs—should be assumed to be required course content, and will be assumed background for later activities. It is each student's responsibility to attend all classes to learn the material covered. If you must miss a class (e.g., for an athletic or religious reason), it is courteous to notify your professor ahead of time if at all possible, but it will be your responsibility to learn about missed material from classmates. It is not my responsibility to make up for your absence or re-teach the material. (That said, let me know if you're having trouble making something up, and we'll figure something out.) With so few class meetings dedicated to each topic and the cumulative nature of the topics, missing one day can be a very big deal—so I really recommend trying not to miss class.
The assigned readings are to be read in advance of the class dates they're assigned for. The readings complement in-class activities and provide the necessary background; however, you should not assume that they will be fully summarized or reviewed in class. Students should be prepared to evaluate, integrate, or respond to the readings in class discussions.
Any excuse for missing any course-related activities will need to be handled by your class dean. Please see the Medical Excuse Policy (http://www.swarthmore.edu/student-health/medical-excuse-policy), and remember to contact your class dean as soon as you can so that they can work with you.
Assignments will generally be due at the beginning of class on Thursdays. Work on the assignment must be complete in order to move on to the next topic, so it is essential that assignments be submitted on time.
You will submit assignments almost exclusively on GitHub and the course wiki (each assignment will say explicitly how to submit it), both of which keep timestamps. These two methods also both allow for incremental submissions, so you may commit and push (GitHub) and save your work (wiki) as often as you like while you work on it. This means that I can see exactly what was there at the deadline, and also that partial work already submitted by the deadline will count as on time.
Any work submitted between the deadline and when the assignments are graded (usually not before the next day) will receive only half credit. For example, if you submit about 75% of the assignment before the deadline and 100% of the assignment is there when it is graded, the remaining 25% counts for half (12.5%), so you can receive at most 87.5% on the assignment.
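The half-credit rule can be written out as a small calculation. The sketch below is only an illustration of the arithmetic in the example above, not an official grading script; the function name `max_late_score` and its parameters are hypothetical.

```python
def max_late_score(on_time_fraction, graded_fraction, late_weight=0.5):
    """Upper bound on an assignment score under the half-credit rule.

    on_time_fraction: share of the assignment present at the deadline (0 to 1).
    graded_fraction:  share present when the assignment is graded (0 to 1).
    late_weight:      credit given to late work (assumed to be one half).
    """
    if not 0.0 <= on_time_fraction <= graded_fraction <= 1.0:
        raise ValueError("expected 0 <= on_time_fraction <= graded_fraction <= 1")
    # Work added after the deadline is the difference between the two snapshots.
    late_fraction = graded_fraction - on_time_fraction
    return on_time_fraction + late_weight * late_fraction


# The example from the policy: 75% on time, 100% at grading time.
print(max_late_score(0.75, 1.0))  # 0.875
```

Work that is complete at the deadline is unaffected: `max_late_score(1.0, 1.0)` returns `1.0`.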
Using words or ideas from another source without attribution constitutes plagiarism, and misrepresenting another student's work as your own (or allowing another student to misrepresent your work as their own) is cheating. Please see the student handbook for the College's policies on academic misconduct (http://www.swarthmore.edu/student-handbook/academic-policies#academic_misconduct). Suspected cases of academic misconduct will be pursued to the full extent of College policy, including referral to the College Judicial Committee.
You are always expected to do your own work on assignments. You may (and are encouraged to) ask one another for and provide one another with assistance on assignments. If you are providing assistance, you must not provide the solution—you may only provide guidance that will help the other student(s) find the solution on their own. If the work in this course were a real-world FOSS project, providing the solution would be okay (and perhaps even encouraged), but the requirement that each student be evaluated on their own work is incompatible with this model (at least on the surface).
With every assignment you should include an AUTHORS file in the top directory, just like you might find in an open source project. If you receive assistance on any assignment from anywhere (a classmate, a website, a native speaker of the language, a stranger on the internet, etc.), please acknowledge them in the AUTHORS file.
In some instances you may work with your classmates. For lab assignments where you are working in a group or with a partner, you may divide the work as appropriate, within the parameters of the assignment. You may also discuss generalities of lab assignments with your classmates, such as what is expected from you. And of course, any discussion of course materials is strongly encouraged.
In short, submitting work that is not your own or providing a classmate with a solution will be considered academic misconduct and will be addressed as such (see above-mentioned policies). So please just be honest. And if you have any questions about what's considered acceptable, ask me first.
The grade in this course is broken down into the following components. Each component is expounded upon following the table.
Lab assignments will be due nearly every week of class. Each assignment will be a new tool (or analysis) for the language you are working with throughout the course.
Usually at least one class session will be dedicated to working on the assignment, so you can get a head start on it, and work through any problems that might come up during the assignment.
Some labs may not be entirely applicable to all languages; these labs will include an alternate option, with data for another language provided. You may only submit this alternate assignment for credit if you've consulted with the professor first. It's your responsibility to start each assignment early enough to consult with the professor in time to do the alternative assignment if your language will not work for the assignment. Such assignments will make it clear what's necessary for you to identify in the language, and the professor is available to help you figure out how your language fits the requirements.
Your midterm demonstration will be a short presentation clearly outlining what you have developed so far in the semester, how well it performs, and some examples of one or two issues unique to your language that you find particularly interesting (whether solved yet or not). The amount of time available for this presentation will be announced ahead of time and will depend on how many students are in the class—it will probably be very short (on the order of a couple of minutes). You'll be expected to use the time efficiently and not go over. A short question-answer section may also be included.
The final project will expand your work throughout the semester into one final domain, to be chosen from among the topics discussed over the last days of class, or another topic relevant to the course.
You should consider ahead of time what you might be interested in—that may be either interesting to work on or useful for a language community—and speak with the professor about how to approach the problem. Several options will be provided which include some guidelines for how to complete them; there will be some options both for those who are less technologically adventurous but are willing to do difficult work with a language and for those who are more technologically capable but not as interested in doing linguistic analyses. If you're not sure what might be a good idea given your background and strengths in the course up to that point, please talk with the professor.
If the project involves a translation pair, then you may collaborate in groups as you did when working on translation pairs for lab assignments. Your project should include, among other things, an evaluation component (i.e., test how well what you did works), and should be released publicly with an open source license (even if not fully useful [yet]). During our final exam time, you will give a short presentation on your project, much like the midterm demonstrations. More information on the project will be provided later in the semester.
I do not grade on attendance, but you will be graded on engagement in the class, and this requires attendance. Beyond simply showing up and participating, you're encouraged to contribute to discussions by asking questions, answering questions, making relevant comments, helping classmates with in-class activities, etc. You will not be ridiculed for asking even simple questions—I want to make sure everyone grasps the concepts, and many are not as straightforward as they may first seem (or as I think they are). You are also expected to have read any assigned readings before class.
You are encouraged to engage in relevant discussion electronically as well—e.g., via the General Discussion forum on Moodle or in issues posted on GitHub. The course will also have an IRC channel, which you're encouraged to be logged into when you can. This is a good way to get help from your classmates (and your professor!) outside of class. Just be sure not to share solutions to assignments!
If you believe that you need accommodations for a disability, please contact Leslie Hempling in the Office of Student Disability Services (Parrish 113) or email firstname.lastname@example.org to arrange an appointment to discuss your needs. As appropriate, she will issue students with documented disabilities a formal Accommodations Letter. Since accommodations require early planning and are not retroactive, please contact her as soon as possible. For details about the accommodations process, visit the Student Disability Service Website at http://www.swarthmore.edu/academic-advising-support/welcome-to-student-disability-service. You are also welcome to contact me privately to discuss your academic needs. However, all disability-related accommodations must be arranged through the Office of Student Disability Services.
|week||date||topic||due / to read|
|1||23 Jan||What (and why) is CL (and NLP)?|| |
| ||25 Jan||Materials for communities; Models of development, FOSS|| |
|2||30 Jan|| ||lab 0 - documentation of resources|
| ||1 Feb|| ||lab 1 - keyboard layout (due on Friday)|
|3||6 Feb|| || |
| ||8 Feb|| ||lab 2 - Initial corpus assembly|
|4||13 Feb|| || |
| ||15 Feb||FSTs and morphology||lab 3 - Grammar documentation|
|5||20 Feb||FSTs and morphology|| |
| ||22 Feb||FSTs and phonology||lab 4 - Basic morphological analyser|
|6||27 Feb||FSTs and phonology|| |
| ||1 Mar|| ||lab 5 - Basic morphological generator|
|7||6 Mar|| || |
| ||8 Mar||RBMT and contrastive grammars||lab 6 - Basic CG disambiguator|
| ||13 Mar||Spring break!|| |
| ||15 Mar||Spring break!|| |
|8||20 Mar|| || |
| ||22 Mar|| ||lab 7 - Contrastive grammar|
|9||27 Mar|| || |
| ||29 Mar|| ||lab 8 - Lexical transfer|
|10||3 Apr|| || |
| ||5 Apr|| ||lab 9 - Lexical selection|
|11||10 Apr|| || |
| ||12 Apr|| ||lab 10 - Structural transfer|
|12||17 Apr|| || |
| ||19 Apr|| ||lab 11 - Polished basic RBMT system|
|13||24 Apr|| ||lab 12 - Dependency corpus/grammar/parser|
| ||26 Apr||Other MT technologies: Dependency transfer, phrase-structure transfer, SMT and hybrid MT, neural MT, ...|| |
|14||1 May||Other CL technologies, NLP: bag-of-words fun, ASR, text-to-speech, ...|| |
| ||3 May||Other applications for FSTs: Spell checkers, paradigm generators, ...|| |