1st Workshop on NLP for Internet Freedom

A workshop dedicated to NLP methods that potentially contribute (either positively or negatively) to the free flow of information on the Internet, or to our understanding of the issues that arise in this area.

NLP4IF Proceedings

Venue: COLING 2018, Santa Fe, NM, USA, August 20

Motivation

According to a recent report by Freedom House, an “independent watchdog organization dedicated to the expansion of freedom and democracy around the world”, Internet freedom declined in 2016 for the sixth consecutive year. 67% of all Internet users live in countries where criticism of the government, military, or ruling family is subject to censorship. Social media users face unprecedented penalties, as authorities in 38 countries made arrests based on social media posts over the past year. Globally, 27% of all Internet users live in countries where people have been arrested for publishing, sharing, or merely “liking” content on Facebook. Governments are increasingly going after messaging apps like WhatsApp and Telegram, which can spread information quickly and securely.

Various barriers prevent citizens of many countries from accessing information. Some are infrastructural and economic; others are violations of user rights, such as surveillance, breaches of privacy, and repercussions for online speech and activities, including imprisonment, extralegal harassment, and cyberattacks. Yet another area is limits on content, which include legal regulations on content, technical filtering and blocking of websites, and (self-)censorship.

Large Internet providers are effective monopolies and themselves have the power to use NLP techniques to control information flow. Users are suspended or banned, sometimes without human intervention and with little opportunity for redress. Users react by using coded, oblique, or metaphorical language and by taking steps to conceal their identity, such as using multiple accounts, which raises questions about who the real author of a post is.

This workshop aims to bring together NLP researchers whose work contributes to the free flow of information on the Internet.

The workshop is supported by the U.S. National Science Foundation, award No. 1828199.

Students can apply for travel grants. For more information, please contact Anna Feldman.


The topics of interest include (but are not limited to) the following:

We hope that our workshop will promote Internet freedom in countries where accessing and sharing of information are strictly controlled by censorship.

Mailing list

[Mailing list for the workshop](!forum/nlp4if)

Important dates

* Workshop submission deadline: May 25, 2018
* Notification: June 20, 2018
* Camera-ready submission deadline: June 30, 2018
* Workshop date: August 20, 2018

Registration

To register, please go to

Submission Guidelines

Submissions should be written in English and anonymized with regard to the authors and/or their institution (no author-identifying information on the title page or anywhere else in the paper), including referencing style as usual. Authors should also ensure that identifying meta-information is removed from files submitted for review. Submissions must use the Word or LaTeX template files provided by COLING 2018 and conform to the format defined by the COLING 2018 style guidelines.

* Long paper submission: up to 8 pages of content, plus 2 pages for references. Final versions of long papers may use one additional page: up to 9 pages of content, with unlimited pages for references.
* Short paper submission: up to 4 pages of content, plus 2 pages for references. Final versions of short papers: up to 5 pages of content, with unlimited pages for references.

PDF files must be submitted electronically via the START submission system. The recommended style files are available from the COLING repository.

Double submission policy: Parallel submission to other meetings or publications is possible, but the workshop contact person must be notified immediately. If a paper is accepted, it can only be withdrawn within two days after notification.

Organizers and Program Committee

### Organizers:

* Chris Brew, Computational Research Scientist, Digital Operatives
* Anna Feldman, Professor of Linguistics and Computer Science at Montclair State University
* Chris Leberknight, Associate Professor of Computer Science at Montclair State University

### Program Committee

- Joan Bachenko, Deception Discovery Technologies, NJ
- Jedidiah Crandall, University of New Mexico, NM
- Chaya Hiruncharoenvate, Mahasarakham University
- Lifu Huang, Rensselaer Polytechnic Institute (RPI), NY
- Zubin Jelveh, The University of Chicago
- Judith Klavans, Columbia University, NY
- Jeffrey Knockel, University of New Mexico, NM
- Will Lowe, Princeton University
- Rada Mihalcea, University of Michigan, Ann Arbor, MI
- Prateek Mittal, Princeton University, NJ
- Rishab Nithyanand, Data & Society, NY
- Noah Smith, University of Washington
- Thamar Solorio, University of Houston, TX
- Mahmood Sharif, Carnegie Mellon University, PA
- Evan Sultanik, Trail of Bits, NY
- Svitlana Volkova, Pacific Northwest National Laboratory, WA
- Brook Wu, NJIT, NJ

Workshop Program

| Time | Event |
| --- | --- |
| 09:00–10:00 | Invited talk: Jennifer Pan (Stanford University): How the Chinese Government Fabricates Social Media Posts for Strategic Distraction, Not Engaged Argument (slides) |
| 10:00–10:30 | The effect of information controls on developers in China: An analysis of censorship in Chinese open source projects (Jeffrey Knockel, Masashi Crete-Nishihata and Lotus Ruan) (slides, data) |
| 10:30–11:00 | Coffee Break |
| 11:00–12:00 | Invited talk: Jed Crandall (University of New Mexico): How to Talk Dirty and Influence Machines (slides) |
| 12:00–12:30 | Linguistic Characteristics of Censorable Language on SinaWeibo (Kei Yin Ng, Anna Feldman, Jing Peng and Chris Leberknight) (slides, data) |
| 12:30–02:00 | Lunch |
| 02:00–03:00 | Invited talk: Nancy Watzman (Dot Connector Studio): What do journalists really want from NLP researchers? How to help build trust in media and democracy by helping journalists make sense of big data (slides) |
| 03:00–03:30 | Creative Language Encoding under Censorship (Heng Ji and Kevin Knight) (slides available upon request) |
| 03:30–04:00 | Coffee Break |
| 04:00–05:00 | Panel: NLP and Disinformation (Moderator: Chris Brew; Panelists: Jed Crandall, Heng Ji, Veronica Perez-Rosas, Nancy Watzman) |

Guest Speakers

Dr. Jennifer Pan (Stanford University, CA)

How the Chinese Government Fabricates Social Media Posts for Strategic Distraction, Not Engaged Argument

The Chinese government has long been suspected of hiring as many as 2 million people to surreptitiously insert huge numbers of pseudonymous and other deceptive writings into the stream of real social media posts, as if they were the genuine opinions of ordinary people. Many academics, and most journalists and activists, claim that these so-called 50c party posts vociferously argue for the government’s side in political and policy debates. As we show, this is also true of most posts openly accused on social media of being 50c. Yet almost no systematic empirical evidence exists for this claim or, more importantly, for the Chinese regime’s strategic objective in pursuing this activity. In the first large-scale empirical analysis of this operation, we show how to identify the secretive authors of these posts, the posts written by them, and their content. In contrast to prior claims, the Chinese regime’s strategy is to avoid arguing with skeptics of the party and the government, and to not even discuss controversial issues. The goal of this massive secretive operation is instead to distract the public and change the subject, as most of these posts involve cheerleading for China, the revolutionary history of the Communist Party, or other symbols of the regime.

Dr. Jedidiah Crandall (University of New Mexico)

How to Talk Dirty and Influence Machines

Prof. Jedidiah Crandall is an Associate Professor in the University of New Mexico Department of Computer Science. The principle that drives his research is this: those who censor and surveil the Internet should only be able to do so with full transparency. Towards this end his research group carries out Internet measurements, reverse engineering of applications, and social media analysis to shed light on censorship and surveillance around the world. Crandall is part of the Net Alerts project, a collaborative effort to protect at-risk populations (such as journalists and activists) on the Internet by educating them about the unique threats they face.

In the book "How to Talk Dirty and Influence People," Lenny Bruce expounds on the idea that humans need words that draw on human experience in order to communicate some ideas effectively. This suggests a lot of euphemism, vernacular, and, yes, dirty words. In this talk I'll give an NLP research outsider's perspective on the problem of algorithmically understanding what humans are saying online. I'll use real examples of posts and keywords that trigger censorship or surveillance in China to illustrate challenging problems in this space. For example, how can we do topical analysis on language that is deliberately obfuscated? China's netizens have affectionately coined the term "Martian Language" to refer to how they communicate online, and my research group has proposed pointillism as an approach to topical analysis in this environment. I'll also propose some even more challenging problems, such as: can we take hundreds of keyword blacklists containing hundreds of thousands of keywords (many of which return 0 hits on Google) and categorize them? (The answer is maybe.)

Nancy Watzman (Dot Connector Studio)

What do journalists really want from NLP researchers? How to help build trust in media and democracy by helping journalists make sense of big data

Nancy Watzman is an award-winning investigative journalist, researcher, and strategist with a focus on launching data-rich journalism projects on emerging platforms. She has more than two decades of experience doing research, writing, strategy, communications, and policy analysis. Her reporting and commentary have appeared in many leading publications, including Harper’s Magazine, The Nation, The New Republic, USA Today, and The Washington Monthly, and she has appeared on NPR, Fox News, C-SPAN, and other networks. She has also worked with watchdog and other public interest groups, including the Internet Archive, Sunlight Foundation, Common Cause, Center for Public Integrity, Center for Responsive Politics, Public Campaign (now Every Voice), and Public Citizen. In 2016 she managed the launch of the Political TV Ad Archive, a collection of 2016 political ads with underlying, downloadable data on airings in key TV markets. She is co-author, with Micah Sifry, of Is That a Politician in Your Pocket? Washington on $2 Million a Day (John Wiley & Sons, 2004), and contributed to The Buying of the Congress, Center for Public Integrity (Avon Books, 1998). She was the recipient of the Century Fund grant and served as a fellow for the Center for Independent Media. She is a graduate of Swarthmore College, where she majored in History and English Literature.

Trust in democracy generally and media particularly is at or close to all-time lows, according to established surveys. A combination of factors, including the rise of social media platforms and the decimation of traditional news profit-making models, has contributed to the current state of affairs. The spread of mis- and dis-information is rampant, with various types of actors exploiting vulnerabilities in the new ways that people are communicating. The good news is that journalists are becoming increasingly sophisticated in the questions they are asking of big data to help in investigative, fact-checking, and other projects. A number of innovative experiments, many sponsored by philanthropy, are underway. At the same time, with layoffs continuing to ripple through newsrooms, varying technical skills, and thorny problems to solve, there’s much to be done. This talk will give a general overview of several ongoing projects and describe areas where journalists need help.