Machinations


Censorship
October 8, 2011, 1:21 am
Filed under: Uncategorized

A recent press release gives some information on how censorship is performed in the Chinese version of Skype; the results are discussed in more detail in a paper by Jeffrey Knockel, Jed Crandall and myself. Currently, a significant portion of Internet censorship is keyword based: any content that contains a keyword on some blacklist is censored. Countries that perform keyword censorship generally try to hide these blacklists, probably both for political reasons and to make it difficult to evade censorship by using neologisms: new words that have the same meaning but are not on the blacklist.

In some peer-to-peer applications, this censorship is done on the client side, so there is a subroutine in, e.g., Skype chat that checks whether an outgoing message contains a keyword on the blacklist. If you are the censor and you don’t care about revealing the blacklist, then there are techniques for doing this in an efficient manner (hint: FSAs). However, it’s an interesting (but evil) theoretical problem to think about how to efficiently do keyword censorship if you are also trying to hide the blacklist, in particular from someone who may be running your executable in a debugger. Hint: the Chinese Skype program did it incorrectly and that’s why we were able to decrypt their blacklist!

7 Comments so far

[spoiler alert]

Well, you do it by cryptographic hashing, of course. (The hash function doesn’t have to be very secure, I guess; any other one-way-function type thing would work.) Pretty ingenious, I never thought of that application of hashing.

Comment by Elad

Elad,

You’re right that a cryptographic hash function will hide the blacklist. However, there is still a question of scalability that I don’t know the answer to:

The blacklist can get rather large, particularly if the censor needs to keep adding neologisms for words. For example, there are many ingenious neologisms for June 4th (the Tiananmen Square protest date) and Falun Gong. The naive cryptographic hashing approach has a runtime that depends on the product of the message size and the blacklist size. Is this runtime unavoidable in order to ensure secrecy of the blacklist?
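To make the scaling question concrete, here is a minimal sketch of what a naive hashed check might look like. It is only an illustration, not the method any real client uses: the function names are hypothetical, SHA-256 is an arbitrary choice, and storing each keyword's length next to its digest is an assumption made to keep the sketch short (it leaks the lengths).

```python
import hashlib


def digest(text: str) -> str:
    """Hash a string with SHA-256 (an arbitrary choice for this sketch)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def build_hashed_blacklist(keywords):
    """Censor side: ship only (length, digest) pairs, never the plaintext."""
    return [(len(word), digest(word)) for word in keywords]


def is_censored(message: str, hashed_blacklist) -> bool:
    """Client side: hash every window of the message against every entry.

    The two nested loops are where the (message size) x (blacklist size)
    cost comes from.
    """
    for length, target in hashed_blacklist:
        for start in range(len(message) - length + 1):
            if digest(message[start:start + length]) == target:
                return True
    return False


hashed = build_hashed_blacklist(["flg", "june 4th"])
blocked = is_censored("see you on june 4th", hashed)
allowed = is_censored("see you tomorrow", hashed)
```

Grouping entries by length and using a set lookup would cut the work to one pass per distinct keyword length, but the cost would still grow with the blacklist whenever it contains many different lengths.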

Note that if you use a finite state automaton to test whether a message contains any keyword in the blacklist, then the runtime of your algorithm is linear in the message size, and doesn’t depend on the size of the blacklist. However, this approach of course can reveal info about the blacklist…
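For reference, one standard automaton of this kind is the Aho-Corasick construction. The sketch below is a small textbook-style version with hypothetical function names, not code from any actual client. Note that the transition table stores the keywords character by character, which is exactly why it gives the blacklist away to anyone who can inspect it.

```python
from collections import deque


def build_automaton(keywords):
    """Build goto/fail/output tables for an Aho-Corasick style matcher."""
    goto = [{}]      # goto[state][char] -> next state
    fail = [0]       # fail[state] -> fallback state on a mismatch
    out = [False]    # out[state] -> True if some keyword ends here
    for word in keywords:
        state = 0
        for ch in word:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append(False)
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state] = True
    # Breadth-first pass: compute failure links from shallow to deep states.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] = out[nxt] or out[fail[nxt]]
            queue.append(nxt)
    return goto, fail, out


def contains_keyword(message, automaton):
    """Scan the message once, following goto and failure links."""
    goto, fail, out = automaton
    state = 0
    for ch in message:
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        if out[state]:
            return True
    return False


automaton = build_automaton(["flg", "june 4th"])
hit = contains_keyword("talking about flg today", automaton)
miss = contains_keyword("talking about the weather", automaton)
```

The scan reads each character of the message once, and the failure links are followed at most as many times in total as there are characters, so the matching time does not grow with the number of keywords.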

Comment by Jared

you can always run words through one at a time from a dictionary and see what gets through and what doesn’t. then you know what’s in the blacklist. maybe i’m missing something here.
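A rough sketch of this probing idea, treating the client's check as a black-box oracle. Everything here is hypothetical: `make_oracle` stands in for whatever check the client performs, and the word list is made up. The probe can only recover entries that happen to be in the candidate list.

```python
import hashlib


def make_oracle(secret_keywords):
    """Stand-in for the client's check: answers only 'blocked or not'."""
    digests = {
        hashlib.sha256(word.encode("utf-8")).hexdigest()
        for word in secret_keywords
    }

    def is_blocked(candidate: str) -> bool:
        h = hashlib.sha256(candidate.encode("utf-8")).hexdigest()
        return h in digests

    return is_blocked


def recover_blacklist(is_blocked, candidates):
    """Feed candidates through one at a time and keep the ones that are blocked."""
    return [word for word in candidates if is_blocked(word)]


oracle = make_oracle(["protest", "flg"])
recovered = recover_blacklist(oracle, ["hello", "protest", "world"])
```

Here `recovered` holds only "protest"; the other entry is never found because it is not among the candidates.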

s.

Comment by steve uurtamo

This works for words that are in a dictionary. However, many phrases, names, and URLs that are on the blacklist are not listed in any dictionary. Also, neologisms won’t appear in a dictionary. For example, “flg”, which is a neologism for Falun Gong, is on the blacklist, but not in any dictionary.

Take a look at the blacklist below and you’ll see what I mean:
http://cs.unm.edu/~jeffk/tom-skype/dlist-3.6/list

Comment by Jared

i’ll check it out. however, these kinds of phrases and words are generally very easily scraped from public forums. back in the old days (for password checking purposes), i scraped usenet pretty heavily just to get common misspellings, etc. checking is (presumably) quite fast.

s.

Comment by steve uurtamo

There is a research community that focuses on reverse engineering these blacklists (see e.g. http://www.conceptdoppler.org/faq.html), and I think that scraping phrases from public forums is one of the techniques they use. However, for various reasons, this does not seem to capture entire blacklists. To mention a couple that occur to me: 1) some of the phrases on the blacklists are 8-10 words long (e.g. detailed location names for jasmine revolution protest sites); and 2) the act of censorship may make some of the phrases on the blacklist unlikely to occur on easily accessible public forums.

Jared

Comment by Jared

jared — wow! your points are both excellent.

s.

Comment by steve uurtamo
