Can we analyze word associations in online solicitation texts?
Today I stumbled upon a free, open-source online text analysis tool called Overview, a collaboration between the Associated Press and the Knight Foundation. This tool is designed to allow for the mining of large sets of documents according to topics and to provide a visualization of broad trends and patterns. It is the kind of tool often used by journalists or political analysts whose roles require them to trawl through large numbers of lengthy documents to broadly classify and interpret their contents. Intrigued, I thought, “I wonder what would happen if I uploaded some child sexual solicitation transcripts in there…?” Having conducted research on the topic before, and knowing that there were publicly-available examples of such material, I decided to give it a bash.
[Warning: this post contains an overabundance of profanity.]
Be aware that this is not a formal research project; I was just interested to see what this kind of tool might do with the data and what the output would look like, and I felt that it would be interesting to share my experiences with the readers of nextgenforensic. Who knows, maybe someone will develop a fabulous paradigm-shifting methodology based on this foray (I’m looking at you, grad students in search of data/research projects). Also, this should not be read as an endorsement of this product. I know that other similar tools for text analysis exist and I have no idea of the relative merits of each of them. But this one was free.
For my little demo, I used the top-thirty ‘slimiest’ adult male transcripts from the website of Perverted Justice – the guys from Dateline NBC’s To Catch A Predator series. All averaged sliminess scores, as judged by visitors to the site, of 4.6 or above (4 being ‘Really slimy’ and 5 ‘Simply beyond slimy’). I removed any commentary and additional text from the transcripts and uploaded them as separate rich text format files. It’s worth noting that uploaded documents are only accessible by you via your log-in details, but it’s recommended that for sensitive documents you download and run the application from your own computer.
This is neither the time nor the place to discuss the issues related to Perverted Justice, their practices, and their data. I know these transcripts lack validity due to the use of decoys; I understand that these decoys don’t necessarily replicate the actual behavior/language of children; I know their role is to entice suspects into publishing incriminating evidence; I am aware that they are considered by some to be a vigilante group. But I wanted transcripts and theirs are full and freely available.
Overview can be asked to ignore certain words, so I excluded the user names of the individuals (which appear on every line). It also allows for ‘important’ words to be given extra weight in the analysis algorithm. So I hastily put together a basic, far-from-exhaustive list of sexual words from various online sources. It should be noted that the practical impact of these words will have been limited by the mix of SMS language, teen slang, and bloody awful spelling/grammar on display. Also, the analysis is case-sensitive and I only included lower-case examples, so capitalized variants will not have received the extra weight. The list is available here.
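For the curious, here is a minimal Python sketch of what an ignore list and an ‘important’ word list might do to simple word counts. This is not Overview’s code: the function name `weighted_counts`, the weight of 2.0, and the harmless toy text are all my own assumptions for illustration. It also shows why case-sensitivity matters when the list only contains lower-case words.

```python
from collections import Counter
import re

def weighted_counts(text, ignore, important, weight=2.0, lowercase=False):
    """Count words, skipping `ignore` and up-weighting `important`.

    Hypothetical helper for illustration only; the weight value is an assumption.
    """
    if lowercase:
        text = text.lower()
    counts = Counter()
    for word in re.findall(r"[A-Za-z']+", text):
        if word in ignore:
            continue  # e.g. user names that appear on every line
        counts[word] += weight if word in important else 1.0
    return counts

# Harmless toy transcript; user names are placeholders.
sample = "userA: send Pics please\nuserB: no pics today"
ignore = {"userA", "userB"}
important = {"pics"}  # lower-case only, as in the post

# Case-sensitive matching: "Pics" is counted but not up-weighted.
case_sensitive = weighted_counts(sample, ignore, important)

# Lower-casing first lets both spellings receive the extra weight.
folded = weighted_counts(sample, {w.lower() for w in ignore}, important, lowercase=True)

print(case_sensitive["Pics"], case_sensitive["pics"], folded["pics"])
```

Lower-casing everything before matching is one simple way around the case problem, at the cost of losing any information carried by capitalization.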
Then, I pressed the ‘Go’ button. Actually it’s labeled ‘Import documents’, but that doesn’t really help to weave this tale for the reader. I’m led to believe by the creators of Overview that what happens next is that the software examines every word (exempting ‘stop words’) within each of the documents and assigns the documents to ‘folders’ based on their similarity, where similarity is judged by comparing the relative frequencies of the words each document contains. Then, within each super-ordinate folder, it creates smaller folders based on sub-themes. The output tells you, within each folder, which words are contained in ALL, MOST, or SOME of the files. It also allows you to search documents for specific words or combinations of words, and supports the use of Boolean operators (e.g., “nude AND (pics OR pic)”).
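To give a feel for what ‘similarity judged by relative word frequencies’ could mean, here is a rough Python sketch using cosine similarity. I am assuming that measure purely for illustration; I have not confirmed what Overview computes internally. The stop-word list, the function names, and the toy documents are all made up.

```python
from collections import Counter
import math
import re

STOP_WORDS = {"the", "a", "and", "to", "is", "of"}  # tiny illustrative list

def relative_freqs(text):
    """Map each non-stop word to its share of the document's words."""
    words = [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOP_WORDS]
    total = len(words)
    if total == 0:
        return {}
    return {word: count / total for word, count in Counter(words).items()}

def cosine(a, b):
    """Cosine similarity between two sparse word-frequency vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Harmless toy documents standing in for transcripts.
docs = {
    "doc1": "the cat sat and the cat slept",
    "doc2": "a cat sat",
    "doc3": "stocks fell and bonds rose",
}
vectors = {name: relative_freqs(text) for name, text in docs.items()}

sim_12 = cosine(vectors["doc1"], vectors["doc2"])  # shared vocabulary
sim_13 = cosine(vectors["doc1"], vectors["doc3"])  # no shared vocabulary
print(round(sim_12, 3), sim_13)
```

Documents that score highly against each other would be candidates for the same folder; a clustering step (not shown here) would then do the actual grouping.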
My output gave me a top level that appeared to outline the key words – those that appeared frequently across all of the documents (e.g., MOST: baby, pics, cock, pussy, yea; SOME: dick, nude, gay, wat, sexy).
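The ALL/MOST/SOME labels can be mimicked with a few lines of Python. The cut-offs below (MOST meaning more than half of the files, SOME meaning at least two) are my own guesses, not Overview’s documented thresholds, and `folder_labels` is a hypothetical name.

```python
import re

def folder_labels(docs, most_threshold=0.5):
    """Label words by how many documents in a folder contain them.

    Thresholds are assumptions: ALL = every document, MOST = more than
    `most_threshold` of them, SOME = at least two documents.
    """
    doc_sets = [set(re.findall(r"[a-z']+", d.lower())) for d in docs]
    n = len(doc_sets)
    labels = {"ALL": [], "MOST": [], "SOME": []}
    vocabulary = set().union(*doc_sets)
    for word in sorted(vocabulary):
        count = sum(word in words for words in doc_sets)
        if count == n:
            labels["ALL"].append(word)
        elif count / n > most_threshold:
            labels["MOST"].append(word)
        elif count >= 2:
            labels["SOME"].append(word)
    return labels

# Harmless toy folder of four short documents.
folder = ["red apple pie", "red apple tart", "red apple jam", "red pear tart"]
labels = folder_labels(folder)
print(labels)
```

Words that occur in only one document get no label at all in this sketch, which keeps the summary short.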
The second- and third-level clusters, however, were less decipherable in terms of identifying any meaningful explanation for why those terms might cluster together. Two ‘folders’ were created – one with 25 documents (e.g., MOST: baby, pics, cock; SOME: gay, nude, sexy, dick, idk, rape) and one with 5 documents (e.g., MOST: girl, pussy; SOME: boyfriend, naked, dick, cauze, wood, yea). One could hypothesize that the former appears to contain documents with more male-oriented words (cock, gay [most chatters being male], dick) and the latter more female-oriented words (girl, pussy, boyfriend [if you’ll excuse my heteronormativity]); but, blimey, even the most lackadaisical qualitative researcher might balk at that kind of superficiality.
Digging deeper into the larger, 25-document cluster did nevertheless promise some clues to possible underlying meanings. Arguably, the five folders that stem from the super-ordinate theme I called ‘male-oriented’ above appear to contain documents that include the following linguistic themes (number of documents within each folder in parentheses): (8) sub-dom/aggressive – e.g., pics, ass, master, rape, naughty, etc.; (5) masturbatory/descriptive – e.g., cock, rub, nude, squirt, gay, etc.; (4) female/intimate – e.g., baby, finger, kisses, cutie, honey, okay, vagina, etc.; (4) pregnancy – e.g., baby, condom, pregnant, dick, etc.; and (4) romantic/non-aggressive – e.g., love, online, sexy, dick, underwear, toy, tickle, etc. But again, this is at the most superficial level of interpretation.
Ultimately, this is a tool designed to analyze thousands of documents at a time. Even if words in these conversations do cluster in meaningful ways, it’s unlikely that a sample of thirty documents would have the statistical power to tap into those associations. It does, however, give a glimpse into the associations between some sexual words that are frequent across transcripts. Aside from beefing up the number of documents in the set, my next trick will be to try other sets of important words – for example, risk assessment words (e.g., parents, father, mother, police, cops, legal, etc.) or contact words (meet, location, address, place, etc.).
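As a taste of that next step, here is a small Python sketch that counts how often words from each of those two sets turn up in a piece of text. The word lists come straight from the paragraph above; the function name `category_counts` and the example sentence are invented for illustration, and this is simple counting rather than anything Overview does.

```python
from collections import Counter
import re

# Word sets taken from the post; a real list would need to be far longer.
CATEGORIES = {
    "risk_assessment": {"parents", "father", "mother", "police", "cops", "legal"},
    "contact": {"meet", "location", "address", "place"},
}

def category_counts(text):
    """Count occurrences of each category's words (case-insensitive)."""
    counts = Counter()
    for word in re.findall(r"[a-z']+", text.lower()):
        for name, vocabulary in CATEGORIES.items():
            if word in vocabulary:
                counts[name] += 1
    return counts

# Invented example sentence, not taken from any transcript.
example = "Are your parents home? Tell me a place to meet. No cops, right?"
result = category_counts(example)
print(dict(result))
```

Exact matching like this will miss the misspellings and slang mentioned earlier, so any real attempt would need a more forgiving way of matching words.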
If tools like this were found to be a valid method by which to understand the content of online child sexual solicitation transcripts, they may have the potential to free up resources that I would argue are better applied to understanding process and dynamic factors in these conversational dyads (like the old adage goes, “It’s not what you say, it’s the way you say it”).
Still, this was an interesting endeavor!
Elliott, I. A. (2014, July 18). Can we analyze word associations in online solicitation texts? [Weblog post]. Retrieved from http://wp.me/p2RS15-5B.