Jean-Jacques Rousseau
This evening I needed to get a list of unique words for a captcha tool. I decided to start with one of my favourite books of all time: 'Confessions' by Jean-Jacques Rousseau. I grabbed the text courtesy of Project Gutenberg and then had to figure out how to extract the words. Rousseau was not only an eloquent writer but also possessed a wide vocabulary (even when translated into English).
I googled for some scripts, but then decided to just pipe stuff together on the command line. Here's how it looked:
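grep -ow '[a-zA-Z]\{4,12\}' rousseau.txt |  # whole words of 4 to 12 letters, one per line
tr 'A-Z' 'a-z' |                            # convert to lowercase
sort -u |                                   # sort and drop duplicates
tr '\n' ';' > rousseauWords.txt             # join the words with semicolons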
This processed the whole book in less than 2 seconds and gave me a list of all unique words longer than 3 characters and shorter than 13, stripped of punctuation, converted to lowercase and separated by semicolons.
Not particularly eloquent, and I'm sure there is some repetition, but it does the job nicely.
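If you want to try this kind of pipeline without downloading a whole book, here is a minimal sketch that runs it on a single line of text. The sample sentence and the variable names `sample` and `words` are made up for illustration. Here grep is given `-w` so that only whole words match; a word longer than 12 letters (like "extraordinarily") is then skipped rather than cut into pieces.

```shell
# Hypothetical sample text standing in for the book
sample='The quick brown fox. The QUICK brown fox jumps over extraordinarily lazy dogs!'

words=$(printf '%s\n' "$sample" |
  grep -ow '[a-zA-Z]\{4,12\}' |  # whole words of 4 to 12 letters, one per line
  tr 'A-Z' 'a-z' |               # convert to lowercase
  sort -u |                      # sort and drop duplicates
  tr '\n' ';')                   # join the words with semicolons

echo "$words"
# prints: brown;dogs;jumps;lazy;over;quick;
```

Note the trailing semicolon: the last newline gets translated too, so whatever reads the list should probably ignore an empty final entry.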