I want to extract natural language from Google 5-grams using a keyword. Then I need to clean the results of stop words (prepositions, pronouns, etc.). Next I want to replace each remaining word with a number: I have an Excel file with a large corpus of words and a corresponding score for each. In the end I want to run a (two-sided repeated) ANOVA.
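The cleaning and scoring steps could be sketched roughly as follows. This is only a minimal illustration: `STOP_WORDS` and `SCORES` are made-up stand-ins, and in practice the score table would be loaded from the Excel file (for example with `pandas.read_excel`) and the stop-word list taken from a fuller source.

```python
# Hypothetical stand-ins: a tiny stop-word set and a word -> score table.
# The real score table would come from the Excel file.
STOP_WORDS = {"the", "of", "in", "it", "and", "a", "to", "she", "he"}
SCORES = {"precious": 7.2, "rare": 6.1, "shiny": 6.8}


def score_words(words, scores, stop_words):
    """Drop stop words, then pair each remaining word with its score.

    Words that have no entry in the score table are skipped.
    """
    kept = [w.lower() for w in words if w.lower() not in stop_words]
    return [(w, scores[w]) for w in kept if w in scores]


result = score_words(["The", "precious", "of", "rare", "unknown"], SCORES, STOP_WORDS)
print(result)
```

Skipping unscored words silently is a design choice made here for brevity; for the ANOVA it may be better to record them so the missing data can be inspected.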

I have found this useful script from culturomics for Python 2.x that does the first step. My input is "gemstone _NOUN" (the wildcard function for nouns only). This input needs to be repeated to cover most other content words, i.e. "gemstone _VERB", "...* _ADJ", "...* _ADV". The output per input is a TSV file. In row 3 I have the keyword with the result and the linguistic word category. So I need to get rid of the keyword and the word category and store all the results in an accessible manner for further processing. Should I store them in a Python array?
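Stripping the keyword and the word category could look something like the sketch below. It assumes each result label has the form `gemstone jewelry_NOUN`; that format, the sample labels, and the helper name `extract_word` are assumptions for illustration, not taken from the culturomics script. A plain Python list is enough to hold the cleaned words.

```python
# Part-of-speech suffixes used in the wildcard queries.
POS_TAGS = ("_NOUN", "_VERB", "_ADJ", "_ADV")


def extract_word(label, keyword):
    """Return the matched word(s) from a label like 'gemstone jewelry_NOUN'.

    The keyword token is dropped and any trailing POS tag is removed.
    """
    cleaned = []
    for token in label.split():
        if token == keyword:
            continue
        for tag in POS_TAGS:
            if token.endswith(tag):
                token = token[:-len(tag)]
                break
        cleaned.append(token)
    return " ".join(cleaned)


# Made-up sample labels in the assumed format.
labels = ["gemstone jewelry_NOUN", "gemstone dealers_NOUN", "gemstone cut_VERB"]
words = [extract_word(label, "gemstone") for label in labels]
print(words)
```

The labels themselves could be read from the TSV output with the standard `csv` module using `delimiter="\t"`.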

Another possibility is to use the concordance function from the NLTK package to retrieve the desired words, then use the stop-word cleaning function (which I was told exists) and replace the words with numbers. But I haven’t pursued this option.

Before I continue I thought I’d ask. Is there another script available that I could leverage? Being new to Python, which approach is better?

I am looking to retrieve the results of 40 keywords, which gives me 200 words from Google 5-grams. Ideally I would like to adapt and apply the script to Twitter and other secondary data. Many thanks!

Simone

1 Answer

I will go with option A (tweaking the existing culturomics script) and/or Alvas's suggestions. The concordance function only reads .txt and .xml files (so it cannot actually read a URL as input) and only allows a single-word input. This might be updated in the future. According to this discussion, there seems to be a graphical solution for multiple-word input. I could certainly try to use the concordance crawler (I haven't looked at it in depth, though) to gather the data, write the results to a compatible file and then start the analysis. But this adds another step to the script, and I am not convinced it is worthwhile.

Simone