18

Possible Duplicate:
Looking for dataset to test FULLTEXT style searches on

I recently started a data mining project for which I need 100 GB of plain text for testing. I have spent the whole day searching the net without success. Could someone please provide links where I can download such text files?

peterh
Sri
  • 2
    Are you trying to download a 100 GB text file? – jiten Feb 07 '12 at 07:31
  • Yes, more than 100 GB actually. 1 TB is our target. – Sri Feb 07 '12 at 07:39
  • Get the whole of Gutenberg in one 7zip file: http://www.gutenberg-tar.com/ –  May 12 '16 at 20:41
  • 2
    This may also be handy: https://blog.archive.org/2012/04/26/downloading-in-bulk-using-wget/ both for the benefit of future searchers - I realise that this question is old ;) –  May 12 '16 at 20:43
  • 1
    This link takes you directly to wiki download page: https://dumps.wikimedia.org/enwiki/latest/ – deldev May 25 '17 at 19:35

2 Answers

11

What type of text are you searching for? Conversational, articles, books - or a good spread of everything?

Project Gutenberg might be a good start: http://www.gutenberg.org/

Wikipedia also allows you to download an archive of articles: http://en.wikipedia.org/wiki/Wikipedia:Database_download
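Once files from either source are on disk, a quick tally shows how close the collection is to a size target such as the 100 GB mentioned in the question. This is only a minimal sketch: it assumes the downloads are saved as `.txt` files somewhere under one local directory, and the function name and the `corpus` directory in the usage note are hypothetical.

```python
from pathlib import Path


def corpus_size_bytes(root, pattern="*.txt"):
    """Sum the on-disk size of every regular file under root that matches pattern."""
    return sum(p.stat().st_size for p in Path(root).rglob(pattern) if p.is_file())


def percent_of_target(total_bytes, target_bytes=100 * 10**9):
    """Express the collected size as a percentage of the target (default: 100 GB, decimal)."""
    return 100 * total_bytes / target_bytes
```

For example, `percent_of_target(corpus_size_bytes("corpus"))` would report progress for a hypothetical `corpus` directory. Only files matching the pattern are counted, so archives that have not been unpacked yet are ignored.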

Jordan
  • Yes, any kind of text file is okay: conversational, articles, documentaries, novels, etc. – Sri Feb 07 '12 at 07:36
  • Project Gutenberg would probably be your best bet, there are over 38,000 free books on there. Most of them can be downloaded as plain text files. – Jordan Feb 07 '12 at 07:41
  • Is there a better way than downloading each text file one after the other? Can I get a zipped file whose size is of the order of 1 GB? – Sri Feb 07 '12 at 07:52
5

You should use http://dumps.wikimedia.org/
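The dumps there are compressed XML, so it helps to look inside one without unpacking it fully. The following is a rough sketch, not an official tool: it assumes a bzip2-compressed `pages-articles` style dump (one revision per page) and streams it page by page with the standard library. The function names and the file name in the usage note are my own assumptions.

```python
import bz2
import xml.etree.ElementTree as ET


def local_name(tag):
    """Drop the '{namespace}' prefix that the export XML puts on every tag."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(stream):
    """Yield (title, wikitext) pairs from a MediaWiki XML export, one page at a time."""
    title, text = None, None
    for _, elem in ET.iterparse(stream, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            yield title, text
            title, text = None, None
            elem.clear()  # drop the finished page's contents to keep memory low


def iter_dump(path):
    """Open a .bz2 dump and stream its pages without decompressing it to disk."""
    with bz2.open(path, "rb") as f:
        yield from iter_pages(f)
```

Usage would look like `for title, text in iter_dump("enwiki-latest-pages-articles.xml.bz2"): ...`, with the file name depending on which dump was downloaded. Note that the text of each page is wiki markup rather than clean prose, so some further stripping is needed before it counts as plain text.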

jiten
  • 1
    Can you please give me a specific link? I saw a zipped file in XML format that is around 230 GB. Here is the link: http://en.wikipedia.org/wiki/Wikipedia:Database_download. Before downloading, I would like to know exactly what is inside it. PS: we are looking for text files that contain meaningful text, such as conversations, documentaries, etc. – Sri Feb 07 '12 at 08:07
  • It is actually a dump file of Wikimedia, and it generally contains Wikipedia articles in XML format, so you can check it. I think it should be helpful to you. – jiten Feb 07 '12 at 08:44