I'm looking for a more efficient way of loading text data into Python, instead of using .readlines(), then manually parsing through the data. My goal here is to run different models on the text.

My classifiers are people's names, which are listed before the text of their... let's call them 'Reviews'... which are separated by ***. Here is an example of the txt file:

Mike P, Review, December, 2013
Mike P, Review, June, 2013
Tom A, Review, December, 2013
Tom A, Review, June, 2013
Mark D, Review, December, 2013
Mark D, Review, June, 2012
Sally M, Review, December, 2011

***

This is Mike P's first review

***

This is Mike P's second review

***

This is Tom A's first review

***

Etc...

Ultimately, I need to create a bag of words from the 'Reviews'. I can do this in R, but I'm forcing myself to learn Python for data analysis and keep spinning my wheels every which way I turn.

Thanks in advance!

mrp
  • Perhaps you could give more information on how your bag of words is going to be structured? – John Powell Aug 05 '14 at 21:19
  • Regarding reading a file in Python [this Q&A](http://stackoverflow.com/questions/14676265/how-to-read-text-file-into-a-list-or-array-with-python) could be checked. – 030 Aug 05 '14 at 21:23
  • @JohnBarça, there's nothing _wrong_ with using readlines(), I'm just curious to know if there's a better (or best) way to get this data into Python. I'm going to create a term frequency matrix of the text in the 'Reviews'. So in tabular format, think of each row as a name (Mike P, Tom A, etc..), and the columns are the words from the reviews. – mrp Aug 05 '14 at 21:40
  • @utrecht, Thanks! I was able to use `lines = text_file.read().split('***')` which loaded each 'Review' into an element in a list. Everything before that is in `lines[0]` which shouldn't be too bad to parse through. – mrp Aug 05 '14 at 22:01
  • OK, I stand corrected. For very large files, readlines is a bad idea. – John Powell Aug 05 '14 at 22:13

4 Answers


You are probably looking for something like the Counter collection, which is a very efficient dictionary for counting hashable objects such as words. See "How to read large file, line by line in python" for an explanation of why readlines is not a good approach for large files; the approach described there, and below, turns the file into an iterable, which is more memory efficient. You didn't specify your file sizes, but text analysis often deals with huge files, so that is probably worth mentioning.

Putting these two together, you could do something like this.

from collections import Counter

c = Counter()

with open('Reviews') as f:
    for line in f:              # iterates lazily, one line at a time
        for word in line.split():   # split() with no argument strips newlines and runs of whitespace
            c[word] += 1

EDIT: you might want to split on *** or something else, but this gives the general idea.
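For the term frequency matrix described in the question's comments, one hedged sketch is a Counter per reviewer (the reviewer names and texts below are placeholders standing in for the parsed file):

```python
from collections import Counter

# Hypothetical reviews keyed by reviewer, standing in for the parsed file
reviews = {
    'Mike P': "this is mike p's first review",
    'Tom A': "this is tom a's first review",
}

# One Counter (word -> frequency) per reviewer: a simple bag of words
bags = {name: Counter(text.split()) for name, text in reviews.items()}

print(bags['Mike P']['review'])  # prints 1
```

A Counter returns 0 for words a reviewer never used, which is convenient when the per-reviewer rows are later aligned into one matrix.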

John Powell

You can read the whole file this way; it is a very efficient approach.

with open('Path/to/file', 'r') as content_file:
    content = content_file.read()

Then you can parse content as you wish.

levi
  • Thanks for the reply, but I think this would make my job more difficult -- this is loading the data one character at a time. – mrp Aug 05 '14 at 21:32

If it's a large amount of data to read at once, you can iterate manually via readline() (or by looping over the file object) and parse as you go, dropping unnecessary entries.
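A minimal sketch of that idea, using io.StringIO in place of a real file handle so the example is self-contained:

```python
import io

# Stands in for open('Path/to/file'); any file-like object iterates line by line
f = io.StringIO("Line one\n***\n\nLine two\n")

kept = []
for line in f:                      # lazy iteration: one line in memory at a time
    line = line.strip()
    if line and line != '***':      # drop separator and blank lines on the way
        kept.append(line)

print(kept)  # ['Line one', 'Line two']
```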

locohamster
  • Thanks. I'm finding the most common way people perform operations in Python is to just create some loop to iterate through the data, whereas with R there are a lot of functions that have already been created to do that work for you. Is this more or less accurate? Should I just get used to learning/doing things more 'manually' in Python? – mrp Aug 05 '14 at 21:53
  • No. Python has some extremely efficient ways of doing things -- look at numpy, scipy, scikit-learn, matplotlib and a bunch of other libs for good examples. For what it is worth, I felt the same way when I was trying to learn R -- there is always an initial "why isn't this more like my favourite language?" moment :D. – John Powell Aug 05 '14 at 22:00

If you can post the way you are doing it, or thinking of doing it, with R, I suspect someone could offer suggestions as to how to do it efficiently with Python. For example, you can make a numpy array of strings and use functions in the numpy.char module to do vectorized operations over strings if you prefer that to writing list comprehensions or for-loops.
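As a hedged illustration of the numpy.char idea (the array contents below are made up):

```python
import numpy as np

# Hypothetical array of review strings
reviews = np.array(["This is a review", "Another review of a review"])

# Vectorized string operations from numpy.char -- no explicit Python loop
lowered = np.char.lower(reviews)
counts = np.char.count(lowered, 'review')  # occurrences of 'review' per element

print(counts)  # [1 2]
```

Each numpy.char function applies elementwise across the whole array, which keeps the code loop-free in the same spirit as R's vectorized string functions.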

Travis Oliphant