
Okay, I have a CSV file with many lines (more than 40k currently). Because of the number of lines, I need to read it one line at a time and run a sequence of operations on each line. That is the first question. The second is: how do I read the CSV file and decode it as UTF-8, following the example in the csv documentation? Even using the class UTF8Recoder, the output of my print is \xe9 s\xf3 instead of é só. Could someone help me solve this problem?

import preprocessing
import pymongo
import csv, codecs, cStringIO
from pymongo import MongoClient
from unicodedata import normalize
from preprocessing import PreProcessing

class UTF8Recoder:
    def __init__(self, f, encoding):
        self.reader = codecs.getreader(encoding)(f)
    def __iter__(self):
        return self
    def next(self):
        return self.reader.next().encode("utf-8")

class UnicodeReader:
    def __init__(self, f, dialect=csv.excel, encoding="utf-8-sig", **kwds):
        f = UTF8Recoder(f, encoding)
        self.reader = csv.reader(f, dialect=dialect, **kwds)
    def next(self):
        '''next() -> unicode
        This function reads and returns the next line as a Unicode string.
        '''
        row = self.reader.next()
        return [unicode(s, "utf-8") for s in row]
    def __iter__(self):
        return self

with open('data/MyCSV.csv','rb') as csvfile:
    reader = UnicodeReader(csvfile)
    #writer = UnicodeWriter(fout,quoting=csv.QUOTE_ALL)
    for row in reader:
        print row

def status_processing(corpus):

    myCorpus = preprocessing.PreProcessing()
    myCorpus.text = corpus

    print "Starting..."
    myCorpus.initial_processing()
    print "Done."
    print "----------------------------"

Edit 1: Mr. S Ringne's solution works. But now I cannot run the operations inside my def. Here's the new code:

import pandas as pd

for csvfile in pd.read_csv('data/AracajuAgoraNoticias_facebook_statuses.csv', encoding='utf-8', sep=',', header='infer', engine='c', chunksize=2):

    def status_processing(csvfile):

        myCorpus = preprocessing.PreProcessing()
        myCorpus.text = csvfile

        print "Fazendo o processo inicial..."
        myCorpus.initial_processing()
        print "Feito."
        print "----------------------------"

And at the end of the script:

def main():
    status_processing(csvfile)

main()

The output, when I use BeautifulSoup to remove links, is:

ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
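For reference, the same ValueError can be reproduced by putting a whole DataFrame into a plain boolean test, which suggests the chunk DataFrame is ending up in such a test somewhere (the column name status_message below is just an illustration):

import pandas as pd

df = pd.DataFrame({'status_message': [u'\xe9 s\xf3 um teste']})  # illustrative data
try:
    if df:  # boolean test on a whole DataFrame
        pass
except ValueError as e:
    print(e)  # The truth value of a DataFrame is ambiguous. Use a.empty, ...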
– Leandro Santos

2 Answers


Here's a simple pattern to read line by line in UTF-8:

import csv

with open(filename, 'r', encoding="utf-8") as csvfile:
    spamreader = csv.reader(csvfile, delimiter=',', quotechar='"')
    for row in spamreader:
        print(row)  # your operations go here
– verbatross

You can store your CSV in a pandas DataFrame and do further operations on it, which will be quicker.

import pandas as pd

df = pd.read_csv('path_to_file.csv', encoding='utf-8', header='infer', engine='c')
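If the file is too big to hold in memory at once, the same read_csv call accepts a chunksize and then yields one DataFrame per chunk, so you can still process the rows a few at a time. A rough sketch (the chunk size of 1000 and the column name 'text' are only placeholders):

import pandas as pd

# with chunksize, read_csv returns an iterator of DataFrames instead of one big frame
for chunk in pd.read_csv('path_to_file.csv', encoding='utf-8', chunksize=1000):
    # 'text' is a placeholder column name; use the column that holds your statuses
    for value in chunk['text']:
        print(value)  # do your per-row operations here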
– Shubham R
  • Hmm, but how do I read line by line? In my case, I read a line, do the operations in `def status_processing`, and then go back and read another line. The word-correction step is too costly to read everything at once and only then do the operations. – Leandro Santos Nov 11 '16 at 05:55
  • @LeandroS.Matos Use chunksize: `for df in pd.read_csv('matrix.txt', sep=',', header=None, chunksize=1):` – Shubham R Nov 11 '16 at 06:06
  • @LeandroS.Matos: http://stackoverflow.com/questions/29334463/pandas-read-csv-file-line-by-line – Shubham R Nov 11 '16 at 06:06
  • OK. I will try it. – Leandro Santos Nov 11 '16 at 06:46
  • @LeandroS.Matos And do upvote this answer, which is common practice around here, and mark it as correct if it solves your problem. – Shubham R Nov 11 '16 at 06:48
  • It worked, but another problem arose. If you could read the edit I made to the original question and try to help me, I would be grateful. – Leandro Santos Nov 11 '16 at 14:15