I am very new to Python. I am trying to read my text file using the Python data science library pandas, but I get an error that I don't understand. Any help would be much appreciated. Here is my code:

import pandas as pd
text = pd.read_csv("/home/system/Documents/Heena/NLP/modi.txt", sep = " ", header = None)

Error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/system/anaconda3/lib/python3.7/site-packages/pandas/io/parsers.py", line 678, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/home/system/anaconda3/lib/python3.7/site-packages/pandas/io/parsers.py", line 446, in _read
    data = parser.read(nrows)
  File "/home/system/anaconda3/lib/python3.7/site-packages/pandas/io/parsers.py", line 1036, in read
    ret = self._engine.read(nrows)
  File "/home/system/anaconda3/lib/python3.7/site-packages/pandas/io/parsers.py", line 1848, in read
    data = self._reader.read(nrows)
  File "pandas/_libs/parsers.pyx", line 876, in pandas._libs.parsers.TextReader.read
  File "pandas/_libs/parsers.pyx", line 891, in pandas._libs.parsers.TextReader._read_low_memory
  File "pandas/_libs/parsers.pyx", line 945, in pandas._libs.parsers.TextReader._read_rows
  File "pandas/_libs/parsers.pyx", line 932, in pandas._libs.parsers.TextReader._tokenize_rows
  File "pandas/_libs/parsers.pyx", line 2112, in pandas._libs.parsers.raise_parser_error
pandas.errors.ParserError: Error tokenizing data. C error: Expected 62 fields in line 7, saw 67
Rory Daulton
Heena
  • Can you give some example data from your modi.txt? – qxang Dec 24 '18 at 14:08
  • I guess you have a problem with the data, so carefully review the TXT file. – Ali.Turkkan Dec 24 '18 at 14:09
  • I think it's because of the uneven number of spaces in the text. That's why it's not getting converted to a DataFrame. – iamklaus Dec 24 '18 at 14:27
  • The data has unwanted spaces. It looks like this: My dear countrymen, I convey my best wishes to all of you on this auspicious occasion of Independence Day. Today, the country is brimming with self-confidence. The country is scaling new heights by working extremely hard, with a resolve to realize its dreams. Todays dawn has brought a new spirit, a new enthusiasm, a new zeal and a new energy with it. My dear countrymen, in our country, there is a Neelakurinji flower which blooms once every 12 years. – Heena Dec 25 '18 at 03:46
  • What would be the solution? Please help me. – Heena Dec 25 '18 at 04:49

1 Answer

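Because the data itself contains space characters and sep is set to a space, read_csv treats every space as a column boundary, so lines with different word counts produce different numbers of fields. One solution is to separate the fields with a different character and pass that character as the sep value. Example: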

test.csv

data1;data2;data3
My dear countrymen;12;test data1
I convey my best wishes to all of you on this auspicious occasion of Independence Day.;45;test data2

test.py

import pandas as pd
text = pd.read_csv("test.csv", sep = ";")

You can also look at this answer
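If the file is running prose rather than a table, as the sample in the comments suggests, another option is to skip delimiter parsing and keep each line as one piece of text. The following is a minimal sketch using only the standard library. The two-line sample is a hypothetical stand-in for modi.txt, and the variable names are illustrative.

```python
import io

# Hypothetical stand-in for modi.txt: free-running prose, one sentence per line.
sample = io.StringIO(
    "My dear countrymen, I convey my best wishes to all of you.\n"
    "Today, the country is brimming with self-confidence.\n"
)

# Keep every line whole instead of splitting it into columns.
lines = [line.rstrip("\n") for line in sample]

# Splitting on a single space gives a different field count per line,
# which is the situation a sep=" " parser reports as a tokenizing error.
field_counts = [len(line.split(" ")) for line in lines]
print(field_counts)
```

If a DataFrame is still wanted afterwards, the list can be wrapped in a single text column, for example with pd.DataFrame({"text": lines}).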

Ali.Turkkan
  • My file is a text file and it contains a large amount of data. I have also tried the read_fwf() function. With it I am able to read the file, but when I apply Natural Language Processing functions an error occurs, i.e. for match in self._lang_vars.period_context_re().finditer(text): TypeError: expected string or bytes-like object – Heena Dec 26 '18 at 04:35
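The TypeError in that comment is the kind Python's regular expression functions raise when they receive something other than a str or bytes object, so one likely cause (an assumption, since the calling code is not shown) is that a DataFrame was passed to the tokenizer. A minimal sketch of reading the file as one plain string instead is shown below. The file name sample.txt, the UTF-8 encoding, and the naive regex splitter are all hypothetical stand-ins, not the tokenizer used in the thread.

```python
import re
import tempfile
from pathlib import Path

# Write a small hypothetical file so the example is self-contained.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "sample.txt"
    path.write_text("Today is a new dawn. The country is confident.", encoding="utf-8")
    # Read the whole file as one str (UTF-8 is an assumption about the file).
    text = path.read_text(encoding="utf-8")

# A naive stand-in for a sentence tokenizer: split after ., ! or ?
sentences = re.split(r"(?<=[.!?])\s+", text)
print(sentences)

# Passing a non-string object to a regex function raises TypeError.
raised = False
try:
    re.findall(r"\w+", ["not", "a", "string"])
except TypeError:
    raised = True
print(raised)
```

With the real file, the same read_text call on its path should give a str that can be handed to the NLP functions directly.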