Reading a text file with pandas in Python

Question

I am very new to Python. I am trying to read a text file with the pandas data science library, but I get an error, apparently related to Unicode, that I don't understand. Any help would be much appreciated. Here is my code:

import pandas as pd
text = pd.read_csv("/home/system/Documents/Heena/NLP/modi.txt", sep=" ", header=None)

Error traceback:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/system/anaconda3/lib/python3.7/site-packages/pandas/io/parsers.py", line 678, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/home/system/anaconda3/lib/python3.7/site-packages/pandas/io/parsers.py", line 446, in _read
    data = parser.read(nrows)
  File "/home/system/anaconda3/lib/python3.7/site-packages/pandas/io/parsers.py", line 1036, in read
    ret = self._engine.read(nrows)
  File "/home/system/anaconda3/lib/python3.7/site-packages/pandas/io/parsers.py", line 1848, in read
    data = self._reader.read(nrows)
  File "pandas/_libs/parsers.pyx", line 876, in pandas._libs.parsers.TextReader.read
  File "pandas/_libs/parsers.pyx", line 891, in pandas._libs.parsers.TextReader._read_low_memory
  File "pandas/_libs/parsers.pyx", line 945, in pandas._libs.parsers.TextReader._read_rows
  File "pandas/_libs/parsers.pyx", line 932, in pandas._libs.parsers.TextReader._tokenize_rows
  File "pandas/_libs/parsers.pyx", line 2112, in pandas._libs.parsers.raise_parser_error
pandas.errors.ParserError: Error tokenizing data. C error: Expected 62 fields in line 7, saw 67

I suspect the problem is in the data, so review the TXT file carefully. – Ali.Turkkan, Dec 24 '18 at 14:09
I think it's because of the uneven number of spaces in the text; that's why it is not being converted to a DataFrame. – iamklaus, Dec 24 '18 at 14:27
The data has unwanted spaces. It looks like this: My dear countrymen, I convey my best wishes to all of you on this auspicious occasion of Independence Day. Today, the country is brimming with self-confidence. The country is scaling new heights by working extremely hard, with a resolve to realize its dreams. Todays dawn has brought a new spirit, a new enthusiasm, a new zeal and a new energy with it. My dear countrymen, in our country, there is a Neelakurinji flower which blooms once every 12 years. – Heena, Dec 25 '18 at 03:46
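
The message "Expected 62 fields in line 7, saw 67" fits what the comments above describe: with sep=" ", every space starts a new field, so lines of prose with different word counts produce different field counts. A minimal sketch for checking this on a file follows. The helper name fields_per_line is hypothetical, and UTF-8 encoding is an assumption.

```python
def fields_per_line(lines, sep=" "):
    """Return how many sep-delimited fields each line would split into."""
    return [len(line.rstrip("\n").split(sep)) for line in lines]


# Possible usage on the file from the question (encoding is an assumption):
# with open("/home/system/Documents/Heena/NLP/modi.txt", encoding="utf-8") as handle:
#     print(fields_per_line(handle))
```

If the counts differ from line to line, the file is probably not tabular data that a single-space separator can parse.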

score 0 · Answer 1 · answered Dec 25 '18 at 15:43

Because the data itself contains space characters, the CSV parser treats each space as a column boundary when sep=" " is used. As a solution, separate the fields with a different character and pass that character as the sep value. For example:

test.csv

data1;data2;data3
My dear countrymen;12;test data1
I convey my best wishes to all of you on this auspicious occasion of Independence Day.;45;test data2

test.py

import pandas as pd
text = pd.read_csv("test.csv", sep=";")
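
If the file is running prose rather than delimited data, changing the separator may not be practical. One alternative, sketched here under that assumption, is to skip the CSV parser and store each non-empty line as one row of a single-column DataFrame. The function name lines_to_frame and the column name "text" are hypothetical, and UTF-8 encoding is an assumption.

```python
import pandas as pd


def lines_to_frame(lines):
    """Build a one-column DataFrame with one row per non-empty line."""
    cleaned = [line.strip() for line in lines if line.strip()]
    return pd.DataFrame({"text": cleaned})


# Possible usage on the file from the question (encoding is an assumption):
# with open("/home/system/Documents/Heena/NLP/modi.txt", encoding="utf-8") as handle:
#     df = lines_to_frame(handle)
```

Because no separator is involved, spaces inside a line can no longer change the number of columns.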

You can also look at this answer


Ali.Turkkan


My file is a plain text file and it contains a large amount of data. I have also tried the read_fwf() function. With it I am able to read the file, but when I apply Natural Language Processing functions an error occurs, i.e. for match in self._lang_vars.period_context_re().finditer(text): TypeError: expected string or bytes-like object – Heena Dec 26 '18 at 04:35
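
The TypeError in this comment is the one a regex raises when it receives something other than a string, which would happen if the DataFrame returned by read_fwf() were passed directly to the tokenizer. That diagnosis is an assumption, since the calling code is not shown. A minimal sketch of converting a column to one plain string first follows. SENTENCE_END is a simple stand-in pattern, not the tokenizer's real regex, and column_to_text is a hypothetical helper.

```python
import re

import pandas as pd

# Simple stand-in for a tokenizer's internal regex (not the real pattern).
SENTENCE_END = re.compile(r"[.!?]")


def column_to_text(column):
    """Join the non-missing values of a pandas Series into one plain string."""
    return " ".join(column.dropna().astype(str))


frame = pd.DataFrame({"text": [
    "Today, the country is brimming with self-confidence.",
    None,  # a missing cell, which is dropped before joining
    "My dear countrymen.",
]})

# SENTENCE_END.finditer(frame) would raise TypeError: a DataFrame is not a string.
text = column_to_text(frame["text"])
matches = list(SENTENCE_END.finditer(text))
```

Reading the whole file with open(...).read() would give the same kind of string without going through pandas at all.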
