I have a text file as follows:

   Movie_names Rating
      "A"         10
      "B"         6.5

The text file is tab-delimited. Some movie titles are enclosed in double quotes. How can I read it into a pandas DataFrame with the quotes removed from the movie names?

I tried using the following code:

import pandas as pd
data = pd.read_csv("movie.txt")

However, it says there is a Unicode decode error. What should be done?

Mainul Islam

3 Answers

1

First, you can read tab-delimited files using either read_table or read_csv. The former uses a tab delimiter by default; for the latter you need to specify it:

import pandas as pd
df = pd.read_csv('yourfile.txt', sep='\t')

Or:

import pandas as pd
df = pd.read_table('yourfile.txt')
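
To illustrate that the two calls are interchangeable, here is a minimal sketch that uses an in-memory string in place of a real file. The sample rows mirror the question's data, and the variable names (raw, df_csv, df_table) are only illustrative assumptions.

```python
from io import StringIO

import pandas as pd

# Hypothetical in-memory stand-in for the tab-delimited file.
raw = 'Movie_names\tRating\n"A"\t10\n"B"\t6.5\n'

# read_csv needs the tab separator spelled out ...
df_csv = pd.read_csv(StringIO(raw), sep="\t")

# ... while read_table uses a tab separator by default.
df_table = pd.read_table(StringIO(raw))

# Both calls should produce the same frame, with the surrounding
# double quotes already dropped by the default quotechar='"'.
print(df_csv.equals(df_table))
print(df_csv["Movie_names"].tolist())
```

Because the default quote character is a double quote in both functions, no extra step should be needed to strip the quotes in this simple case.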

If you are receiving encoding errors, it is because read_table is decoding the file with the wrong text encoding. You can solve this by specifying the encoding explicitly, for example for UTF-8:

import pandas as pd
df = pd.read_table('yourfile.txt', encoding='utf8') 

If your file uses a different encoding, you will need to specify that instead.
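
As an illustration of passing a non-UTF-8 encoding, here is a minimal sketch that assumes the file is Latin-1 encoded. That is only a guess for demonstration purposes; the sample title, the temporary file, and the variable names are all made up, and the real file's encoding has to be determined separately.

```python
import os
import tempfile

import pandas as pd

# Hypothetical sample data. "\u00f1" (n with tilde) is stored as the
# single byte 0xf1 in Latin-1, which is not valid UTF-8 on its own.
text = 'Movie_names\tRating\n"Se\u00f1or A"\t10\n"B"\t6.5\n'

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "movie.txt")

    # Write the demo file using Latin-1 rather than UTF-8.
    with open(path, "w", encoding="latin-1", newline="") as f:
        f.write(text)

    # Tell pandas which encoding to decode with (assumed here to be Latin-1).
    df = pd.read_csv(path, sep="\t", encoding="latin-1")

print(df)
```

If the encoding is unknown, trying a few likely candidates on a copy of the file is one pragmatic way to find it, but the result should be checked by eye, since a wrong single-byte encoding can decode without error and still yield the wrong characters.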

mfitzp
0

First, import pandas and read the file:

import pandas
Df = pandas.read_csv("file.csv")

Get rid of the double quotes with:

Df2 = Df['columnwithquotes'].apply(lambda x: x.replace('"', ''))
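
For cases where the quotes actually survive parsing, here is a minimal sketch of stripping them with the vectorized string methods. It assumes the file is read with quoting disabled (csv.QUOTE_NONE) so that the quotes are kept in the first place; the sample data and variable names are illustrative only.

```python
import csv
from io import StringIO

import pandas as pd

# Hypothetical in-memory stand-in for the tab-delimited file.
raw = 'Movie_names\tRating\n"A"\t10\n"B"\t6.5\n'

# QUOTE_NONE makes pandas treat the double quotes as ordinary characters,
# so they remain part of the values after parsing.
df = pd.read_csv(StringIO(raw), sep="\t", quoting=csv.QUOTE_NONE)

# Strip a leading and trailing double quote from every title.
df["Movie_names"] = df["Movie_names"].str.strip('"')

print(df["Movie_names"].tolist())
```

Unlike replace('"', ''), str.strip('"') only removes quotes at the ends of each value, so a quote inside a title would be left alone.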
Mpark
  • I get a whole range of errors. It ends with "UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf1 in position 2: invalid continuation byte" and it is NOT a file with a csv extension. It has a .txt extension. – Mainul Islam Oct 11 '16 at 21:04
  • I am using Python 3, so that may be the reason for the Unicode error. I believe csvreader has the ability to read text files and convert them to CSV first. – Mpark Oct 11 '16 at 21:19
0

You can use read_table: its quotechar parameter is set to '"' by default, so the double quotes are removed automatically.

import pandas as pd
from io import StringIO

# Columns are separated by tabs, written here as explicit \t escapes.
the_data = """
A\tB\tC\tD
ABC\t2016-6-9 0:00\t95\t"foo foo"
ABC\t2016-6-10 0:00\t0\t"bar bar"
"""
df = pd.read_table(StringIO(the_data))
print(df)

#      A               B   C        D
# 0  ABC   2016-6-9 0:00  95  foo foo
# 1  ABC  2016-6-10 0:00   0  bar bar
Romain