clean a UTF-16 text file using Pandas Dataframe / Python

Question

I have a text file with UTF-16 encoding which looks like

"Day;""name"";""age"";""money"""
"2021-04-01;""mona"";""27"";"""""
"2021-04-02;""Jan"";""29"";"""""
"2021-04-03;""Shan"";""25"";"""""

I want to convert it into pandas dataframe, based on one another answer I tried this

# Just manually make the columns
cols = ['Day', 'name', 'age', 'money']
df = pd.DataFrame(columns=cols)

#df
#Empty DataFrame
#Columns: [Day, name, age, money]
#Index: []

# Read the CSV in to a list
import csv
data = []
with open('test.csv', 'r') as f:
    reader = csv.reader(f)
    # Skip the header since we already have column names
    next(reader)
    for row in reader:
        # List comprehension to edit the strings
        edited = [s.replace('"', '').split(';') for s in row][0]
        data.append(edited)
    
# Use our column names to make a new dataframe
df2 = df.append(pd.DataFrame(data, columns=cols))

but this doesn't work and we get error in line (The idea was to trim all the " from the .txt file and then read it in a dataframe with a delimiter ';')

edited = [s.replace('"', '').split(';') for s in row][0]

----list index out of range---

The reason is that if I print the rows from above code we get

['2021-04-01;"pqr";"15";""'] ['2021-04-02;"pqrs";"152";""'] ['2021-04-03;"pqrt";"152";""']

so basically all the columns are now inside just 1 column and above code to split isn't working.

my requested solution should look something like

    Day         name  age   money
0   2021-04-01  mona   27    NaN
1   2021-04-02  Jan    29    NaN
2   2021-04-03  Shan   25    NaN

Can anyone help me please, Thanks in advance.

SCKU · Accepted Answer · 2021-06-18T13:07:39.460

2

Yup, it would be tricky because ; looks like within " "...

You may want to replace all " to , then use io.StringIO with pd.read_csv to get the dataframe split it by ;:

from io import StringIO as sio
import pandas as pd

your_file = sio(
    '''"Day;""name"";""age"";""money"""
"2021-04-01;""mona"";""27"";"""""
"2021-04-02;""Jan"";""29"";"""""
"2021-04-03;""Shan"";""25"";"""""
'''
)

read_all_once = your_file.read()
read_all_once = read_all_once.replace('"', "")

file_like = sio(read_all_once)
df = pd.read_csv(file_like, sep=';')

df:
    Day         name    age money
0   2021-04-01  mona    27  NaN
1   2021-04-02  Jan     29  NaN
2   2021-04-03  Shan    25  NaN

edited Jun 18 '21 at 13:07

answered Jun 18 '21 at 13:04

SCKU

783
9
14

but the text file is in UTF-16 format so that might be an issue with sime .read(), no? – sdave Jun 18 '21 at 13:05
2

https://stackoverflow.com/questions/19328874/python-read-text-file-with-weird-utf-16-format this conversion should be done before. – Pvria Ansari Jun 18 '21 at 13:08
1

like @PouriaAnsari says, although it's another questions, you may use `io.open('data.txt','r', encoding='utf-16-le')` to get the "python unicode string", then do whatever you want. – SCKU Jun 18 '21 at 13:12

clean a UTF-16 text file using Pandas Dataframe / Python

1 Answers1