2

Problem Summary

  • I'm trying to load .txt files in Python using Pandas.
  • The .txt files uses | delimiter between fields
  • Each field is captured between double quotes "" as a string: e.g. "i_am_a_string"
  • The problem is some fields have apostrophes represented with double quotes. e.g. "I"m_not_a_valid_string" (it should be "I'm_not_a_valid_string")

Sample file

To demonstrate my issue I have created a test file which is as follows when edited in vi:

"Name"|"Surname"|"Address"|"Notes"^M     
"Angelo"|""|"Kenton Square 5"|"Note 1"^M  
"Angelo"|""|"Kenton’s ^M                 
Sqr5"|"note2"^M                          
"Angelo"|""|"Kenton"s ^M                   
Road"|"Note3"^M

Loading data

To load this file I run the following command in Jupyter notebook:

test = pd.read_csv('test.txt', sep ='|')

which loads up the file like the screenshot below:

enter image description here

Questions

There's 2 issues I'm looking to address represented in examples "note2" and and "Note3" in the file:

note2 question

How can I get rid of the ^M when loading the file? i.e. how can I remove the "\r\r\n" from the Address column when loaded up in Jupyter. The "note2" example should have loaded like this in the Address column:

enter image description here

  • Should I remove these before loading the file using bash commands or
  • Should I remove these after I load it in Jupyter using Python?
  • Can you please suggest the code to do it in each case and which one would you recommend (and why)?

Note3 question

How do I replace the double quote within the string expression with apostrophe? here it breaks it to another line which is incorrect. This should be loaded in row 2 as follows: enter image description here

"Note3" example is a compounded one as it also has the "^M" characters in the string but here I'm interested in replacing the double quotes with an apostrophe so it doesn't break it to another line corrupting the loading.

Thank you for your help, much appreciated.

Angelo

Angelo
  • 341
  • 1
  • 2
  • 11
  • Use `.str.replace()` for both tasks. – DYZ Sep 18 '21 at 05:58
  • THank you @DYZ. That would solve note2 issue but in Note3 it breaks to another line. How do I do that before I load the file? – Angelo Sep 18 '21 at 06:07
  • 1
    Check [this question](https://stackoverflow.com/questions/52366670/handle-unwanted-line-breaks-with-read-csv-in-pandas). – DYZ Sep 18 '21 at 06:17

1 Answers1

1

How do I replace the double quote within the string expression with apostrophe?

If " which are to be converted into ' are always between letters (word characters) you might preprocess your file using regular expression (re) following way

import re
txt = '''"Name"|"Surname"|"Address"|"Notes"
"Angelo"|""|"Kenton Square 5"|"Note 1"
"Angelo"|""|"Kenton’s   
Sqr5"|"note2"                    
"Angelo"|""|"Kenton"s
Road"|"Note3"'''
clean_text = re.sub(r'(?<=\w)"(?=\w)', "'", txt)
print(clean_text)

output

"Name"|"Surname"|"Address"|"Notes"
"Angelo"|""|"Kenton Square 5"|"Note 1"
"Angelo"|""|"Kenton’s   
Sqr5"|"note2"                    
"Angelo"|""|"Kenton's
Road"|"Note3"

Explanation: use zero-length assertion to find " which are after word character and before word character.

If you have text in file, firstly read it as text file i.e.

with open("test.txt","r") as f:
    txt = f.read()

then clean it

import re
clean_text = re.sub(r'(?<=\w)"(?=\w)', "'", txt)

then put it into pandas.DataFrame using io.StringIO as follows

import io
import pandas as pd
test = pd.read_csv(io.StringIO(clean_text), sep ='|')
Daweo
  • 31,313
  • 3
  • 12
  • 25
  • Thank you for your answer. This works if you define the text to be pre-processed in jupyter. In my case the text was part of a file e.g. Angelo.txt and by the time I load the file to a variable it's broken Note3 to another line due to this issue. So just thinking, should I be doing it in bash or some other code in jupyter before I use the pd.read_csv command? – Angelo Sep 18 '21 at 06:28