I have been stuck for a day on a weird problem. I have a CSV file that I need to import into my Hive table. The CSV file, however, has newline characters embedded inside its quoted strings. Because the files are huge, I can't use a text editor to replace the '\n' characters.
I wrote a Python program to help me clean the file. It reads each row from the CSV file and, whenever a field contains a newline character, replaces it with a space. Below is my program.
# -*- coding: utf-8 -*-
import csv
import sys

file = open("team_contacts_cleaned.csv","w")
with open('team_contacts.csv') as csvfile:
    reader = csv.reader(csvfile)
    for row in reader:
        stripped = [col.replace('\n', '') for col in row]
        file.write(','.join(stripped))
        file.write('\n')
file.close()
print 'Done'
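For comparison, here is a minimal sketch of the same cleaning step written for Python 3. It rests on an assumption that has not been confirmed for this file: that the embedded line breaks may include carriage returns ('\r') as well as '\n'. The names `clean_row` and `clean_csv` are hypothetical helpers, not part of the program above. It also writes through `csv.writer` rather than `','.join`, so that fields containing commas or quotes are still quoted correctly.

```python
import csv
import re

# Matches any embedded line break: \r\n, a lone \r, or a lone \n.
LINE_BREAKS = re.compile(r'\r\n|\r|\n')


def clean_row(row):
    """Replace embedded line breaks in every field with a single space."""
    return [LINE_BREAKS.sub(' ', col) for col in row]


def clean_csv(src_path, dst_path):
    """Copy src_path to dst_path, flattening multi-line fields row by row."""
    # newline='' hands line endings to the csv module untranslated (Python 3).
    with open(src_path, newline='') as src, \
            open(dst_path, 'w', newline='') as dst:
        writer = csv.writer(dst, lineterminator='\n')
        for row in csv.reader(src):
            writer.writerow(clean_row(row))


if __name__ == '__main__':
    clean_csv('team_contacts.csv', 'team_contacts_cleaned.csv')
```

The file is streamed one record at a time, so its size should not matter.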
With the cleaned file, the line count matches what I expect. When I grep the file for a string that I know was breaking the record, the matching line is printed to the console, but the string I searched for does not appear in the printed output.
E.g.
Original File
cat team_contacts.csv | grep -A4 'Yennai Nambi'
,,,,,11/30/2017 11:45 AM UTC,,,,12/29/2017 11:51 AM UTC,,"Yennai Nambi Vandhavarai Yaemaatra Maattaen ;
Verum Yaeniyaay Naanirundhu Yaemaatra Maattaen ;
Naan Uyir Vaazhndhaal Ingaedhaan ;
Ooadivida Maattaen .",0,
Cleaned File
cat team_contacts_cleaned.csv | grep 'Naan Uyir Vaazhndhaal Ingaedhaan'
,,,,,11/30/2017 11:45 AM UTC,,,,12/29/2017 11:51 AM UTC,,Yennai Nambi Vandhavarai Yaemaatra MaOoadivida Maattaen .,0,
It looks like the data got erased when I cat the file. However, grep is able to locate the string exactly, which means the string is still there. So why isn't it showing up?
Now when I move this cleaned file to Hive, it breaks again and the data shows up like this:
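One way to check whether hidden control characters explain the truncated display is to print the matching lines with every non-printable character escaped. A terminal moves the cursor back to the start of the line when it meets a carriage return, so later text can overwrite earlier text on screen even though the bytes are all still in the file. Whether this file actually contains '\r' is an assumption to verify. `show_hidden` and `matching_lines` are hypothetical helper names, and the sketch assumes Python 3 and UTF-8 input.

```python
def show_hidden(text):
    """Return text with non-printable characters shown as escapes (e.g. \\r)."""
    return ''.join(
        ch if ch.isprintable() else repr(ch)[1:-1]
        for ch in text
    )


def matching_lines(path, needle):
    """Yield lines of path that contain needle, with control chars escaped."""
    # Binary mode avoids any newline translation; lines are split on \n only.
    with open(path, 'rb') as f:
        for raw in f:
            line = raw.decode('utf-8', errors='replace')
            if needle in line:
                yield show_hidden(line)


if __name__ == '__main__':
    for line in matching_lines('team_contacts_cleaned.csv',
                               'Naan Uyir Vaazhndhaal Ingaedhaan'):
        print(line)
```

If the output contains a literal `\r` between the fragments of the field, the cleaning step needs to handle that character as well as '\n'.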
Verum Yaeniyaay Naanirundhu Yaemaatra Maattaen ; NULL NULL NULL NULL NULL NULLNULL
Naan Uyir Vaazhndhaal Ingaedhaan ; NULL NULL NULL NULL NULL NULL NULL NULLNULL
What am I missing here?
I even tried a gawk program before writing the Python code, and I faced the same issue.
gawk -v RS='"' 'NR % 2 == 0 { gsub(/\n/, "") } { printf("%s%s", $0, RT) }' team_contacts.csv > team.csv