1

I am stuck since 1day with a weird problem. I have a CSV file which I need to import into my hive table. The CSV file, however, has newline characters embedded in between the strings. As the files are huge I am not able to use a text editor to replace the '\n' character.

I wrote a python program to help me clean the file. I read each row from the CSV file and if I encounter any newline character I replace it with space. Below is my program.

# -*- coding: utf-8 -*-
import csv
import sys


file = open("team_contacts_cleaned.csv","w")
with open('team_contacts.csv') as csvfile:
reader = csv.reader(csvfile)
for row in reader:
    stripped = [col.replace('\n', '') for col in row]
    file.write(','.join(stripped))
    file.write('\n')
file.close()
print 'Done'

Once I have this cleaned file I see that the line counts match as expected. and when I grep the file on the strings which I know is breaking the record the exact line is printed in the console, however, I don't see that line in the output.

Eg.

Original File

cat team_contacts.csv | grep -A4 'Yennai Nambi'
,,,,,11/30/2017 11:45 AM UTC,,,,12/29/2017 11:51 AM UTC,,"Yennai Nambi Vandhavarai Yaemaatra Maattaen ;
Verum Yaeniyaay Naanirundhu Yaemaatra Maattaen ;
Naan Uyir Vaazhndhaal Ingaedhaan ;
Ooadivida Maattaen .",0,

Cleaned File

cat team_contacts_cleaned.csv | grep 'Naan Uyir Vaazhndhaal Ingaedhaan'
,,,,,11/30/2017 11:45 AM UTC,,,,12/29/2017 11:51 AM UTC,,Yennai Nambi Vandhavarai Yaemaatra MaOoadivida Maattaen .,0,

it looks like the data got erased when I cat the file however the grep is able to exactly locate the string which means the string is still there but why isn't it showing up?

Now when I move this cleaned file to hive it again breaks and data shows up like this

Verum Yaeniyaay Naanirundhu Yaemaatra Maattaen ;    NULL    NULL    NULL    NULL    NULL    NULLNULL

Naan Uyir Vaazhndhaal Ingaedhaan ;  NULL    NULL    NULL    NULL    NULL    NULL    NULL    NULLNULL    

What am I missing here ?

I even tried a gawk program before writing a python code I faced the same issue.

gawk -v RS='"' 'NR % 2 == 0 { gsub(/\n/, "") } { printf("%s%s", $0, RT) }' team_contacts.csv > team.csv
Sridhar Iyer
  • 189
  • 1
  • 7
  • This probably isn't your issue, but… is there a reason you're using `csv` to read the input CSV, but not using `csv` to write the output CSV? – abarnert Apr 06 '18 at 18:34
  • 1
    Meanwhile, I'm having a hard time understanding what problem you're describing. It would really help if you give us a [mcve]—with sample input, expected output, and actual output—so we can reproduce it ourselves and debug it. (Follow that link above to the help page; it has a lot of useful information.) – abarnert Apr 06 '18 at 18:36
  • No need for Python to change newlines to spaces, just do `tr '\n' ' ' < team_contacts.csv > team_contacts_cleaned.csv`. – agc Apr 06 '18 at 22:48

1 Answers1

0

I resolved this problem. when I broke the file into parts, I got this error

_csv.Error: new-line character seen in unquoted field

The following link helped me resolve it. https://stackoverflow.com/a/32827068/3538426

This can be closed. Updated the code

import csv
import sys

csv.field_size_limit(sys.maxsize)

file = open("team_contacts_cleaned.csv","w")
with open('team_contacts.csv','rU') as csvfile:
    reader = csv.reader(csvfile)
    writer = csv.writer(file,quoting=csv.QUOTE_ALL)
    for row in reader:
        stripped = [col.replace('\n', ' ') for col in row]
        writer.writerow(stripped)
file.close()
print 'Done'

@agc your solution will not work as it will remove all newlines.

Sridhar Iyer
  • 189
  • 1
  • 7