
I have a CSV that I am not able to read using read_csv. Opening the CSV with Sublime Text shows something like:

col1,col2,col3
text,2,3
more text,3,4
HELLO

THIS IS FUN
,3,4

As you can see, the text HELLO THIS IS FUN spans three lines, and pd.read_csv gets confused: it thinks these are three new observations. How can I parse this correctly in pandas?
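
For reference, this is roughly the call I'm making (the filename data.csv is just a placeholder):

import pandas as pd

df = pd.read_csv('data.csv')
# HELLO and THIS IS FUN come back as separate one-column rows padded with NaN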

Thanks!

ℕʘʘḆḽḘ
  • Interesting problem. If we don't treat the new lines as new observations, how do we know whether "text, 2, 3" really should be "text, 2, 3 more text"? I am not sure you can properly format this with this input. – Scott Boston May 03 '17 at 13:53
  • yeah, that's a problem here... maybe by forcing the parser to find exactly three columns? – ℕʘʘḆḽḘ May 03 '17 at 14:00
  • I would open it in pure Python and replace all whitespace with e.g. one underscore. You can identify the lines by the absence of commas within the newline characters [a rough sketch of this idea appears after the comments]. Is this behavior consistent with the uppercase letters? – Moritz May 03 '17 at 14:04
  • thanks @moritz. good idea. can you please write some pseudo code to do that? – ℕʘʘḆḽḘ May 03 '17 at 14:04
  • What? You have 3k+ reputation. I think you can do it on your own. Are you familiar with the "with open('file', 'r') as f: for line in f: do something" syntax? – Moritz May 03 '17 at 14:06
  • i want to give you the opportunity to shine!!! :D – ℕʘʘḆḽḘ May 03 '17 at 14:13
  • Possible duplicate of [Handling extra newlines (carriage returns) in csv files parsed with Python?](http://stackoverflow.com/questions/11146564/handling-extra-newlines-carriage-returns-in-csv-files-parsed-with-python) –  May 03 '17 at 17:56
  • guys stop with finding duplicates that are not duplicates... – ℕʘʘḆḽḘ May 03 '17 at 18:00
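
A rough sketch of the idea in Moritz's comment might look like the following; data.csv, the underscore replacement, and the assumption that only the comma-free first field ever gets broken across lines are all guesses at the details:

# Glue together physical lines that contain no comma (fragments of a broken
# first field), replacing whitespace in those fragments with underscores.
rows = []
pieces = []                       # comma-free fragments collected so far
with open('data.csv', 'r') as f:
    for line in f:
        line = line.strip()
        if not line:
            continue              # blank line inside a broken field
        if ',' not in line:
            pieces.append(line.replace(' ', '_'))
        else:
            rows.append('_'.join(pieces) + line if pieces else line)
            pieces = []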

1 Answer


It looks like you'll have to preprocess the data manually:

with open('data.csv', 'r') as f:
    lines = f.read().splitlines()

processed = []
buffer = ''
for line in lines:
    buffer += line                # Append the current line to the buffer
    c = buffer.count(',')         # A complete 3-column row contains exactly 2 commas
    if c == 2:
        processed.append(buffer)  # The buffer now holds a full row; emit it
        buffer = ''
    elif c > 2:
        raise ValueError('malformed row: %r' % buffer)  # This should never happen

This assumes that your data only contains unwanted newlines, e.g. if you had data with, say, 3 elements in one row and 2 elements in the next, then the next row should either be blank or contain only 1 element. If it has 2 or more, i.e. it's missing a necessary newline, an error is raised. You can accommodate this case with a minor modification if necessary.
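
One possibility (just a sketch, if you would rather set problem rows aside for later inspection than abort) is a variant of the loop above:

# Variant that collects malformed buffers instead of raising
processed = []
bad_rows = []                     # buffers with more than 2 commas, kept for manual inspection
buffer = ''
for line in lines:                # 'lines' as read in the snippet above
    buffer += line
    c = buffer.count(',')
    if c == 2:
        processed.append(buffer)
        buffer = ''
    elif c > 2:
        bad_rows.append(buffer)   # don't guess where the missing newline was
        buffer = ''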

Actually, it might be more efficient to remove newlines instead, but it shouldn't matter unless you have a lot of data.
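
To then get the result into pandas, one option (a sketch using io.StringIO and the processed list built above) is:

import io
import pandas as pd

# Join the repaired rows back into one CSV string and let pandas parse it
df = pd.read_csv(io.StringIO('\n'.join(processed)))
print(df)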

Ken Wei