
I have a CSV file that has fields that contain newlines e.g.:

A, B, C, D, E, F
123, 456, tree
, very, bla, indigo

(In this case the third field in the second row is "tree\n".)

I tried the following:

import csv
catalog = csv.reader(open('test.csv', 'rU'), delimiter=",", dialect=csv.excel_tab)
for row in catalog:
    print "Length: ", len(row), row

and the result I got was this:

Length:  6 ['A', ' B', ' C', ' D', ' E', ' F']
Length:  3 ['123', ' 456', ' tree']
Length:  4 ['   ', ' very', ' bla', ' indigo']

Does anyone have any idea how I can quickly remove extraneous newlines?

Thanks!

mo5470
  • Is the comma always at the end or inside of a field, i.e. not at the beginning? Alternatively, is the number of columns fixed? If neither, it can't be done. – Chinmay Kanchi Jun 21 '12 at 20:55
  • The number of columns is fixed, to 6 in this example. – mo5470 Jun 21 '12 at 21:03
  • Are you guaranteed only one newline per field? Also do you expect to have to handle commas inside the fields? I.e. quoted fields? – Maria Zverina Jun 21 '12 at 21:26
  • is it true that both `','` and `'\n'` terminate a field and there can't be empty fields? – jfs Jun 21 '12 at 22:04
  • What is the result you are looking for? Is it supposed to be `[['A','B','C','D','E','F'],['123','456','tree','very','bla','indigo']]` or something else? Will any of the fields have embedded quoted commas, like this: `A,"B,C", D`? – the wolf Jun 22 '12 at 00:35

6 Answers


Suppose you have this Excel spreadsheet:

[Image: common 'gotchas' in an Excel spreadsheet]

Note:

  1. the multi-line cell in C2;
  2. the embedded commas in C1 and D3;
  3. the blank cells, and the cell containing only a space in D4.

Saving that as CSV in Excel, you will get this csv file:

A1,B1,"C1,+comma",D1
,B2,"line 1
line 2",D2
,,C3,"D3,+comma"
,,,D4 space

Presumably, you will want to read that into Python with the blank cells still having meaning and the embedded commas treated correctly.

So, this:

import csv

with open("test.csv", 'rU') as csvIN:
    outCSV = (line for line in csv.reader(csvIN, dialect='excel'))

    for row in outCSV:
        print("Length: ", len(row), row)

correctly produces the 4x4 List of List matrix represented in Excel:

Length:  4 ['A1', 'B1', 'C1,+comma', 'D1']
Length:  4 ['', 'B2', 'line 1\nline 2', 'D2']
Length:  4 ['', '', 'C3', 'D3,+comma']
Length:  4 ['', '', '', 'D4 space']

The example CSV file you posted lacks quotes around the field with the 'extra newline', rendering the meaning of that newline ambiguous. Is it a new row or a multi-line field?

Therefore, you can only interpret this CSV file:

A, B, C, D, E, F
123, 456, tree
, very, bla, indigo

as a one-dimensional list, like so:

import csv

with open("test.csv", 'rU') as csvIN:
    outCSV = [field.strip() for row in csv.reader(csvIN, delimiter=',')
              for field in row if field]

Which produces this one-dimensional list:

['A', 'B', 'C', 'D', 'E', 'F', '123', '456', 'tree', 'very', 'bla', 'indigo']

This can then be interpreted and regrouped into any sub grouping as you wish.

The idiomatic regrouping method in Python uses `zip`, like so:

>>> zip(*[iter(outCSV)]*6)
[('A', 'B', 'C', 'D', 'E', 'F'), ('123', '456', 'tree', 'very', 'bla', 'indigo')]

Or, if you want a list of lists, this is also idiomatic:

>>> [outCSV[i:i+6] for i in range(0, len(outCSV),6)]
[['A', 'B', 'C', 'D', 'E', 'F'], ['123', '456', 'tree', 'very', 'bla', 'indigo']]

If you can change how your CSV file is created, it will be less ambiguous to interpret.
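For example, if whoever produces the file can write it with Python's csv module, embedded commas and newlines get quoted automatically and the file reads back unambiguously. A minimal sketch (the file name and sample rows here are made up):

import csv

rows = [['A1', 'B1', 'C1,+comma', 'D1'],
        ['', 'B2', 'line 1\nline 2', 'D2']]

# The excel dialect quotes any field containing the delimiter or a newline,
# so csv.reader can later reconstruct the rows exactly.
with open('clean.csv', 'wb') as f:   # Python 2; on Python 3 use open('clean.csv', 'w', newline='')
    writer = csv.writer(f, dialect='excel')
    writer.writerows(rows)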

the wolf
    Thank you for that answer. FYI, you are creating an iterator from an iterator, which is useless. `outCSV=(line for line in csv.reader(csvIN, dialect='excel'))` may be replaced by `outCSV=csv.reader(csvIN, dialect='excel')` – Réchèr Aug 09 '16 at 08:57

This will work if you have no blank cells:

from itertools import chain

data = [['A', ' B', ' C', ' D', ' E', ' F'],
        ['123', ' 456', ' tree'],
        ['   ', ' very', ' bla', ' indigo']]

flat_list = chain.from_iterable(data)
flat_list = [cell for cell in flat_list if cell.strip() != '']  # remove blank cells

rows = [flat_list[i:i+6] for i in range(0, len(flat_list), 6)]  # chunk into groups of 6
print rows

Output:

[['A', ' B', ' C', ' D', ' E', ' F'], ['123', ' 456', ' tree', ' very', ' bla', ' indigo']]

If you have blank cells in the input, this will work most of the time:

data = [['A', ' B', ' C', ' D', ' E', ' F'],
        ['123', ' 456', ' tree'],
        ['   ', ' very', ' bla', ' indigo']]

clean_rows = []
saved_row = []

for row in data:
    if len(saved_row):
        row_tail = saved_row.pop()
        row[0] = row_tail + row[0]  # reconstitute field broken by newline
        row = saved_row + row       # and reassemble the row (possibly only partially)
    if len(row) >= 6:
        clean_rows.append(row)
        saved_row = []
    else:
        saved_row = row


print clean_rows 

Output:

[['A', ' B', ' C', ' D', ' E', ' F'], ['123', ' 456', ' tree   ', ' very', ' bla', ' indigo']]

However, even the second solution will fail with input such as:

A,B,C,D,E,F\nG
1,2,3,4,5,6

In this case the input is ambiguous, and no algorithm will be able to guess whether you meant:

A,B,C,D,E,F
G\n1,2,3,4,5,6 

(or the input given above).

If this could be the case for you, you'll have to go back to the person saving the data and make them save it in a cleaner format (by the way, OpenOffice quotes newlines in CSV files far better than Excel does).

Maria Zverina
  • The extra newline is after `tree` – Chinmay Kanchi Jun 21 '12 at 21:11
  • Thanks for the suggestion. However, (and maybe I didn't express my requirements properly) my data could look like this: data = [['A', ' ', ' C', ' ', ' E', ' F'], ['123', ' ', ' tree'], [' ', ' very', ' bla', ' indigo']] in which case, removing empty elements doesn't really help. – mo5470 Jun 21 '12 at 21:14
  • Ah - I see ... ok ... getting closer ... one more go :) – Maria Zverina Jun 21 '12 at 21:19
  • @carrot-top Don't see why it wouldn't work with embedded commas – as long as the csv reader handles them, there is nothing in my algorithm that would break because of embedded commas. – Maria Zverina Jun 22 '12 at 09:53
  • It seems you're keeping all rows in memory, which wouldn't work for large files. How could I do the same without loading the whole file in memory at once? – Daniel Serodio Jul 23 '12 at 17:47
  • Instead of doing clean_rows.append(row), insert something like process_my_row(row) and define a function that consumes the row. – Maria Zverina Jul 24 '12 at 08:18
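For instance, the loop from the answer above could be rewrapped as a generator so reconstructed rows are yielded one at a time straight from a csv reader instead of being collected in a list. A sketch only: merged_rows is a made-up name, and the 6-column width and process_my_row come from this thread.

import csv

def merged_rows(csv_file, ncols=6):
    # yield reassembled rows one at a time instead of storing them all
    saved_row = []
    for row in csv.reader(csv_file):
        if not row:
            continue                    # skip blank lines
        if saved_row:
            row_tail = saved_row.pop()
            row[0] = row_tail + row[0]  # reconstitute field broken by newline
            row = saved_row + row
        if len(row) >= ncols:
            yield row
            saved_row = []
        else:
            saved_row = row

with open('test.csv', 'rU') as f:
    for row in merged_rows(f):
        process_my_row(row)  # hypothetical consumer, as suggested in the comment above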

This should work (warning: brain-compiled code):

with open('test.csv', 'rU') as infile:
    data = []
    for line in infile:
        temp_data = line.rstrip('\n').split(',')
        try:
            while len(temp_data) < 6:  # expected column count
                # a field was broken by a newline: pull in the next physical
                # line and glue its first field onto the last field so far
                next_fields = infile.next().rstrip('\n').split(',')
                temp_data[-1] += next_fields[0]
                temp_data.extend(next_fields[1:])
        except StopIteration:
            pass
        data.append(temp_data)
Chinmay Kanchi

This works with the CSV module and cleans blank fields and lines:

import csv
import StringIO

data="""A, B, C, D, E, F
123, 456, tree

,,
, very, bla, indigo"""

f=StringIO.StringIO(data)   #used just to simulate a file. Use your file here...
reader = csv.reader(f)
out=[]
for line in reader:
    line=[x.strip() for x in line if x]   # remove 'if x' if you want blank fields
    if len(line):
        out.append(line)

print out        

Prints:

[['A', 'B', 'C', 'D', 'E', 'F'], 
 ['123', '456', 'tree'], 
 ['very', 'bla', 'indigo']]

If you want that in 6 col chunks:

cols=6        
out=[i for sl in out for i in sl]                      # flatten out
out=[out[i:i+cols] for i in range(0, len(out), cols)]  # rechunk into 'cols' 

Prints:

[['A', 'B', 'C', 'D', 'E', 'F'],
 ['123', '456', 'tree', 'very', 'bla', 'indigo']]
  • @Chinmay Kanchi: How so? It handles blank lines and blank fields, the csv module handles the complexity of the CSV parsing, and it is easy to rechunk into N-column rows. –  Jun 21 '12 at 21:49
  • +1, but this would be kinder on memory with generators instead of lists. – the wolf Jun 22 '12 at 00:37
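A minimal sketch of that suggestion, moving the cleanup into a generator so rows are produced lazily rather than accumulated in a list (iter_clean_rows is a made-up name):

import csv

def iter_clean_rows(reader):
    # yield each non-empty row with blank fields dropped, one at a time
    for line in reader:
        line = [x.strip() for x in line if x]
        if line:
            yield line

with open('test.csv', 'rU') as f:
    for row in iter_clean_rows(csv.reader(f)):
        print row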

If the number of fields in each row is the same and fields can't be empty:

from itertools import izip_longest

nfields = 6
with open(filename) as f:
     fields = (field.strip() for line in f for field in line.split(',') if field)
     for row in izip_longest(*[iter(fields)]*nfields): # grouper recipe*
         print(row)

* grouper recipe
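For reference, the grouper recipe from the itertools documentation looks roughly like this (Python 2 spelling):

from itertools import izip_longest

def grouper(iterable, n, fillvalue=None):
    "Collect data into fixed-length chunks or blocks"
    # grouper('ABCDEFG', 3, 'x') --> ABC DEF Gxx
    args = [iter(iterable)] * n
    return izip_longest(fillvalue=fillvalue, *args)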

Output

('A', 'B', 'C', 'D', 'E', 'F')
('123', '456', 'tree', 'very', 'bla', 'indigo')
jfs

If you know the number of columns, the simplest way is to ignore the line endings (treat them as just another delimiter) and then split.

Something like this

with open(filename, 'rU') as fp:
    data = fp.read().replace('\n', ',')   # treat line breaks as field separators

data = [field for field in data.split(',') if field.strip()]
for n in range(0, len(data), 6):
    print(data[n:n+6])

You can convert it easily into a generator if you prefer:

def read_ugly_file(filename, delimiter=',', columns=6):
    with open(filename, 'rU') as fp:
        data = fp.read().replace('\n', delimiter)

    data = [field for field in data.split(delimiter) if field.strip()]
    for n in range(0, len(data), columns):
        yield data[n:n+columns]

for row in read_ugly_file('myfile.csv'):
    print(row)
Hernan