Remove multiple EOL in file

Question

I have a tab delimited file with \n EOL characters that looks something like this:

User Name\tCode\tTrack\tColor\tNote\n\nUser Name2\tCode2\tTrack2\tColor2\tNote2\n

I am taking this input file and reformatting it into a nested list using split('\t'). The list should look like this:

[['User Name','Code','Track','Color','Note'],
 ['User Name2','Code2','Track2','Color2','Note2']]

The software that generates the file allows the user to press the "enter" key any number of times while filling out the "Note" field. It also allows the user to press "enter" any number of times, creating newlines without entering any visible text in the "Note" field at all.

Lastly, the user may press "enter" any number of times in the middle of the "Note" creating multiple paragraphs, but this would be such a rare occurrence from the operational standpoint that I am willing to leave this eventuality not addressed if it complicates the code much. This possibility is really, really low priority.

As seen in the sample above, these actions can result in a sequence of "\n\n..." codes of any length preceding, trailing or replacing the "Note" field. Or to put it this way, the following replacements are required before I can place the file object into a list:

\t\n\n... preceding "Note" must become \t
\n\n... trailing "Note" must become \n
\n\n... in place of "Note" must become \n
\n\n... in the middle of the "Note" text must become a single space, if easy to do

I have tried using the strip() and replace() methods without success. Does the file object need to be copied into something else first before the replace() method can be used on it?

I have experience with Awk, but I am hoping Regular Expressions are not needed for this as I am very new to Python. This is the code that I need to improve in order to address multiple newlines:

marker = [i.strip() for i in open('SomeFile.txt', 'r')]

marker_array = []
for i in marker:
    marker_array.append(i.split('\t'))

for i in marker_array:
    print i

Can you modify the software that generates `SomeFile.txt`? If so, is it written in Python? — falsetru, Jul 09 '13 at 07:14
It would have helped had you used the `csv` module to *write* this data and properly quote the Note field. — Martijn Pieters, Jul 09 '13 at 07:32
The software that generates the text file is not written in Python. Modifying it is not an option. — I_Ridanovic, Jul 09 '13 at 15:11

Martijn Pieters · Accepted Answer · 2013-07-09T08:23:24.367

Count the tabs; if you presume that the note field never has 4 tabs on one line in it, you can collect the note until you find a line that does have 4 tabs in it:

def collapse_newlines(s):
    # Collapse multiple consecutive newlines into one; removes trailing newlines
    return '\n'.join(filter(None, s.split('\n')))

def read_tabbed_file(filename):
    with open(filename) as f:
        row = None
        for line in f:
            if line.count('\t') < 4:   # Note continuation
                if row is not None:    # ignore stray lines before the first record
                    row[-1] += line
                continue

            if row is not None:
                row[-1] = collapse_newlines(row[-1])
                yield row

            row = line.split('\t')

        if row is not None:
            row[-1] = collapse_newlines(row[-1])
            yield row

The above generator function will not yield a row until it is certain that there is no note continuing on the next line, effectively looking ahead.

Now use the read_tabbed_file() function as a generator and loop over the results:

for row in read_tabbed_file(yourfilename):
    # row is a list of elements
    print row

Demo:

>>> open('/tmp/test.csv', 'w').write('User Name\tCode\tTrack\tColor\tNote\n\nUser Name2\tCode2\tTrack2\tColor2\tNote2\n')
>>> for row in read_tabbed_file('/tmp/test.csv'):
...     print row
... 
['User Name', 'Code', 'Track', 'Color', 'Note']
['User Name2', 'Code2', 'Track2', 'Color2', 'Note2']
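
The question also asked, as a low priority, for newlines in the middle of a note to become a single space. Here is a minimal sketch of that variant, assuming the same four-tab rule as above. The names `normalize_note` and `iter_records` are hypothetical and not part of the answer's code, and the function takes any iterable of lines instead of a filename so that it can be tried without a file. It should run unchanged on Python 2 and Python 3.

```python
def normalize_note(note):
    # Drop empty lines and surrounding whitespace, then join what is left
    # with single spaces (instead of the single newlines used above).
    parts = [p.strip() for p in note.split('\n')]
    return ' '.join(p for p in parts if p)


def iter_records(lines, tabs_per_record=4):
    # Hypothetical variant of read_tabbed_file(): `lines` is any iterable
    # of strings, for example an open file object.
    row = None
    for line in lines:
        if line.count('\t') < tabs_per_record:
            if row is not None:      # continuation of the current note
                row[-1] += line
            continue                 # otherwise: stray line before any record
        if row is not None:
            row[-1] = normalize_note(row[-1])
            yield row
        row = line.split('\t')
    if row is not None:
        row[-1] = normalize_note(row[-1])
        yield row


# Made-up sample: a two-paragraph note, then a record with an empty note.
sample = [
    'User Name\tCode\tTrack\tColor\tfirst paragraph\n',
    '\n',
    'second paragraph\n',
    '\n',
    'User Name2\tCode2\tTrack2\tColor2\t\n',
    '\n',
]
records = list(iter_records(sample))
```

With this sample, the first record's note comes out as `'first paragraph second paragraph'` and the second record keeps an empty string as its fifth field. To read from disk, pass the open file object: `iter_records(open(yourfilename))`.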

In this case I would probably read the whole file as a string and use splitlines() to avoid holding the file handle open until the generator halts. — llb, Jul 09 '13 at 07:40
@llb: And what if the file is a million lines short? An open file handle is cheap. Holding all data in memory is not. — Martijn Pieters, Jul 09 '13 at 07:41

score 1 · Answer 2 · answered Jul 09 '13 at 07:34

The first problem you're having is with `in`, which tries to be helpful: iterating over the file object reads in one line of text from the file at a time.

>>> [i for i in open('SomeFile.txt', 'r') ]
['User Name\tCode\tTrack\tColor\tNote\n', '\n', 'User Name2\tCode2\tTrack2\tColor2\tNote2\n', '\n']

Adding in the call to .strip() does strip the whitespace from each line, but that leaves you with empty lines - it doesn't take those empty elements out of the list.

>>> [i.strip() for i in open('SomeFile.txt', 'r') ]
['User Name\tCode\tTrack\tColor\tNote', '', 'User Name2\tCode2\tTrack2\tColor2\tNote2', '']

However, you can add an if clause to the list comprehension to make it drop lines that only have a newline:

>>> [i.strip() for i in open('SomeFile.txt', 'r') if len(i) >1 ]
['User Name\tCode\tTrack\tColor\tNote', 'User Name2\tCode2\tTrack2\tColor2\tNote2']
>>>
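
Building on the `if` clause, here is a sketch that also produces the nested list the question asks for. Two things are my own substitutions and not part of the answer above: the filter tests `line.strip()` instead of `len(i) > 1`, so that lines holding only spaces or a `\r\n` ending are dropped too, and only the line ending is removed before splitting, so that an empty trailing Note field still counts as a fifth column. The name `load_rows` is hypothetical. This still splits a multi-paragraph note into separate rows.

```python
def load_rows(lines):
    # `lines` is any iterable of strings, e.g. an open file object.
    # Skip empty or whitespace-only lines; strip only the line ending so
    # that a trailing tab (empty Note field) is preserved by split().
    return [line.rstrip('\r\n').split('\t') for line in lines if line.strip()]


# Made-up sample: the second record has an empty Note field.
sample = [
    'User Name\tCode\tTrack\tColor\tNote\n',
    '\n',
    'User Name2\tCode2\tTrack2\tColor2\t\n',
    '\n',
]
rows = load_rows(sample)
```

For the real file this would be called as `load_rows(open('SomeFile.txt'))`.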

James Polley

Note that the OP's Note fields may contain embedded newlines as well as beginning or ending with newlines. Splitting on newlines (or iterating over a stream) will not handle this case effectively. – llb Jul 09 '13 at 07:36
Correct, but note that the OP also said they don't care about handling that wrinkle right now (or at least, it's a low priority). – James Polley Jul 09 '13 at 07:38
Thanks for this very simple answer. It almost works but I decided to go with the generator based solution above. – I_Ridanovic Jul 11 '13 at 03:43
The generator solution is simple and elegant and I like it much better than an "almost works" solution. – James Polley Jul 11 '13 at 21:19

score 0 · Answer 3 · edited May 23 '17 at 11:47

I think the csv module will help you.

E.g. look at this: Parsing CSV / tab-delimited txt file with Python.

answered Jul 09 '13 at 07:22

hpjepsen

csv will not help OP, because the file can contain multiple newlines for a record. – falsetru Jul 09 '13 at 07:27
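
To illustrate the comment above, here is a small sketch of what `csv.reader` does with unquoted embedded newlines. The sample data is made up for the illustration, and the snippet is written for Python 3 (on Python 2, `io.StringIO` would need a unicode string).

```python
import csv
import io

# Made-up data: the first record's note continues over two more lines.
data = (
    'User Name\tCode\tTrack\tColor\tfirst paragraph\n'
    '\n'
    'second paragraph\n'
    'User Name2\tCode2\tTrack2\tColor2\tNote2\n'
)

# The Note field is not quoted, so the reader has no way to tell an
# embedded newline from the end of a record.
rows = list(csv.reader(io.StringIO(data), delimiter='\t'))
```

Here `rows` has four entries instead of two: the blank line becomes an empty list and `'second paragraph'` becomes a one-column row of its own, so the records would still need to be stitched back together afterwards.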
