How to read a text file where some of the contents have line breaks?

Question

I have a text file of this form:

06/01/2016, 10:40 pm - abcde
07/01/2016, 12:04 pm - abcde
07/01/2016, 12:05 pm - abcde
07/01/2016, 12:05 pm - abcde
07/01/2016, 6:14 pm - abcde

fghe
07/01/2016, 6:20 pm - abcde
07/01/2016, 7:58 pm - abcde

fghe

ijkl
07/01/2016, 7:58 pm - abcde

You can see that every line is separated by a line break, but some row contents have line breaks in them. So, simply separating by line doesn't parse every line properly.

As an example, for the 5th entry, I want my output to be 07/01/2016, 6:14 pm - abcde fghe

Here is my current code:

with open('file.txt', 'r') as text_file:
    data = []
    for line in text_file:
        row = line.strip()
        data.append(row)

Is the data that can contain line breaks itself contained in double-quotes, by any chance? — juanpa.arrivillaga, May 07 '17 at 18:50
Could you show how `data` should look? It is unclear from your description. I see the input, but it is not clear what the output should look like. — TitanFighter, May 07 '17 at 18:50
@Imran then you probably have some type of CSV. If you know what the separator is, you could do this trivially using the `csv` module. Can you post the first few rows of the data? — juanpa.arrivillaga, May 07 '17 at 18:57
@Imran, you mean you want to remove `fghe` and `ijkl` and just keep elements like this `06/01/2016, 10:40 pm - abcde`? — TitanFighter, May 07 '17 at 18:57
@TitanFighter No, `"abcde\n\nfghe\n\nijkl"` are all part of the data. — juanpa.arrivillaga, May 07 '17 at 18:58
@TitanFighter no I want it all in a single element - it's part of the contents. — Imran, May 07 '17 at 18:58
Are the linebreaks in the data identical to the ones that separate the actual data lines? — handle, May 07 '17 at 19:04
@Imran can you post a few rows of the actual data? It is almost certainly some sort of csv, in which case, the right answer is to simply use the `csv` module and not write your own parser. — juanpa.arrivillaga, May 07 '17 at 19:08
@juanpa.arrivillaga the actual data is the same as what I posted but with more text and more rows. — Imran, May 07 '17 at 19:13
But as you said it has *double quotes*, right? Where are the double quotes, and more importantly **what is the delimiter**? — juanpa.arrivillaga, May 07 '17 at 19:13
@juanpa.arrivillaga here's a sample: https://www.dropbox.com/s/leuvfsnu6y98v00/test.txt?dl=0 — Imran, May 07 '17 at 19:19
You could try parsing the first item in each line with [`strptime()`](https://docs.python.org/3/library/datetime.html#datetime.datetime.strptime) and see if it raises a `ValueError` or not (i.e. via `try`/`except`). If it doesn't, you can assume it's part of the previous line. — martineau, May 07 '17 at 19:39
@Imran If possible, you need to sanitize user input and delimit the data field. Otherwise data sets cannot be separated reliably. — handle, May 07 '17 at 19:53
Pretty much all you need is to replace `\n\n` for a white space: `'\n'.join(data).replace('\n\n', ' ').split('\n'):` — , May 07 '17 at 20:13
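The `strptime()` suggestion in the comments above could look roughly like the sketch below. It is not code from the thread: the names `HEADER_FORMAT`, `starts_new_record` and `group_records` are made up for illustration, the format string is inferred from the sample lines, and `%p` depends on the locale providing `am`/`pm`. Continuation lines are joined with a single space, which is what the question asks for.

```python
from datetime import datetime

# Assumed from the sample data, e.g. "07/01/2016, 6:14 pm"
HEADER_FORMAT = "%d/%m/%Y, %I:%M %p"


def starts_new_record(line):
    """Return True if the text before the first ' - ' parses as a timestamp."""
    stamp, sep, _ = line.partition(" - ")
    if not sep:
        return False
    try:
        datetime.strptime(stamp, HEADER_FORMAT)
    except ValueError:
        return False
    return True


def group_records(lines):
    """Merge continuation lines into the record that precedes them."""
    records = []
    for line in lines:
        text = line.strip()
        if not text:
            continue  # blank lines inside a message are dropped
        if starts_new_record(text) or not records:
            records.append(text)
        else:
            records[-1] += " " + text
    return records


sample = (
    "07/01/2016, 12:05 pm - abcde\n"
    "07/01/2016, 6:14 pm - abcde\n"
    "\n"
    "fghe\n"
    "07/01/2016, 6:20 pm - abcde\n"
)
records = group_records(sample.splitlines())
for record in records:
    print(record)
```

Like the other approaches in this thread, this still misreads a message line that happens to begin with a valid timestamp followed by ` - `.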

hugos · Answer 1 · 2017-05-07T19:07:17.353

1

Considering that ',' can only appear as a separator, we may check if the line has a comma and concatenate it to the last row if it doesn't:

data = []

with open('file.txt', 'r') as text_file:
    for line in text_file:
        row = line.strip()
        if ',' not in row and data:
            # no comma: continuation of the previous record
            data[-1] += '\n' + row
        else:
            data.append(row)

edited May 07 '17 at 19:07

answered May 07 '17 at 19:00


Nothing so far is preventing a comma from appearing in the data (actually, in the data file linked in the question's comments, there are several). Reliable separation is not possible. – handle May 07 '17 at 19:51
When I posted there was only the example in the question and my code would be "the simplest thing that could possibly work". But with the data linked in the comments you are right, it wouldn't work... – hugos May 07 '17 at 19:57

dawg · Accepted Answer · 2017-05-07T20:57:44.647

Given your example input, you can use a regex with a forward lookahead:

import re
from pprint import pprint

fn = 'file.txt'  # same file name as in the question
pat = re.compile(r'^(\d\d\/\d\d\/\d\d\d\d.*?)(?=^^\d\d\/\d\d\/\d\d\d\d|\Z)', re.S | re.M)

with open(fn) as f:
    pprint([m.group(1) for m in pat.finditer(f.read())])

Prints:

['06/01/2016, 10:40 pm - abcde\n',
 '07/01/2016, 12:04 pm - abcde\n',
 '07/01/2016, 12:05 pm - abcde\n',
 '07/01/2016, 12:05 pm - abcde\n',
 '07/01/2016, 6:14 pm - abcde\n\nfghe\n',
 '07/01/2016, 6:20 pm - abcde\n',
 '07/01/2016, 7:58 pm - abcde\n\nfghe\n\nijkl\n',
 '07/01/2016, 7:58 pm - abcde\n']

With the Dropbox example, prints:

['11/11/2015, 3:16 pm - IK: 12\n',
 '13/11/2015, 12:10 pm - IK: Hi.\n\nBut this is not about me.\n\nA donation, however small, will go a long way.\n\nThank you.\n',
 '13/11/2015, 12:11 pm - IK: Boo\n',
 '15/11/2015, 8:36 pm - IR: Root\n',
 '15/11/2015, 8:36 pm - IR: LaTeX?\n',
 '15/11/2015, 8:43 pm - IK: Ws\n']

If you want to delete the \n in what is captured, use m.group(1).strip().replace('\n', '') in place of m.group(1) in the list comprehension above.
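Note that removing the newlines outright glues the pieces together, whereas the question asks for `abcde fghe` with a space. A small sketch of the difference, using two captured strings copied from the output above (the variable names are only illustrative):

```python
captured = [
    '07/01/2016, 6:14 pm - abcde\n\nfghe\n',
    '07/01/2016, 6:20 pm - abcde\n',
]

# Deleting the newlines joins the text without a separator.
removed = [s.strip().replace('\n', '') for s in captured]

# Splitting on whitespace and re-joining leaves exactly one space per gap.
spaced = [' '.join(s.split()) for s in captured]

print(removed[0])  # 07/01/2016, 6:14 pm - abcdefghe
print(spaced[0])   # 07/01/2016, 6:14 pm - abcde fghe
```

The `split()`/`join()` form also collapses any run of spaces inside a message, so it may not be suitable if the original spacing matters.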

Explanation of regex:

^(\d\d\/\d\d\/\d\d\d\d.*?)(?=^^\d\d\/\d\d\/\d\d\d\d|\Z)

^                                                       start of line   
    ^  ^  ^  ^   ^                                      pattern for a date  
                       ^                                capture the rest...  
                           ^                            until (look ahead)
                                      ^ ^ ^             another date
                                                  ^     or
                                                     ^  end of string
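The same breakdown can be kept next to the code by compiling the pattern with `re.X` (verbose mode), which ignores whitespace and `#` comments inside the pattern. This is a sketch of an equivalent pattern, not the answer's exact text: the slashes are left unescaped, the year is written as `\d{4}`, and the lookahead uses a single `^`.

```python
import re

pat = re.compile(r"""
    ^                     # start of a line (because of re.M)
    (                     # group 1: one whole record
      \d\d/\d\d/\d{4}     # a date such as 07/01/2016
      .*?                 # as little as possible, newlines included (re.S)
    )
    (?=                   # stop just before ...
      ^\d\d/\d\d/\d{4}    # ... a line that starts with another date
      |                   # or
      \Z                  # ... the end of the string
    )
""", re.S | re.M | re.X)

sample = (
    "07/01/2016, 6:14 pm - abcde\n"
    "\n"
    "fghe\n"
    "07/01/2016, 6:20 pm - abcde\n"
)
records = pat.findall(sample)
print(records)
```

`re.S` lets `.` cross line breaks, and `re.M` makes `^` match at the start of every line rather than only at the start of the string; without both flags the pattern would not span multi-line records.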

This works perfectly thanks! Can you explain what the code inside `re.compile` does? — Imran, May 07 '17 at 20:40

score 0 · Answer 3 · answered May 07 '17 at 18:58

0

You could use regular expressions (using the re module) to check for dates like this:

import re
with open('file.txt', 'r') as text_file:
  data = []
  for line in text_file:
    row = line.strip()
    if re.match(r'\d{2}/\d{2}/\d{4}.*', row):  # match() needs the string to test
      data.append(row)  # date: new record
    else:
      data[-1] += '\n' + row  # no date: append to last record

# '\d{2}': two digits
# '.*': any character, zero or more times

answered May 07 '17 at 18:58

user2390182


Like any other approach so far: breaks if data contains the delimiter sequence (a date in this format). – handle May 07 '17 at 19:55
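One way to make that failure less likely, though not impossible, is to require the whole header rather than just a date. The sketch below compares a date-only test with a stricter one; the `strict` pattern and the sample lines are assumptions based on the format shown in the question.

```python
import re

# Date only, as in the answer above.
loose = re.compile(r'\d{2}/\d{2}/\d{4}')
# Assumed full header: date, comma, 12-hour time, am/pm, then " - ".
strict = re.compile(r'\d{2}/\d{2}/\d{4}, \d{1,2}:\d{2} [ap]m - ')

lines = [
    "07/01/2016, 6:14 pm - see you on",
    "08/01/2016 at noon",  # message text that happens to start with a date
    "07/01/2016, 6:20 pm - ok",
]

loose_starts = [bool(loose.match(line)) for line in lines]
strict_starts = [bool(strict.match(line)) for line in lines]

print(loose_starts)   # [True, True, True]
print(strict_starts)  # [True, False, True]
```

A message line that reproduces the full header would still be taken for a new record, so the underlying ambiguity remains.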

handle · Answer 4 · 2017-05-07T19:58:43.957

Simple testing for length:

#!python3
#coding=utf-8

data = """06/01/2016, 10:40 pm - abcde
07/01/2016, 12:04 pm - abcde
07/01/2016, 12:05 pm - abcde
07/01/2016, 12:05 pm - abcde
07/01/2016, 6:14 pm - abcde

fghe
07/01/2016, 6:20 pm - abcde
07/01/2016, 7:58 pm - abcde

fghe

ijkl
07/01/2016, 7:58 pm - abcde"""

lines = data.split("\n")
out = []
for l in lines:
    c = l.strip()
    if c:
        if len(c) < 10:
            out[-1] += c
        else:
            out.append(c)
    #skip empty

for o in out:
    print(o)

results in:

06/01/2016, 10:40 pm - abcde
07/01/2016, 12:04 pm - abcde
07/01/2016, 12:05 pm - abcde
07/01/2016, 12:05 pm - abcde
07/01/2016, 6:14 pm - abcdefghe
07/01/2016, 6:20 pm - abcde
07/01/2016, 7:58 pm - abcdefgheijkl
07/01/2016, 7:58 pm - abcde

Does not contain the line breaks in the data!

But this one-liner regular expression should do it (split on a line break followed by a digit), at least for the sample data (it breaks when the data itself contains a line break followed by a digit):

#!python3
#coding=utf-8

text_file = """06/01/2016, 10:40 pm - abcde
07/01/2016, 12:04 pm - abcde
07/01/2016, 12:05 pm - abcde
07/01/2016, 12:05 pm - abcde
07/01/2016, 6:14 pm - abcde

fghe
07/01/2016, 6:20 pm - abcde
07/01/2016, 7:58 pm - abcde

fghe

ijkl
07/01/2016, 7:58 pm - abcde"""

import re
data = re.split(r"\n(?=\d)", text_file)  # raw string, so \d is not treated as a string escape

print(data)

for d in data:
    print(d)

Output:

['06/01/2016, 10:40 pm - abcde', '07/01/2016, 12:04 pm - abcde', '07/01/2016, 12:05 pm - abcde', '07/01/2016, 12:05 pm - abcde', '07/01/2016, 6:14 pm - abcde\n\nfghe', '07/01/2016, 6:20 pm - abcde', '07/01/2016, 7:58 pm - abcde\n\nfghe\n\nijkl', '07/01/2016, 7:58 pm - abcde']
06/01/2016, 10:40 pm - abcde
07/01/2016, 12:04 pm - abcde
07/01/2016, 12:05 pm - abcde
07/01/2016, 12:05 pm - abcde
07/01/2016, 6:14 pm - abcde

fghe
07/01/2016, 6:20 pm - abcde
07/01/2016, 7:58 pm - abcde

fghe

ijkl
07/01/2016, 7:58 pm - abcde

(fixed with lookahead)

Fails if data contains linebreak+digit, so the regular expression needs to be extended. On the other hand, this method [without sanitizing data, no delimiters] is easily broken if the data contains a new line with something that looks like a data header... — handle, May 07 '17 at 19:30
What if one of the dates is `12/21/2016`? If you use `re.split(r'\n\d', txt)` your date becomes `2/21/2016`... — dawg, May 07 '17 at 19:30
You can fix that with a look ahead -- like this solution I posted: `re.split(r'\n(?=^\d)', txt)` — dawg, May 07 '17 at 19:34
Didn't know about `\Z` yet, thanks. [Ok you've removed it again..] — handle, May 07 '17 at 19:38
`\Z` is actually not needed with `split` since the last element is included anyway. It is useful for a similar solution using `findall` or `finditer` that would not include the final element without the look ahead seeing the end of string. — dawg, May 07 '17 at 19:44
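The point about `\Z` in the last comment can be checked with a short experiment. The sample text and variable names below are made up for illustration; the patterns are simplified to "line starting with a digit" as in this answer.

```python
import re

txt = "01/01/2016 - a\n\nb\n02/01/2016 - c"

# split keeps whatever follows the last separator, so the final record survives.
by_split = re.split(r'\n(?=\d)', txt)

# findall only returns a record if its lookahead succeeds.
no_end = re.findall(r'^\d.*?(?=^\d)', txt, re.S | re.M)
with_end = re.findall(r'^\d.*?(?=^\d|\Z)', txt, re.S | re.M)

print(by_split)  # ['01/01/2016 - a\n\nb', '02/01/2016 - c']
print(no_end)    # ['01/01/2016 - a\n\nb\n']
print(with_end)  # ['01/01/2016 - a\n\nb\n', '02/01/2016 - c']
```

Without `\Z`, the last record has no following line that starts with a digit, so the lookahead never succeeds and `findall` drops that record.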
