3

I have a text file of this form:

06/01/2016, 10:40 pm - abcde
07/01/2016, 12:04 pm - abcde
07/01/2016, 12:05 pm - abcde
07/01/2016, 12:05 pm - abcde
07/01/2016, 6:14 pm - abcde

fghe
07/01/2016, 6:20 pm - abcde
07/01/2016, 7:58 pm - abcde

fghe

ijkl
07/01/2016, 7:58 pm - abcde

You can see that every entry starts on a new line, but some entries contain line breaks themselves. So simply splitting on line breaks doesn't parse every entry properly.

As an example, for the 5th entry, I want my output to be 07/01/2016, 6:14 pm - abcde fghe

Here is my current code:

with open('file.txt', 'r') as text_file:
    data = []
    for line in text_file:
        row = line.strip()
        data.append(row)
Imran

4 Answers

1

Considering that ',' can only appear as a separator, we may check if the line has a comma and concatenate it to the last row if it doesn't:

data = []

with open('file.txt', 'r') as text_file:
    for line in text_file:
        row = line.strip()
        if ',' not in row:
            data[-1] += '\n' + row
        else:
            data.append(row)
hugos
  • Nothing so far is preventing a comma from appearing in the data (actually, in the data file linked in the question's comments, there are several). Reliable separation is not possible. – handle May 07 '17 at 19:51
  • When I posted there was only the example in the question and my code would be "the simplest thing that could possibly work". But with the data linked in the comments you are right, it wouldn't work... – hugos May 07 '17 at 19:57
1

Given your example input, you can use a regex with a forward lookahead:

import re
from pprint import pprint

pat=re.compile(r'^(\d\d\/\d\d\/\d\d\d\d.*?)(?=^^\d\d\/\d\d\/\d\d\d\d|\Z)', re.S | re.M)

fn = 'file.txt'  # path to the input file, as in the question
with open(fn) as f:
    pprint([m.group(1) for m in pat.finditer(f.read())])

Prints:

['06/01/2016, 10:40 pm - abcde\n',
 '07/01/2016, 12:04 pm - abcde\n',
 '07/01/2016, 12:05 pm - abcde\n',
 '07/01/2016, 12:05 pm - abcde\n',
 '07/01/2016, 6:14 pm - abcde\n\nfghe\n',
 '07/01/2016, 6:20 pm - abcde\n',
 '07/01/2016, 7:58 pm - abcde\n\nfghe\n\nijkl\n',
 '07/01/2016, 7:58 pm - abcde\n']

With the Dropbox example, prints:

['11/11/2015, 3:16 pm - IK: 12\n',
 '13/11/2015, 12:10 pm - IK: Hi.\n\nBut this is not about me.\n\nA donation, however small, will go a long way.\n\nThank you.\n',
 '13/11/2015, 12:11 pm - IK: Boo\n',
 '15/11/2015, 8:36 pm - IR: Root\n',
 '15/11/2015, 8:36 pm - IR: LaTeX?\n',
 '15/11/2015, 8:43 pm - IK: Ws\n']

If you want to delete the \n in what is captured, replace m.group(1) in the list comprehension above with m.group(1).strip().replace('\n', '').
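Removing the line breaks outright glues the pieces together ('abcdefghe'), whereas the question asks for 'abcde fghe'. Here is a minimal sketch of one way to get that, assuming a single space is an acceptable separator; the names `record_pat`, `parse_records` and the shortened sample `text` are only illustrative:

```python
import re

# Shortened sample in the same shape as the question's file.
text = (
    "07/01/2016, 12:05 pm - abcde\n"
    "07/01/2016, 6:14 pm - abcde\n"
    "\n"
    "fghe\n"
    "07/01/2016, 6:20 pm - abcde\n"
)

# A record starts with a date at the beginning of a line and runs
# until the next such date or the end of the string.
record_pat = re.compile(
    r'^(\d\d/\d\d/\d{4}.*?)(?=^\d\d/\d\d/\d{4}|\Z)', re.S | re.M
)

def parse_records(text):
    # str.split() with no arguments splits on any run of whitespace, so
    # joining with ' ' turns line breaks (and blank lines) into one space.
    return [' '.join(m.group(1).split()) for m in record_pat.finditer(text)]

records = parse_records(text)
for r in records:
    print(r)
```

Note that this also collapses repeated spaces inside a message, which may or may not be what you want.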


Explanation of regex:

^(\d\d\/\d\d\/\d\d\d\d.*?)(?=^^\d\d\/\d\d\/\d\d\d\d|\Z)

^                                                       start of line   
    ^  ^  ^  ^   ^                                      pattern for a date  
                       ^                                capture the rest...  
                           ^                            until (look ahead)
                                      ^ ^ ^             another date
                                                  ^     or
                                                     ^  end of string
dawg
0

You could use regular expressions (using the re module) to check for dates like this:

import re
with open('file.txt', 'r') as text_file:
  data = []
  for line in text_file:
    row = line.strip()
    if re.match(r'\d{2}/\d{2}/\d{4}.*', row):
      data.append(row)  # date: new record
    else:
      data[-1] += '\n' + row  # no date: append to last record

# '\d{2}': two digits
# '.*': any character, zero or more times
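A line inside a message that happens to begin with a date would be taken for a new record by a date-only check. Here is a sketch of a somewhat stricter variant, assuming every header looks like the ones in the sample ("dd/mm/yyyy, h:mm am/pm - "); `HEADER_RE` and `group_records` are made-up names, and the pattern is a guess from the sample, not a known file format:

```python
import re

# Assumed header shape, based only on the sample in the question:
# "dd/mm/yyyy, h:mm am|pm - "
HEADER_RE = re.compile(r'\d{2}/\d{2}/\d{4}, \d{1,2}:\d{2} [ap]m - ')

def group_records(lines):
    """Merge continuation lines into the record that precedes them."""
    records = []
    for line in lines:
        row = line.strip()
        if HEADER_RE.match(row) or not records:
            # A header starts a new record; so does the very first line,
            # which avoids an IndexError if the input does not begin
            # with a header.
            records.append(row)
        else:
            records[-1] += '\n' + row  # no header: append to last record
    return records

sample = [
    "07/01/2016, 6:14 pm - abcde\n",
    "\n",
    "fghe\n",
    "07/01/2016 is when we met\n",   # starts with a date, but is not a header
    "07/01/2016, 6:20 pm - abcde\n",
]
records = group_records(sample)
print(records)
```

This still breaks if a message line reproduces the full header shape, so it only narrows the problem.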
user2390182
  • Like any other approach so far: breaks if data contains the delimiter sequence (a date in this format). – handle May 07 '17 at 19:55
0

Simple testing for length:

#!python3
#coding=utf-8

data = """06/01/2016, 10:40 pm - abcde
07/01/2016, 12:04 pm - abcde
07/01/2016, 12:05 pm - abcde
07/01/2016, 12:05 pm - abcde
07/01/2016, 6:14 pm - abcde

fghe
07/01/2016, 6:20 pm - abcde
07/01/2016, 7:58 pm - abcde

fghe

ijkl
07/01/2016, 7:58 pm - abcde"""

lines = data.split("\n")
out = []
for l in lines:
    c = l.strip()
    if c:
        if len(c) < 10:
            out[-1] += c
        else:
            out.append(c)
    #skip empty

for o in out:
    print(o)

results in:

06/01/2016, 10:40 pm - abcde
07/01/2016, 12:04 pm - abcde
07/01/2016, 12:05 pm - abcde
07/01/2016, 12:05 pm - abcde
07/01/2016, 6:14 pm - abcdefghe
07/01/2016, 6:20 pm - abcde
07/01/2016, 7:58 pm - abcdefgheijkl
07/01/2016, 7:58 pm - abcde

Note that this output no longer contains the line breaks that were in the data!
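If the breaks should survive, the same length test can re-insert a separator when merging. A small sketch, where the function name `merge_short_lines` and its parameters `min_len` and `sep` are invented for illustration; the threshold of 10 is just the value used above and still assumes continuation lines are short:

```python
def merge_short_lines(text, min_len=10, sep="\n"):
    """Append lines shorter than min_len to the previous entry, joined by sep."""
    out = []
    for line in text.split("\n"):
        c = line.strip()
        if not c:
            continue  # skip empty lines, as in the code above
        if len(c) < min_len and out:
            out[-1] += sep + c
        else:
            # Long lines start a new entry; so does a short first line,
            # which avoids indexing into an empty list.
            out.append(c)
    return out

sample = "07/01/2016, 6:14 pm - abcde\n\nfghe\n07/01/2016, 6:20 pm - abcde"
merged = merge_short_lines(sample)
spaced = merge_short_lines(sample, sep=" ")
print(merged)
print(spaced)
```

A continuation line of 10 or more characters would still be treated as a new entry, so this only suits data like the sample.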


But this one-liner regular expression should do it, at least for the sample data. It splits on a line break that is followed by a digit, so it breaks when the data itself contains a line break followed by a digit:

#!python3
#coding=utf-8

text_file = """06/01/2016, 10:40 pm - abcde
07/01/2016, 12:04 pm - abcde
07/01/2016, 12:05 pm - abcde
07/01/2016, 12:05 pm - abcde
07/01/2016, 6:14 pm - abcde

fghe
07/01/2016, 6:20 pm - abcde
07/01/2016, 7:58 pm - abcde

fghe

ijkl
07/01/2016, 7:58 pm - abcde"""

import re
data = re.split(r"\n(?=\d)", text_file)

print(data)

for d in data:
    print(d)

Output:

['06/01/2016, 10:40 pm - abcde', '07/01/2016, 12:04 pm - abcde', '07/01/2016, 12:05 pm - abcde', '07/01/2016, 12:05 pm - abcde', '07/01/2016, 6:14 pm - abcde\n\nfghe', '07/01/2016, 6:20 pm - abcde', '07/01/2016, 7:58 pm - abcde\n\nfghe\n\nijkl', '07/01/2016, 7:58 pm - abcde']
06/01/2016, 10:40 pm - abcde
07/01/2016, 12:04 pm - abcde
07/01/2016, 12:05 pm - abcde
07/01/2016, 12:05 pm - abcde
07/01/2016, 6:14 pm - abcde

fghe
07/01/2016, 6:20 pm - abcde
07/01/2016, 7:58 pm - abcde

fghe

ijkl
07/01/2016, 7:58 pm - abcde

(fixed with lookahead)
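To make the split less fragile, the lookahead can require a whole date instead of a single digit. A sketch, assuming headers always start with "dd/mm/yyyy, " as in the sample; the "3 apples" line is invented test data:

```python
import re

text = (
    "07/01/2016, 6:14 pm - abcde\n"
    "\n"
    "3 apples\n"                      # message line that starts with a digit
    "07/01/2016, 6:20 pm - abcde"
)

# Split only on a line break that is followed by a full date and ", ".
records = re.split(r"\n(?=\d{2}/\d{2}/\d{4}, )", text)

# For comparison: the single-digit lookahead also splits before "3 apples".
naive = re.split(r"\n(?=\d)", text)

print(records)
print(naive)
```

Because the lookahead consumes nothing, the date stays intact at the start of each record.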

handle
  • Fails if data contains linebreak+digit, so the regular expression needs to be extended. On the other hand, this method [without sanitizing data, no delimiters] is easily broken if the data contains a new line with something that looks like a data header... – handle May 07 '17 at 19:30
  • What if one of the dates is `12/21/2016`? If you use `re.split(r'\n\d', txt)` your date becomes `2/21/2016`... – dawg May 07 '17 at 19:30
  • Oops, didn't notice that it consumes the digit. – handle May 07 '17 at 19:32
  • You can fix that with a look ahead -- like this solution I posted: `re.split(r'\n(?=^\d)', txt)` – dawg May 07 '17 at 19:34
  • Didn't know about `\Z` yet, thanks. [Ok you've removed it again..] – handle May 07 '17 at 19:38
  • `\Z` is actually not needed with `split` since the last element is included anyway. It is useful for a similar solution using `findall` or `finditer` that would not include the final element without the look ahead seeing the end of string. – dawg May 07 '17 at 19:44
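The point about `\Z` in the last comment can be checked with a small experiment. This is only an illustrative sketch; the variable names and the two-record sample are made up for the comparison:

```python
import re

text = "06/01/2016, 10:40 pm - abcde\n07/01/2016, 6:14 pm - abcde\n\nfghe"

date = r"\d\d/\d\d/\d{4}"

# finditer needs something to stop the last record: either the next date
# or the end of the string (\Z).
with_end = re.compile(rf"^({date}.*?)(?=^{date}|\Z)", re.S | re.M)
without_end = re.compile(rf"^({date}.*?)(?=^{date})", re.S | re.M)

found_with = [m.group(1) for m in with_end.finditer(text)]
found_without = [m.group(1) for m in without_end.finditer(text)]

# split returns whatever follows the last separator, so no \Z is needed.
split_result = re.split(rf"\n(?={date})", text)

print(found_with)
print(found_without)  # the final record is missing here
print(split_result)
```

Also note that `finditer` keeps the trailing line break of each record, while `split` consumes it as the separator.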