
I have a CSV file in which most lines start with a date, but some start with text. For example:

time                       user   text
2019-01-01T00:09:59-05:00: user1: text1 
2019-01-01T00:09:59-05:00: user1: text4
2019-01-01T00:10:10-05:00: operator: error \
 ERRCODE: error 'operator' info.
2019-01-01T00:09:59-05:00: user2: text5

As you can see, sometimes an error gets logged on a new line. I want to read this into a pandas DataFrame and convert the first column to a datetime. However, the ERRCODE lines break the parsing. Can I somehow read the file conditionally (I have a lot of data, so speed is a concern), so that a row that does not start with a date gets appended to the previous row's text column?

lte__

1 Answer

I know you asked for a pandas solution, but I recently ran into a similar problem. My solution was to open each file as a text file, replace the faulty parts, save it back, and then open it with read_csv.

For example, in your case, I'd do something along the lines of:

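import os

# `folder` and `files` are assumed to exist: a directory and the file names in it
for filename in files:
    path = os.path.join(folder, filename)
    with open(path, 'r') as f:
        text = f.read()
    # glue the continuation line back onto the error row
    text = text.replace('error \n', 'error')
    with open(path, 'w') as f:
        f.write(text)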

...or something like that. Afterwards, the read_csv becomes much simpler, and no iteration over lines is required.

Hope it helps!

Itamar Mushkin
  • That's a huge step forward, thank you! Looks like my data is dirtier than expected - there are also new lines where there's no `error \n` at the end. Do you think there's a generalisation where I could replace any newline character if it's not followed by "2019" ? – lte__ May 16 '19 at 11:08
  • I'm trying some regex `file = re.sub(r'^(\n+20)', ' ', file)` but this won't work – lte__ May 16 '19 at 11:15
  • You're looking for `\n` at the beginning of a line. I'm not a regex master myself, but take a look at this: https://stackoverflow.com/questions/406230/regular-expression-to-match-a-line-that-doesnt-contain-a-word – Itamar Mushkin May 16 '19 at 11:33
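
The generalisation asked about in the comments can probably be done with a negative lookahead: match a newline only when it is not followed by a timestamp. The attempt `^(\n+20)` goes the other way, since it matches newlines that are followed by "20", and only at the very start of the text. Below is a minimal sketch, assuming every genuine record starts with an ISO-style date such as `2019-01-01T...`. The names `STRAY_NEWLINE` and `join_continuation_lines` are made up for illustration.

```python
import re

# Assumption: every genuine record starts with a date like 2019-01-01T...
# A newline that is NOT followed by such a date (or by the end of the text)
# is treated as a line break inside the previous record's text.
STRAY_NEWLINE = re.compile(r"[ \t]*\n(?!\d{4}-\d{2}-\d{2}T|\Z)[ \t]*")


def join_continuation_lines(text: str) -> str:
    """Replace stray newlines (and the spaces around them) with one space."""
    return STRAY_NEWLINE.sub(" ", text)


sample = (
    "2019-01-01T00:09:59-05:00: user1: text4\n"
    "2019-01-01T00:10:10-05:00: operator: error \n"
    " ERRCODE: error 'operator' info.\n"
    "2019-01-01T00:09:59-05:00: user2: text5\n"
)

cleaned = join_continuation_lines(sample)
print(cleaned)
```

This still reads each file into memory in one go, like the `replace` approach in the answer, so very large files may need to be processed in pieces. The cleaned text can then be written back and loaded with read_csv as before.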