
I have a CSV file that I am trying to parse, but one of the cells contains blocks of data full of nulls and line breaks. I need to enclose each row inside an array and merge all the content from this particular cell into its corresponding row. I recently posted a similar question, and the answer solved my problem partially, but I am having trouble building a loop that handles every line that does not meet a certain start condition. The code I have merges only the first line that does not meet that condition, and breaks after that.

I have:

file ="myfile.csv"
condition = "DAT"

data = open(file).read().split("\n")
for i, line in enumerate(data):
    if not line.startswith(condition):
        data[i-1] = data[i-1]+line
        data.pop(i)
print data

For a CSV that looks like this:

Case  | Info
-------------------
DAT1    single line  
DAT2    "Berns, 17, died Friday of complications from Hutchinson-Gilford progeria   syndrome, commonly known as progeria. He was diagnosed with progeria when he was 22 months old. His physician parents founded the nonprofit Progeria Research Foundation after his diagnosis.

Berns became the subject of an HBO documentary, ""Life According to Sam."" The exposure has brought greater recognition to the condition, which causes musculoskeletal degeneration, cardiovascular problems and other symptoms associated with aging.

Kraft met the young sports fan and attended the HBO premiere of the documentary in New    York in October. Kraft made a $500,000 matching pledge to the foundation.

The Boston Globe reported that Berns was invited to a Patriots practice that month, and gave the players an impromptu motivational speech.

DAT3    single line
DAT4    YWYWQIDOWCOOXXOXOOOOOOOOOOO 

It does join the full sentence with the previous line. But when it hits a double space or double line it fails and registers it as a new line. For example, if I print:

data[0]

The output is:

DAT1    single line

If I print:

data[1]

The output is:

DAT2    "Berns, 17, died Friday of complications from Hutchinson-Gilford progeria syndrome, commonly known as progeria. He was diagnosed with progeria when he was 22 months old. His physician parents founded the nonprofit Progeria Research Foundation after his diagnosis.

But if I print:

data[2]

The output is:

Berns became the subject of an HBO documentary, ""Life According to Sam."" The exposure has brought greater recognition to the condition, which causes musculoskeletal degeneration, cardiovascular problems and other symptoms associated with aging.

Instead of:

DAT3    single line

How do I merge that full block of text in the "Info" column so that it always stays with its corresponding DAT row instead of showing up as a new row, regardless of null or newline characters?

  • You use `pop` while you are still iterating over the data. You shouldn't change things you are iterating over. Copy the data you want to a new list instead. – kylieCatt Jan 15 '14 at 18:56
  • Why don't you use the csv module? http://docs.python.org/2/library/csv.html It's able to handle various delimiters and escape chars. Probably delimiter="\t" in your case. – gawel Jan 15 '14 at 18:59
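
A minimal sketch of the `csv` approach suggested in the comment above, written for Python 3. It assumes the file is tab-delimited and that each multi-line cell is wrapped in double quotes with a closing quote; the `SAMPLE` text and the `read_rows` helper are made up for illustration and are not from the original post.

```python
import csv
import io

# Hypothetical sample: tab-delimited, with one quoted cell that spans
# several lines and contains a doubled ("") quote, as in the question.
SAMPLE = (
    'DAT1\tsingle line\n'
    'DAT2\t"first paragraph\n\nsecond ""quoted"" paragraph"\n'
    'DAT3\tsingle line\n'
)


def read_rows(handle, delimiter="\t"):
    """Return every record as a list of cells.

    csv.reader keeps newlines that sit inside a quoted cell, so a
    multi-line cell stays attached to its own row.
    """
    return list(csv.reader(handle, delimiter=delimiter))


rows = read_rows(io.StringIO(SAMPLE))
for row in rows:
    print(row)
```

With a real file, the handle would come from something like `open("myfile.csv", newline="")`. If the delimiter turns out to be runs of spaces rather than a tab, this sketch would need adjusting.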

2 Answers


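Changing a list while iterating over it is error-prone: every `pop` shifts the remaining items one position to the left, so the loop skips the line that follows each merge. Build a new list instead: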

new_data = []
for line in data:
    if not new_data or line.startswith(condition):
        new_data.append(line)
    else:
        new_data[-1] += line
print new_data
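
To see why the original loop stops after the first merge, here is a small side-by-side sketch in Python 3. The function names and the four-line sample are hypothetical; the first function reproduces the loop from the question, the second the new-list approach above.

```python
def merge_in_place(lines, prefix):
    """Reproduce the question's loop: pop from the list being iterated."""
    lines = list(lines)  # work on a copy so the caller's list is untouched
    for i, line in enumerate(lines):
        if not line.startswith(prefix):
            lines[i - 1] += line
            lines.pop(i)  # shifts later items left; the next one is skipped
    return lines


def merge_into_new_list(lines, prefix):
    """Append continuation lines to the last record of a fresh list."""
    merged = []
    for line in lines:
        if not merged or line.startswith(prefix):
            merged.append(line)
        else:
            merged[-1] += line
    return merged


sample = ["DAT1 a", " b", " c", "DAT2 d"]
broken = merge_in_place(sample, "DAT")
fixed = merge_into_new_list(sample, "DAT")
print(broken)
print(fixed)
```

In the in-place version, the second continuation line (`" c"`) is never examined, which matches the behaviour described in the question.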
cmd
  • This works with the demo CSV. However, with the actual data it throws a "IndexError: list index out of range" for the line that contains: main_records[i-1] += line –  Jan 15 '14 at 19:16
  • nvm... I realized that it is the heading line that is messing everything up. If it is removed from the CSV, it works. That will be another question. Thank you! –  Jan 15 '14 at 19:43
  • edited to take into account if the first line fails condition – cmd Jan 15 '14 at 20:16
  • weird... somehow, for the cells that have multiple lines, it is cutting off the beginning, giving me "Berns became the subject of an HBO documentary..." instead of "DAT2 "Berns, 17, died Friday of complications from Hutchinson-Gilford progeria syndrome, commonly known as progeria. He was diagnosed with progeria when he was 22 months old. His physician parents founded the nonprofit Progeria Research Foundation after his diagnosis. Berns became the subject of an HBO documentary..." It is merging into the previous line but cutting off whatever was there previously. –  Jan 15 '14 at 20:58
  • that helped but still having that issue...i think regex is solving it –  Jan 15 '14 at 21:08

You can extract the lines directly into `data` with a regular expression:


import re

f = open("myfile.csv")
text = f.read()
data = re.findall(r"\n(DAT\d+.*)", text)  # raw string so \d is not treated as a string escape

Let me know if this doesn't help.

UPDATE:

I believe this would fix the problem with newlines:

import re

f = open("myfile.csv")
text = f.read()
lines = re.split(r"\n(?=DAT\d+)", text)
lines.pop(0)  # drop the first chunk (the header lines before the first DAT row)
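
A short Python 3 sketch of the difference between the two attempts, using a made-up sample string. In `re.findall`, `.` does not match a newline, so each match stops at the first line break inside the cell; `re.split` with a lookahead cuts only at newlines that are followed by a `DAT` key and keeps everything in between.

```python
import re

# Hypothetical sample: a header line, then three records, the second of
# which contains a blank line and a continuation paragraph.
text = (
    "Case | Info\n"
    "DAT1\tsingle line\n"
    "DAT2\tfirst paragraph\n"
    "\n"
    "second paragraph\n"
    "DAT3\tsingle line"
)

# First attempt: each match ends at the first newline in the cell.
truncated = re.findall(r"\n(DAT\d+.*)", text)

# Updated attempt: split only where a newline is followed by DAT + digits.
chunks = re.split(r"\n(?=DAT\d+)", text)
header, records = chunks[0], chunks[1:]

print(truncated)
print(records)
```

Here the header is separated by slicing rather than `pop(0)`, which is only a stylistic choice; both assume the text before the first `DAT` row is a header.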
Mehdi
  • OK, this one does not include the rest of the content after it breaks into a new line. It breaks after the first new line within the cell and it does not contain the rest of the cell or the following cell. I was getting a similar output doing something like "if line.startswith(condition): new_data.append(line)" –  Jan 15 '14 at 19:31
  • you are right, I should use `re.split`. I have an update for my answer :) – Mehdi Jan 15 '14 at 20:52
  • Wow, Mehdi, that's doing it! ...One thing, how can I restrict the regex to exactly "DAT" because if I use DAT0001, DAT0002 it works but if there are characters instead of numbers then it doesn't, for example, DAT0001 and DAT0002 would be printed as separate lines, yet DAT000 and DATNEXT will merge together. –  Jan 15 '14 at 21:07
  • `DAT\d+` matches only a 'DAT' that is followed by digits. If you want it to match any word that starts with DAT, replace `\d` with `.`: `\n(?=DAT.+)` – Mehdi Jan 15 '14 at 21:12
  • Amazing, can you recommend the book or course where you learned regex? –  Jan 15 '14 at 21:19
  • Well, learning regular expressions is not all about reading books. First you should understand the basics, like what you can and cannot do with a regular expression, then you should practice it in programming. I hope this link helps you: http://stackoverflow.com/questions/4736/learning-regular-expressions – Mehdi Jan 15 '14 at 21:30
  • Thank you, will go through it! –  Jan 15 '14 at 22:20
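
The effect of the key pattern discussed in the comments above can be sketched as follows (Python 3). The `split_records` helper and the sample text are hypothetical. Inside the lookahead, a plain `DAT` behaves almost the same as `DAT.+`, since the lookahead only checks what follows the newline and consumes nothing.

```python
import re


def split_records(text, key_pattern=r"DAT\d+"):
    """Split text at every newline that is followed by key_pattern.

    A leading chunk that does not start with the key (for example a
    header line) is dropped.
    """
    chunks = re.split(r"\n(?=%s)" % key_pattern, text)
    if chunks and not re.match(key_pattern, chunks[0]):
        chunks = chunks[1:]
    return chunks


sample = "Case | Info\nDAT0001\tone\nDATNEXT\ttwo\ncontinued\nDAT0002\tthree"

digits_only = split_records(sample, r"DAT\d+")  # DATNEXT is not a key
any_suffix = split_records(sample, r"DAT")      # every DAT... starts a record

print(digits_only)
print(any_suffix)
```

One caveat with the looser pattern: any continuation line inside a cell that happens to begin with `DAT` would also start a new record, so the stricter key is safer when the real keys are always numeric.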