Good day,

I am totally new to Python and I am trying to do something with a string.

I would like to remove any \n characters found between double quotes ( " ) only, from a given string:

str = "foo,bar,\n\"hihi\",\"hi\nhi\""

The desired output must be:

foo,bar
"hihi", "hihi"

Edit:

The desired output must be similar to that string: after = "foo,bar,\n\"hihi\",\"hihi\""

Any tips?

Cybrix
  • Arguably everything in your string is between double quotes, so what do you mean by "between double quotes"? Between the escaped double quotes (i.e. with backslashes ('\') in front of them)? – GreenMatt Jul 19 '11 at 20:50
  • @GreenMatt: Foo,bar isn't double quoted. – Cybrix Jul 19 '11 at 20:52
  • @GreenMatt: Yes I believe I must rely on regexp there. Any ideas? I am not a huge regexp guru here... – Cybrix Jul 19 '11 at 20:53
  • @Cybrix: you don't, see my answer. – orlp Jul 19 '11 at 21:01
  • @Cybrix: My first comment wasn't clear, sorry. I was trying to make the point that there are double quotes around the whole string, so your question wasn't clear (at least to me) on the first reading. One way to deal with that would have been to enclose your string in single quotes. And you now have a few suggestions to choose from, including one using a regex. Have fun! – GreenMatt Jul 19 '11 at 21:02

6 Answers

3

This should do:

def removenewlines(s):
    inquotes = False
    result = []

    for chunk in s.split("\""):
        if inquotes: chunk = chunk.replace("\n", "")  # replace() returns a new string
        result.append(chunk)
        inquotes = not inquotes

    return "\"".join(result)
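
A quick way to check the split-based idea is the minimal sketch below. The name `remove_newlines_in_quotes` is made up for illustration, and the sketch assumes the quotes are balanced and never escaped. After splitting on `"`, the chunks at odd indexes are the ones that sat inside quotes.

```python
def remove_newlines_in_quotes(s):
    # Hypothetical helper name. Splitting on '"' puts the quoted text at
    # odd indexes, assuming quotes are balanced and not escaped.
    chunks = s.split('"')
    for i in range(1, len(chunks), 2):
        # str.replace returns a new string, so the result must be stored.
        chunks[i] = chunks[i].replace("\n", "")
    return '"'.join(chunks)


text = 'foo,bar,\n"hihi","hi\nhi"'
cleaned = remove_newlines_in_quotes(text)
print(repr(cleaned))
```

The newline after `bar,` is left alone because it falls in an even-indexed chunk, outside any quotes.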
orlp
  • printing the original string will still cause a newline on the second `\n`. – Manny D Jul 19 '11 at 20:46
  • Yes, I got that. I need to get rid of them by replacing every `'\n'` with nothing `''`, but only those that are found inside the `""`. – Cybrix Jul 19 '11 at 20:47
  • It is an obvious solution, isn't it. Strings are immutable (even though CPython optimises +=), so it must be `result += ch`, though, or `result = []` and then join. – Cat Plus Plus Jul 19 '11 at 20:55
3

A simple stateful filter will do the trick.

in_string  = False
input_str  = 'foo,bar,\n"hihi","hi\nhi"'
output_str = ''

for ch in input_str:
    if ch == '"': in_string = not in_string
    if ch == '\n' and in_string: continue
    output_str += ch

print(output_str)
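
The same stateful filter can also be sketched as a function that collects characters in a list and joins them once at the end, so it does not depend on how the interpreter handles repeated `+=` on strings. The function name `strip_quoted_newlines` is only illustrative, and the sketch assumes quotes are balanced and not escaped.

```python
def strip_quoted_newlines(text):
    # Hypothetical name; same logic as the loop above, wrapped in a function.
    in_string = False
    kept = []
    for ch in text:
        if ch == '"':
            # Every double quote toggles the inside/outside state.
            in_string = not in_string
        if ch == '\n' and in_string:
            # Skip newlines only while inside a quoted run.
            continue
        kept.append(ch)
    # One join at the end instead of building the string piece by piece.
    return ''.join(kept)


filtered = strip_quoted_newlines('foo,bar,\n"hihi","hi\nhi"')
print(repr(filtered))
```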
Cat Plus Plus
  • Since I'm a total newbie to Python and your code is the only one that works: Thank you, answer accepted. – Cybrix Jul 19 '11 at 21:07
  • It's the least efficient solution on this page. Think 40MB of data without any double quotes. – Omri Barel Jul 19 '11 at 21:18
2
>>> import re
>>> str = "foo,bar,\n\"hihi\",\"hi\nhi\""
>>> re.sub(r'".*?"', lambda x: x.group(0).replace('\n',''), str, flags=re.S)
'foo,bar,\n"hihi","hihi"'
>>>

Short explanation:

  1. re.sub is a substitution engine. It takes a regular expression, a substitution function or expression, a string to work on, and other options.
  2. The regular expression ".*?" catches strings in double quotes that don't in themselves contain other double quotes (it has a small bug, because it wouldn't catch strings which contain escaped double-quotes).
  3. lambda x: ... is an expression which can be used wherever a function can be used.
  4. The substitution engine calls the function with the match object. x.group(0) is "the whole matched string", which also includes the double quotes. x.group(0).replace('\n','') is the matched string with every '\n' replaced by ''.
  5. The flag re.S tells re.sub that '\n' is a valid character to catch with a dot.
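
The steps above can be sketched with a named function in place of the lambda. The pattern here is an assumption, not the one from the answer: it is one possible way to let a quoted run contain backslash-escaped quotes, which is the small bug mentioned in step 2. The names `QUOTED` and `drop_newlines` are made up for illustration.

```python
import re

# Assumed pattern (not from the answer): inside the quotes, accept either a
# backslash followed by any character, or any character that is neither a
# quote nor a backslash. re.S lets '.' match a newline after a backslash.
QUOTED = re.compile(r'"(?:\\.|[^"\\])*"', re.S)


def drop_newlines(match):
    # match.group(0) is the whole quoted substring, quotes included.
    return match.group(0).replace('\n', '')


text = 'a,"he said \\"hi\nthere\\"",b\n'
result = QUOTED.sub(drop_newlines, text)
print(repr(result))
```

The trailing newline after `b` stays, since it is outside the quoted run.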

Personally I find longer functions that say the same thing more tiring and less readable, in the same way that in C I would prefer i++ to i = i + 1. It's all about what one is used to reading.

Omri Barel
  • +1, but it needs explanation, especially since the OP is new to Python. – Andrew Clark Jul 19 '11 at 21:02
  • Ugh, I would prefer a 6-line function to this unreadable mess. – orlp Jul 19 '11 at 21:03
  • Eli Collins wrote essentially the same, using an explicit function rather than a lambda expression. Why do you think that this is unreadable? re.sub has many options for a reason. – Omri Barel Jul 19 '11 at 21:06
2

Quick note: Python strings can use '' or "" as delimiters, so it's common practice to use one when the other is inside your string, for readability. E.g.: 'foo,bar,\n"hihi","hi\nhi"'. On to the question...

You probably want the Python regexp module: re. In particular, the substitution function is what you want here. There are a bunch of ways to do it, but one quick option is to use a regexp that identifies the "" substrings, then calls a helper function to strip any \n out of them...

import re
def helper(match):
    return match.group().replace("\n","")
input = 'foo,bar,\n"hihi","hi\nhi"'
result = re.sub('(".*?")', helper, input, flags=re.S)
Eli Collins
1

This regex works (assuming that quotes are correctly balanced):

import re
result = re.sub(r"""(?x) # verbose regex
    \n        # Match a newline
    (?!       # only if it is not followed by
     (?: 
      [^"]*"  # an even number of quotes
      [^"]*"  # (and any other non-quote characters)
     )*       # (yes, zero counts, too)
     [^"]*
     \Z       # until the end of the string.
    )""", 
    "", str)
Tim Pietzcker
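
For reference, here is a self-contained sketch of the lookahead approach above, run against the string from the question. It uses `\Z`, the end-of-string anchor that Python's `re` module has long supported, and passes `re.VERBOSE` as a flag; the name `NEWLINE_IN_QUOTES` is made up for illustration. As in the answer, it assumes the quotes are balanced.

```python
import re

# A newline is removed only when an odd number of quotes follows it,
# i.e. when the text up to the end of the string is NOT an even number of quotes.
NEWLINE_IN_QUOTES = re.compile(r'''
    \n                      # a newline
    (?!                     # not followed by
      (?: [^"]*" [^"]*" )*  # an even number of quotes (zero included)
      [^"]*
      \Z                    # up to the end of the string
    )''', re.VERBOSE)

text = 'foo,bar,\n"hihi","hi\nhi"'
without_inner_newlines = NEWLINE_IN_QUOTES.sub('', text)
print(repr(without_inner_newlines))
```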
0

Something like this

Break the CSV data into columns.

>>> m=re.findall(r'(".*?"|[^"]*?)(,\s*|\Z)',s,re.M|re.S)
>>> m
[('foo', ','), ('bar', ',\n'), ('"hihi"', ','), ('"hi\nhi"', ''), ('', '')]

Replace just the field instances of '\n' with ''.

>>> [ field.replace('\n','') + sep for field,sep in m ]
['foo,', 'bar,\n', '"hihi",', '"hihi"', '']

Reassemble the resulting stuff (if that's really the point.)

>>> "".join(_)
'foo,bar,\n"hihi","hihi"'
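
The three steps above can be put together as a plain script. Since `_` is only set by the interactive interpreter, this sketch stores each intermediate result in a named variable; the names `pairs`, `cleaned` and `result` are just illustrative.

```python
import re

s = 'foo,bar,\n"hihi","hi\nhi"'

# Step 1: break the data into (field, separator) pairs.
pairs = re.findall(r'(".*?"|[^"]*?)(,\s*|\Z)', s, re.M | re.S)

# Step 2: remove '\n' from the fields only, leaving the separators untouched.
cleaned = [field.replace('\n', '') + sep for field, sep in pairs]

# Step 3: reassemble, using a real variable instead of the shell-only '_'.
result = "".join(cleaned)
print(repr(result))
```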
S.Lott
  • "I can't make it work". Really? Anything *specific*? An error message? A value for `m` or something specific that could be used for debugging? Any hint at all? – S.Lott Jul 20 '11 at 02:34
  • It works in the shell. I will keep you updated because I can't make it work in my source where `s` is a much bigger string. – Cybrix Jul 20 '11 at 17:24
  • Something like: `"".join(_) NameError: global name '_' is not defined`. Again, working perfectly in the shell but not in my script. – Cybrix Jul 22 '11 at 02:01
  • @Cybrix: Read this. http://stackoverflow.com/questions/1538832/is-this-single-underscore-a-built-in-variable-in-python. It's a shortcut that only works in interactive Python. You have to use proper assignment and a proper variable when writing a script. – S.Lott Jul 22 '11 at 09:58