7

This is essentially the opposite of "Get pandas.read_csv to read empty values as empty string instead of nan".

Given the following CSV file:

col,val
"hi
there",1
,2
\f\,3
"",4
"""hi""",5

I want it to be read as:

         col  val
0  hi\nthere    1
1        NaN    2
2        \f\    3
3               4
4       "hi"    5

That is, reading the empty field (val 2) as NaN, while keeping the empty string (val 4) as an empty string.

Currently pd.read_csv converts both val 2 and val 4 to NaN; if I use na_filter=False, both are kept as empty strings.
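
For reference, a minimal reproduction of both behaviors (assuming the file above is saved as data.csv):

import pandas as pd

# default: the empty field (val 2) and the quoted empty string (val 4) both become NaN
df_default = pd.read_csv('data.csv')

# na_filter=False: both are kept as the empty string ''
df_raw = pd.read_csv('data.csv', na_filter=False)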

I'm assuming these two representations mean different things in CSV (empty field vs. empty string), so pandas should be able to distinguish them too.

Is there a way to make pandas distinguish these two cases? Or is my assumption wrong and the two representations are actually the same? (Please point me to a CSV standard if it's the latter.)

More information: I got the CSV by exporting a BigQuery table (with the intended meaning: val 2 is null and val 4 is an empty string), and I want to get the exact same table back. So this is not just a contrived example; it is the format BigQuery actually uses when exporting to CSV.

EDIT: further searching reveals a GitHub issue from 4 years ago that discusses a similar point (see this comment, for example), where one of the commenters mentions that there is some coercion (I'm not sure what they refer to, but I understand it as coercion between empty field and empty string). Is this still happening?

justhalf
  • 8,960
  • 3
  • 47
  • 74
  • if you use the csv module, and print out each line, is there a difference between the two? That would tell you if pandas has any chance of reading them differently. – user1558604 Nov 30 '19 at 00:48
  • 1
    Perhaps that is equivalent to my question, since pandas uses the csv module under the hood: "How to get csv module to read empty field and empty string differently?" (note, I haven't tried it yet; see the quick check after these comments) – justhalf Dec 02 '19 at 02:37
  • 1
    What pandas version are you using? This works as expected on 0.25. – cs95 Dec 09 '19 at 07:31
  • I'm using 0.25.1, Python 3, on Ubuntu. Can you clarify what you mean by works as expected? As in, it works as I want it to work in my question? – justhalf Dec 10 '19 at 04:56
  • 2
    There's one part of your question that is not standard, which is that you want to mix column datatypes. `NaN` is a floating point value; so the null string should not be `NaN`. I totally understand many use-cases for this behavior---on the other hand, it would be better to implement this in a way that is more definite (and that libraries will more likely support): have another column which tells you whether or not the other column is null. This will work regardless of datatype, and avoids having to use a valid value of your domain (empty string) as an indicator for null. – alkasm Dec 16 '19 at 05:24
  • Good point alkasm. I didn't really think about NaN being floating point, but you're right. I'm just wondering this because that's the output of BQ, so I thought there should be a way to read it properly. – justhalf Dec 16 '19 at 13:59
  • @alkasm +1 with regard to floating point. @justhalf +1, please review the CSV specification. `,,` is equivalent to `,"",`. Quoting is mostly used where the value contains a comma. When asking questions, please explain the why. For example, what post-processing are you doing that requires NaN and "" to be kept separate? – dank8 Mar 03 '23 at 02:15
  • @dank8 This was 4 years ago, haha. I don't remember the use case already. But I believe treating null vs empty string as different is not a niche use case? Distinguishing "missing data" and "data given is an empty string" is reasonable. See also the note at the end of my question about this being the output of BigQuery. – justhalf Mar 03 '23 at 06:22
  • @justhalf I looked, and the whole space of no-data variations is a wormhole... infinity, not provided, missing, not a number. Hope it was fun! – dank8 Mar 04 '23 at 07:16
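
(Answering the first comment above: does the csv module itself distinguish the two? With its default quoting it does not; a quick check:)

import csv
import io

sample = 'col,val\n,2\n"",4\n'
for row in csv.reader(io.StringIO(sample)):
    print(row)
# ['col', 'val']
# ['', '2']   <- empty field
# ['', '4']   <- quoted empty string: same result, so csv.reader alone cannot tell them apart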

4 Answers

3

Another option would be to disable quoting, to distinguish the fields where an empty string is present from those where nothing is present. The problem in this case is the entries which include newline characters in the text. We need to remove these characters first and merge the lines to create a new data file.

When reading the new data file with quoting disabled, empty fields are NaN and empty strings are two quote characters. This dataframe can then be used to find which values in the original dataframe are real NaNs and set them accordingly.

import numpy as np
import pandas as pd

with open('./data.csv') as f:
    lines = f.readlines()

# merge lines where the comma is missing
it = iter(lines)
lines2 = [x if ',' in x else x + next(it) for x in it]
# remove embedded newlines, then terminate each merged line with a single \n
lines3 = [l.replace('\n', '') + '\n' for l in lines2]
# write new file with merged lines
with open('./data_merged.csv', 'w+') as f:
    f.writelines(lines3)


# read original data
df = pd.read_csv('./data.csv', na_filter=False)
# read merged lines data with quoting off
df_merged = pd.read_csv('./data_merged.csv', quoting=3)

# in df_merged, NaN marks a truly empty field (a real NaN)
# set those positions in the original dataframe to NaN
df.loc[df_merged.col.isna(), 'col'] = np.nan
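
A sketch (my addition, not part of the original answer): if more than one column can contain strings, the same mask can be applied column by column.

for c in df.columns:
    # flip only the cells that read as '' but were truly empty in the unquoted pass
    df.loc[df_merged[c].isna() & (df[c] == ''), c] = np.nan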
mjspier
  • 6,386
  • 5
  • 33
  • 43
  • Cool trick. This does seem to solve the problem, at the expense of twice the file size and twice the reading time. I still prefer a solution that does not require writing a copy of the data, but this seems like the best contender so far! Thanks mjspier. – justhalf Dec 13 '19 at 14:53
  • This seems to be the best answer so far (preserve the newlines instead of replacing them), although I'm still looking for better solution. – justhalf Dec 16 '19 at 06:49
  • Actually I think there might be a big problem with this solution. The merging of the lines might be not correct when a comma is in the text itself. I think replacing the empty string `""` at the beginning of a line might be a more stable solution. – mjspier Dec 16 '19 at 12:05
  • You are right in that aspect, but I think trying to handle newlines in quoted CSV correctly is itself already a CSV parser, so at this point the best answer is still "use QUOTE_NONE with some handling of newlines". And since yours came first, I gave the +50 to you. – justhalf Dec 16 '19 at 13:57
3

pandas.read_csv accepts a quoting argument that controls quoting behavior for each field. It takes an int or one of the csv.QUOTE_* constants defined in the csv module. Of the available options, the one to take note of is csv.QUOTE_NONE. This constant instructs the reader to perform no special processing of quote characters, so fields wrapped in double quotes are read exactly as they appear, quotes included. The default value used by pandas is csv.QUOTE_MINIMAL.

In [237]: import csv
In [238]: import pandas as pd
In [239]: df = pd.read_csv("test.csv", quoting=csv.QUOTE_NONE)

In [240]: df
Out[240]: 
        col  val
0       "hi  NaN
1    there"  1.0
2       NaN  2.0
3       \f\  3.0
4        ""  4.0
5  """hi"""  5.0

With no special quoting, null values are parsed as NaN and empty strings with double-quotes are left as they are.

But there is a problem with this approach: if any quoted field contains a newline, its parts are treated as separate rows. This is evident in the first record of the csv file, where "hi\nthere" is split across two rows by pandas. To get around this, I first performed some pre-processing with the re module to replace any newline characters inside double-quoted strings with whitespace, then wrote the result back to the same file and read it with read_csv as above. As I'm not aware of the full format of your data, more regexes may be required. For the given problem, however, I get the desired output.

In [314]: with open("test.csv", 'r+') as f:
     ...:     data = f.read()
     ...:     import re
     ...:     pattern = re.compile(r'".*?"', re.DOTALL)
     ...:     data = pattern.sub(lambda x: x.group().replace('\n', ' '), data)
     ...:     f.seek(0)
     ...:     f.write(data)

In [315]: df = pd.read_csv("test.csv", quoting=csv.QUOTE_NONE)

In [316]: df
Out[316]: 
          col  val
0  "hi there"    1
1         NaN    2
2         \f\    3
3          ""    4
4    """hi"""    5
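
To reach the exact table in the question, one more step is needed (this part is my addition, not the original answer): strip the outer literal quotes and un-double the embedded ones.

def unquote(v):
    # strip one layer of surrounding quotes, then un-double embedded quotes
    if isinstance(v, str) and len(v) >= 2 and v.startswith('"') and v.endswith('"'):
        return v[1:-1].replace('""', '"')
    return v

df['col'] = df['col'].map(unquote)
# '"hi there"' -> 'hi there', '""' -> '', '"""hi"""' -> '"hi"'; NaN passes through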
amanb
  • 5,276
  • 3
  • 19
  • 38
  • I *think* if you pass in a file object instead of a path that is opened with `newline=''` as recommended in the `csv` module docs, you won't have the problem with newlines in quoted strings. I.e. `with open("test.csv", "r+", newline="") as f` and `pd.read_csv(f, ...)`. – alkasm Dec 15 '19 at 20:59
  • If what alkasm said is true, that would be a good solution, just need to do post-processing to all fields to remove initial and end quotes, then replace consecutive double quotes with one double quote. – justhalf Dec 16 '19 at 02:34
  • Dang! Well, you could read it into a list of rows first with the standard `csv` module and then construct the dataframe, which works but you lose all the other `pd.read_csv()` keywords. – alkasm Dec 16 '19 at 05:14
1

Here's a slightly ugly but complete answer:

import io
import re
import pandas as pd

with open('overflow.csv', 'r') as f:
    with io.StringIO(re.sub(r'(^"",)', "EMPTY_STR,", f.read(), flags=re.MULTILINE)) as ff:
        with io.StringIO(re.sub(r'(,"",)', ",EMPTY_STR,", ff.read(), flags=re.MULTILINE)) as fff:
            with io.StringIO(re.sub(r'(,""$)', ",EMPTY_STR", fff.read(), flags=re.MULTILINE)) as ffff:
                df = pd.read_csv(ffff)

df = df.replace('EMPTY_STR', '')

The re.sub() replaces the empty string with EMPTY_STR, which can later be replaced back with an actual empty string. It has to be called three times to cover all three possible types of occurrence (beginning, middle and end of the line).

Truly empty cells are left alone and indeed interpreted as NaN.
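
As a side note (my sketch, not part of the answer), the three calls can be collapsed into one by using zero-width assertions, so the surrounding delimiters are never consumed. The caveat about escaped quotes inside fields, raised in the comment below, still applies.

import io
import re
import pandas as pd

with open('overflow.csv', 'r') as f:
    # "" at line start, between commas, or at line end; the commas themselves are not consumed
    data = re.sub(r'(?m)(?:^|(?<=,))""(?=,|$)', 'EMPTY_STR', f.read())

df = pd.read_csv(io.StringIO(data)).replace('EMPTY_STR', '')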

gosuto
  • 5,422
  • 6
  • 36
  • 57
  • With regex solution my worry is that it might replace not just the intended one, but also others, because quotes inside the field are also escaped as two double quotes. For example, if the first field is `a,",b`, and the second one is `cd`, then in the CSV it will be `"a,"",b",cd`, so your middle regex matches this one even though it is actually part of the field. OTOH, even though this answer has its weakness, I can definitely check whether my data contain such construction. (btw, I believe there should be a way to use only 1 regex) – justhalf Dec 10 '19 at 05:01
0

Is there any way for you to replace the empty strings with something else when creating the BigQuery csv export? Like replacing "" with "EMPTY_STR"? You could then use a converter function to turn those back into empty strings when using .read_csv().
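
A sketch of what that might look like (the sentinel, file name and column name are assumptions for illustration). Note that a converter receives the raw field string and bypasses pandas' NA handling for that column, so the truly empty field has to be mapped back to a missing value explicitly:

import pandas as pd

def restore(v):
    if v == 'EMPTY_STR':    # sentinel written at export time (assumed)
        return ''
    if v == '':             # truly empty field -> real missing value
        return float('nan')
    return v

df = pd.read_csv('export.csv', converters={'col': restore})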

gosuto
  • 5,422
  • 6
  • 36
  • 57