
Loading in the data

In: import pandas as pd

In: df = pd.read_csv('name', sep=';', encoding='unicode_escape')

In: df.dtypes
Out: amount     object

I have an object column with amounts like 150,01 and 43,69. There are about 5,000 rows.

df['amount']
0           31
1       150,01
2           50
3         54,4
4        32,79
         ...  
4950      25,5
4951      39,5
4952     75,56
4953       5,9
4954     43,69
Name: amount, Length: 4955, dtype: object

Naturally, I tried to convert the series with the locale module, which is supposed to turn it into a float. I got the following error:

In: import locale
    locale.setlocale(locale.LC_NUMERIC, 'en_US.UTF-8')
Out: 'en_US.UTF-8'
In: df['amount'].apply(locale.atof)
Out: ValueError: could not convert string to float: ' -   '

Now that I'm aware that there are non-numeric values in the column, I tried to use isnumeric methods to turn the non-numeric values into NaN.

Unfortunately, due to the comma-separated structure of the values, they all turned into -1.


0      -1
1      -1
2      -1
3      -1
4      -1
       ..
4950   -1
4951   -1
4952   -1
4953   -1
4954   -1
Name: amount, Length: 4955, dtype: int64

How do I turn the "," into "." after first removing the "-" values? I tried .drop() and .truncate(), but they do not help. If I just replace the "," with str.replace, it also causes trouble, since there are non-numeric values.
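For example, a plain replace-and-cast (roughly what I tried; the exact code may have differed) trips over the hyphen placeholder:

df['amount'].str.replace(',', '.', regex=False).astype(float)
# ValueError: could not convert string to float: ' -   '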

Please help!

Documentation that I came across:

- https://stackoverflow.com/questions/21771133/finding-non-numeric-rows-in-dataframe-in-pandas

- https://stackoverflow.com/questions/56315468/replace-comma-and-dot-in-pandas

P.S. This is my first post, please be kind.

    Please provide sample data. – ddejohn Sep 29 '21 at 19:03
  • What are you trying to do, overall? I have a solution that worked for me, but I don't want to post it as a comment if it's not going to be helpful. – W. MacTurk Sep 29 '21 at 19:32
  • Welcome to [Stack Overflow](https://stackoverflow.com/)! It is very difficult to answer your question without seeing both the data and the code that you have written that produces your problem. Please read about how to ask a good question and try to post a [Minimal Reproducible Example](https://stackoverflow.com/help/minimal-reproducible-example) so we can better help you. – itprorh66 Sep 29 '21 at 19:32
  • Sorry all, I hope my new edits are clearer. I aim to have all the values as floats. – nonganalytics Sep 29 '21 at 19:36
  • Where is this dataframe coming from? Are you importing a csv file? And what do you mean by "-" values -- are some of the values negative, or are the entries literally just a hyphen? – W. MacTurk Sep 29 '21 at 19:42
  • @W.MacTurk Yes, I imported a ";"-separated file, encoded through "unicode_escape". This is transaction data; I believe there are no negative values. It is perhaps simply a hyphen. – nonganalytics Sep 29 '21 at 19:47
  • Provide a few sample lines from the original file and the code currently used to read it in – Mark Tolonen Sep 29 '21 at 19:52
  • If you're using anything newer than python 3.3 (or 3.6? can't remember which), you shouldn't be using `unicode_escape` at all. It's an outdated codec that isn't really used and hasn't been maintained. You should try to import values encoded as UTF-8 or even ASCII if at all possible. Also, once your csv has been stored in a dataframe, I believe the delimiter is irrelevant -- the csv values are no longer separated by a comma within the dataframe structure. – W. MacTurk Sep 29 '21 at 19:57
  • @W.MacTurk my Python version is 3.8.5. This dataset is from South America, and I can't encode it in either of the two suggested formats. – nonganalytics Sep 29 '21 at 20:02
  • Try using chardet. `pip install chardet` to get the package, then in your import block, `import chardet`. When importing the file, do something like `with open("C:/path/to/file.csv", 'r') as f: || data = f.read() || result = chardet.detect(data.encode()) || charencode = result['encoding']`, then set the handler back to the beginning and re-read the .csv: `f.seek(0, 0) || data = pd.read_csv(f, delimiter=";", encoding=charenc)` (sorry about the '||'s, had to come up with a way to designate newlines in comments; everything should be singly indented under the `with` statement). – W. MacTurk Sep 29 '21 at 20:08
  • @W.MacTurk comments are not an appropriate place for multi-line code. Post your comment as an answer. – ddejohn Sep 29 '21 at 20:10
  • Note also that often encoding errors when reading .csv files with weird/uncommon encodings can be solved without any additional packages by setting `engine='python'` as a parameter, as in `data = pd.read_csv('C:/path/to/file.csv', engine='python')`. Don't ask me why or how it works :) – W. MacTurk Sep 29 '21 at 20:11
  • @ddejohn My apologies. I didn't want to post it as an answer because I had not yet answered OP's question. – W. MacTurk Sep 29 '21 at 20:13

2 Answers


Sounds like you have a European-style CSV similar to the following. If your format is different, provide actual sample data, as many of the comments have asked:

data.csv

thing;amount
thing1;31
thing2;150,01
thing3;50
thing4;54,4
thing5;1.500,22

To read it, specify the field separator, decimal separator, and thousands separator as needed:

import pandas as pd

df = pd.read_csv('data.csv', sep=';', decimal=',', thousands='.')
print(df)

Output:

    thing   amount
0  thing1    31.00
1  thing2   150.01
2  thing3    50.00
3  thing4    54.40
4  thing5  1500.22
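
The question's error message suggests the non-numeric entries are literally a hyphen padded with spaces (' -   '). If that is the case, they can be mapped to NaN while reading by also passing na_values; a sketch under that assumption (adjust the string to whatever the real file contains):

import pandas as pd

# ' -   ' is the placeholder string shown in the question's ValueError
df = pd.read_csv('data.csv', sep=';', decimal=',', thousands='.',
                 na_values=[' -   '])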
Mark Tolonen
  • Hi Mark, firstly, thanks. I tried your method, but it does not work unless I use encoding='unicode_escape'. Furthermore, the decimal and thousands arguments do not do anything on my dataset. – nonganalytics Sep 29 '21 at 20:11
  • (traceback posted across several comments) Running `df = pd.read_csv('base_poupacerto_20200519.csv', sep=';', decimal=',', thousands='.')` fails inside `pandas._libs.parsers` with: `UnicodeDecodeError: 'utf-8' codec can't decode byte 0x82 in position 5: invalid start byte` – nonganalytics Sep 29 '21 at 20:18–20:21
  • @nonganalytics Again, edit your question and post a [mcve] with sample input data (**as an input text file, not a DataFrame print**) and the code you have tried to read your data. I guessed because that information is missing from your question. You'll get better help if you *are specific about how to reproduce the issue*. You can add `encoding='unicode-escape'`, but that is about the worst way to export data, and you didn't give a sample. – Mark Tolonen Sep 29 '21 at 20:57

Posting as an answer since it contains multi-line code, despite not truly answering your question (yet):

Try using chardet. pip install chardet to get the package, then in your import block, add import chardet.

When importing the file, do something like:

# open in binary mode so chardet can inspect the raw bytes before any decoding happens
with open("C:/path/to/file.csv", 'rb') as f:
    raw = f.read()
    result = chardet.detect(raw)
    charencode = result['encoding']

    # now re-set the handler to the beginning and re-read the file with the detected encoding:
    f.seek(0, 0)
    data = pd.read_csv(f, delimiter=';', encoding=charencode)

Alternatively, for reasons I cannot fathom, passing engine='python' as a parameter often works. You'd just do

data = pd.read_csv('C:/path/to/file.csv', engine='python')

@Mark Tolonen has a more elegant approach to standardizing the actual data, but my (hacky) way of doing it was to just write a function:

def stripThousands(df_column):
    # drop the commas, then coerce anything that still isn't numeric to NaN
    df_column.replace(',', '', regex=True, inplace=True)
    df_column = df_column.apply(pd.to_numeric, errors='coerce')
    return df_column

If you don't care about the entries that are just hyphens, you could use a function like

def screw_hyphens(column):
    # needs numpy imported as np; replaces the hyphen entries with NaN in place
    column.replace(['-'], np.nan, inplace=True)

or, if np.nan values will be a problem, you can just use column.replace('-', '', inplace=True) instead.
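
Putting the helpers together on the question's 'amount' column might look like this (a hypothetical usage sketch; the function and column names are the ones used above, and df is assumed to be the question's DataFrame):

import numpy as np
import pandas as pd

screw_hyphens(df['amount'])                   # hyphen placeholders -> NaN, modified in place
df['amount'] = stripThousands(df['amount'])   # strip commas, coerce what's left to numeric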

EDIT: there was a typo in the block outlining the usage of chardet. It should be correct now (previously the end of the last line was encoding=charenc).

W. MacTurk
  • Thanks for helping! This error occurred with the first method: UnicodeDecodeError: 'utf-8' codec can't decode byte 0x82 in position 253: invalid start byte. Sadly, the engine method did not work. I'll try the function route! – nonganalytics Sep 29 '21 at 20:33
  • When attempting which method, the `engine='python'` or the `chardet`? This is proving to be a difficult nut to crack. If you want to send me the file in question I'll see if I can put something together for you. By the way, is this just one file you need to work with, or will you need to turn this into a working script and process .csv files with weird encodings often? – W. MacTurk Sep 29 '21 at 20:36
  • Just one file! How can I forward you this? Thanks. – nonganalytics Sep 29 '21 at 20:37
  • (I hope I'm not breaking any rules by posting contact info) -- just email it to wmacturk@hivresearch.org. Alternatively, upload it to something like google drive and post a link – W. MacTurk Sep 29 '21 at 20:40