4

I have a column in my pandas DataFrame df that contains strings with some trailing hex-encoded NULLs (\x00). At least I think that's what they are. When I try to replace them with:

df['SOPInstanceUID'] = df['SOPInstanceUID'].replace('\x00', '')

the column is not updated. When I do the same with

df['SOPInstanceUID'] = df['SOPInstanceUID'].str.replace('\x00', '')

it's working fine. What's the difference here? (SOPInstanceUID is not an index.)

thanks

landge

2 Answers

10

The former looks for exact matches of the whole cell value; the latter looks for matches in any part of the string, which is why the latter works for you.

The str methods mirror the standard Python string methods but are vectorised.

EdChum
  • Not OP but thank you for the info. Just a silly question, what you mean by vectorised here? – Bowen Liu Oct 24 '18 at 20:16
  • @BowenLiu vectorised here means instead of operating on a single row or value at a time, we operate on the entire column (although in practice it really means multiple values) so it's significantly faster – EdChum Oct 25 '18 at 16:11
  • Thanks a lot for your explanation. So it can operate on multiple values at once, which saves computation time? – Bowen Liu Oct 31 '18 at 14:01
  • @BowenLiu correct vectorization is in my opinion why you should be using numpy or pandas. Otherwise it's just a fancy data structure that makes indexing easier without any performance gain – EdChum Oct 31 '18 at 14:03
  • Amazing! I never thought about the reasons behind using pandas and numpy for data handling. I just use it because everyone uses it and it has so many useful functions. But the reason for these functions to work well and fast is that they vectorize all the data? Could you explain in layman's terms how it could do it please? I always thought it iterates through objects one by one just like for loops. – Bowen Liu Oct 31 '18 at 14:08
  • lots of articles on this: https://hackernoon.com/speeding-up-your-code-2-vectorizing-the-loops-with-numpy-e380e939bed3 https://stackoverflow.com/questions/47755442/what-is-vectorization https://datascience.blog.wzb.eu/2018/02/02/vectorization-and-parallelization-in-python-with-numpy-and-pandas/ https://realpython.com/numpy-array-programming/. This is why storing non-scalar values is counter-productive, I don't understand why people store lists and other array like structures in a pandas dataframe, you lose vectorization – EdChum Oct 31 '18 at 14:12
  • Thanks for the link. I'm still trying to wrap my head around what you said. So when you said "non-scalar", that means objects that hold more than one individual value? And when people store "lists and other array like structures in a pandas dataframe", how do they do that? In each cell of the dataframe, instead of storing a scalar value, namely a single value, they store a list or array? Sorry for being slow. Just trying to make sure I understand. Thanks again – Bowen Liu Oct 31 '18 at 18:16
  • @BowenLiu yes that is correct, you can google this. SO isn't a forum so continuous chatting is counter-productive. – EdChum Oct 31 '18 at 21:35
2

Your pattern was neither a regex nor meant as an exact match of the whole value, which is why str.replace worked.

str.replace(old, new[, count])

Return a copy of the string with all occurrences of substring old replaced by new. If the optional argument count is given, only the first count occurrences are replaced.

DataFrame.replace(to_replace=None, value=None, inplace=False, limit=None, regex=False, method='pad', axis=None)

parameter: to_replace : str, regex, list, dict, Series, numeric, or None

str or regex:
  • str: string exactly matching to_replace will be replaced with value
  • regex: regexs matching to_replace will be replaced with value
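As a hedged illustration of the regex option quoted above, the sketch below passes regex=True to Series.replace so that the pattern is applied inside each string rather than compared with the whole value. The sample values are hypothetical, and dtype=object is an assumption made only to keep the example on plain Python strings:

```python
import pandas as pd

# Hypothetical sample data with trailing NUL characters.
s = pd.Series(["1.2.840.1\x00\x00", "1.2.840.2"], dtype=object)

# With regex=True, to_replace is treated as a regular expression.
# r"\x00+$" matches one or more NULs at the end of the string.
cleaned = s.replace(r"\x00+$", "", regex=True)

print(cleaned.tolist())
```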

The backslash sequences are not literally in the string: it contains unescaped control characters, which Python displays using hexadecimal notation.

You can remove all non-word characters in the following way:

import re

re.sub(r'[^\w]', '', '\x00\x00\x00\x08\x01\x008\xe6\x7f')
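Note that [^\w] also removes punctuation such as the dots in a UID. If only the trailing NULs described in the question need to go, a narrower option is Series.str.rstrip. This is a sketch on hypothetical sample values, not the asker's data:

```python
import pandas as pd

# Hypothetical sample data: UID-like strings, one padded with trailing NULs.
s = pd.Series(["1.2.840.1\x00\x00", "1.2.840.2"])

# str.rstrip removes the given characters from the right end only,
# leaving dots and digits inside the UID untouched.
stripped = s.str.rstrip("\x00")

print(stripped.tolist())
```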
SerialDev
  • Ok, thanks to both of you. But when I call replace like this: df['SOPInstanceUID'].replace('\x00', '') I get the string back without trailing NULLs!? So it seems to match, or is it just some kind of output formatting that doesn't show the NULLs? – landge Jun 30 '16 at 08:14
  • you'll need to post raw data and code that demonstrates this, also your comment contradicts your question statement in that it didn't work – EdChum Jun 30 '16 at 08:17
  • Yes, sorry. I meant that when I call the method without assigning back to the column, I get a string output in Jupyter without the trailing NULLs. When assigning as in my post, nothing happens. Confusing. – landge Jun 30 '16 at 08:22
  • CMari, thanks. That was the missing part! I don't understand it thoroughly, but I'll try. – landge Jun 30 '16 at 08:58