1

In our project, we move the data from tables on RDBMS to HDFS using Scala and Spark. Before moving the data, we apply a "regex_replace" on the data to eliminate some discrepancies in the data. Below is the regex_replace:

regexp_replace(
    regexp_replace(
        regexp_replace(
            regexp_replace(
                regexp_replace(..., E'[\\n]+', ' ', 'g' ),
                E'[\\r]+', ' ', 'g'
            ),
            E'[\\t]+', ' ', 'g'
        ),
        E'[\\cA]+', ' ', 'g'
    ),
    E'[\\ca]+', ' ', 'g'
)

What is the meaning of the E that preceeds the single quoted strings in each regexp_replace call?

jwvh
  • 50,871
  • 7
  • 38
  • 64
Bobby
  • 35
  • 1
  • 2
  • 12
  • 2
    Language tag please – Mad Physicist Aug 21 '18 at 15:20
  • 2
    What kind of regex flavour/dialect or language/environment/database are you talking about exactly?! – deceze Aug 21 '18 at 16:06
  • I have a hunch that he's using Java. With a string for a PostgreSQL statement. If so, that `E` is probably to make it verbatim string. So that `E` is not even part of a regex. [example SO post](https://stackoverflow.com/a/938/4003419). – LukStorms Aug 21 '18 at 19:08
  • I am using the Regex expression in Scala. – Bobby Aug 22 '18 at 05:17
  • @MadPhysicist Added the language tag too. – Bobby Aug 22 '18 at 05:19
  • @LukStorms I am trying to use this regex pattern in the language Scala. – Bobby Aug 22 '18 at 18:10
  • I don't know Scala, nor Apache Spark. But it seems that all those (SQL for a PostgreSql database?) `replace_regexp` can be replaced by a single one. F.e.: `regexp_replace(..., E'[\\cA\\ca\\t\\r\\n]+',' ','g')` But then again, that might result in less spaces. – LukStorms Aug 22 '18 at 21:56
  • @LukStorms Yes. It is for PostgreSql – Bobby Aug 23 '18 at 06:07
  • Then the E is to make it an [escape string constant](https://stackoverflow.com/questions/34823158/whats-the-e-before-a-postgres-string) – LukStorms Aug 23 '18 at 06:26

1 Answers1

2

The 'E' is part of the SQL syntax, at least for PostgreSql.
It's used in the SQL to be able to escape characters with a backslash \.

And if such SQL string is put in a scala string (that isn't a raw string), then the backslashes need to be backslashed to get literal backslashes.

So to make the regex in the regex_replace function to see only one literal backslash?

val sqlstring = "select regexp_replace(col1, E'\\\\s+','','g') as col1_without_whitespace from yourtable"

Is equivalent to:

val sqlstring = "select regexp_replace(col1, '\\s+','','g') as col1_without_whitespace from yourtable"

Is equivalent to:

val sqlstring = raw"select regexp_replace(col1, '\s+','','g') as col1_without_whitespace from yourtable"

A SQL Fiddle test for that 'E' can be found here

LukStorms
  • 28,916
  • 5
  • 31
  • 45