
My dataframe looks like this, and I want to split it on the colon (:):

+------------------+
|Name:Roll_no:Class|
+------------------+
|      #ab:cd#:23:C|
|      #sd:ps#:34:A|
|      #ra:kh#:14:H|
|      #ku:pa#:36:S|
|      #ra:sh#:50:P|
+------------------+

and I want my dataframe to look like this:

+-----+-------+-----+
| Name|Roll_no|Class|
+-----+-------+-----+
|ab:cd|     23|    C|
|sd:ps|     34|    A|
|ra:kh|     14|    H|
|ku:pa|     36|    S|
|ra:sh|     50|    P|
+-----+-------+-----+
– Cool Triks

3 Answers


If you need to split only on the last 2 colons, use Series.str.rsplit with n=2, then set the columns by splitting the original column name, and finally remove the first and last # by indexing:

col = 'Name:Roll_no:Class'
# split on the last 2 colons only
df1 = df[col].str.rsplit(':', n=2, expand=True)
# the new column names come from splitting the original column name
df1.columns = col.split(':')
# drop the first and last characters (the #s) by indexing
df1['Name'] = df1['Name'].str[1:-1]
# or, to strip only leading/trailing # characters:
# df1['Name'] = df1['Name'].str.strip('#')
print(df1)
    Name Roll_no Class
0  ab:cd      23     C
1  sd:ps      34     A
2  ra:kh      14     H
3  ku:pa      36     S
4  ra:sh      50     P
– jezrael

Use read_csv() with sep=':' and quotechar='#':

import io
import pandas as pd

# avoid shadowing the built-in str; note there must be no trailing
# spaces after each line, or the last field would include them
data = """Name:Roll_no:Class
#ab:cd#:23:C
#sd:ps#:34:A
#ra:kh#:14:H
#ku:pa#:36:S
#ra:sh#:50:P"""

df = pd.read_csv(io.StringIO(data), sep=':', quotechar='#')
>>> df
    Name  Roll_no Class
0  ab:cd       23     C
1  sd:ps       34     A
2  ra:kh       14     H
3  ku:pa       36     S
4  ra:sh       50     P
– jxc
  • @CoolTriks for pyspark, you can do something similar: `df = spark.read.csv('/file/path', quote='#', sep=':', header=True)`. Or, for an existing column, call functions.split() twice (first on '#' and then on ':') to retrieve the needed fields. – jxc Jun 07 '19 at 19:30
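
For reference, a minimal sketch of that two-split approach from the comment (an untested illustration, assuming df is the single-column DataFrame shown in the question):

from pyspark.sql.functions import split

# assumption: df has the one raw column 'Name:Roll_no:Class'
c = df.columns[0]
new_columns = c.split(":")

parts = split(df[c], "#")    # e.g. ['', 'ab:cd', ':23:C']
rest = split(parts[2], ":")  # e.g. ['', '23', 'C']

df.select(
    parts[1].alias(new_columns[0]),  # Name
    rest[1].alias(new_columns[1]),   # Roll_no
    rest[2].alias(new_columns[2]),   # Class
).show()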

Here are two ways you could do this in pyspark:

Specify the separator and the quote on read

If you're reading the data from a file, you can use spark.read.csv with the following arguments:

df = spark.read.csv("path/to/file", sep=":", quote="#", header=True)
df.show()
#+-----+-------+-----+
#| Name|Roll_no|Class|
#+-----+-------+-----+
#|ab:cd|     23|    C|
#|sd:ps|     34|    A|
#|ra:kh|     14|    H|
#|ku:pa|     36|    S|
#|ra:sh|     50|    P|
#+-----+-------+-----+

Use Regular Expressions

If you're unable to change the way the data is read and you're starting with the DataFrame shown in the question, you can use regular expressions to get the desired output.

First, get the new column names by splitting the existing column name on ":":

new_columns = df.columns[0].split(":")
print(new_columns)
#['Name', 'Roll_no', 'Class']

For the Name column, you need to extract the data between the #s. For the other two columns, you need to remove the string between the #s (and the following ":"), then use pyspark.sql.functions.split to extract the components:

from pyspark.sql.functions import regexp_extract, regexp_replace, split

df.withColumn(new_columns[0], regexp_extract(df.columns[0], r"(?<=#).+(?=#)", 0))\
    .withColumn(new_columns[1], split(regexp_replace(df.columns[0], "#.+#:", ""), ":")[0])\
    .withColumn(new_columns[2], split(regexp_replace(df.columns[0], "#.+#:", ""), ":")[1])\
    .select(*new_columns)\
    .show()
#+-----+-------+-----+
#| Name|Roll_no|Class|
#+-----+-------+-----+
#|ab:cd|     23|    C|
#|sd:ps|     34|    A|
#|ra:kh|     14|    H|
#|ku:pa|     36|    S|
#|ra:sh|     50|    P|
#+-----+-------+-----+
– pault