3

Inside my data flow pipeline I would like to add a derived column and its datatype is array. I would like to split the existing column with 1000 characters without breaking words. I think we can use regexSplit,

regexSplit(<string to split> : string, <regex expression> : string) => array

But I do not know which regular expression I can use for split the existing column without breaking words. Please help me to figure it out.

Syam Kumar
  • 343
  • 5
  • 16
  • By "split the existing column with 1000 characters," do you mean the total column length is currently 1,000 characters and you want to split it in half, or do you want each array element to be 1,000 characters? – jdaz Jul 05 '20 at 22:34
  • Yes @jdaz, I want each array element to be 1000 characters. If the 1000th character is middle of the word, the length of the array element is less than 1000. – Syam Kumar Jul 06 '20 at 05:50

2 Answers2

1

I created a workaround for this and it works fine for me.

filter(split(regexReplace(regexReplace(text, `[\t\n\r]`, ``), `(.{1,1000})(?:\s|$)`, `$1~~`), '~~'), #item !="")

I think, we have a better solution than this.

Syam Kumar
  • 343
  • 5
  • 16
0

I wouldn't use a regex for this, but a truncating function like this one, courtesy of TimS:

public static string TruncateAtWord(this string input, int length)
{
    if (input == null || input.Length < length)
        return input;
    int iNextSpace = input.LastIndexOf(" ", length, StringComparison.Ordinal);
    return string.Format("{0}…", input.Substring(0, (iNextSpace > 0) ? iNextSpace : length).Trim());
}

Translated into expression functions it would look something* like this.

substring(Input, 1, iif(locate(Input, ' ', 1000) > 0, locate(Input, ' ', 1000) , length(Input)) )

Since you don't have a lastIndexOf available as an expression function, you would have to default to locate, which means that this expression truncates the string at the first space after the 1000th character.

*I don't have an environment where I can test this.

gijswijs
  • 1,958
  • 19
  • 24
  • Using `regexSplit` treats matched regex as delimiters that are removed. So I believe any extra letters matched with `\S*` would be lost. – jdaz Jul 06 '20 at 06:10
  • @gijswijs we can't write code inside the column derived schema modifier in data factory data flow. Only possibility is write the regular expressions. I have checked with your regular expression, but it did not work for me. – Syam Kumar Jul 06 '20 at 06:13
  • The only option is like regexSplit(text, `/.{0,50}\S*(?:\s|$)/m`) – Syam Kumar Jul 06 '20 at 06:21
  • What about `(?<=.{1000})\s`? If Azure applies that iteratively, it would work, but if it applies it all at once on the original string it won't. – jdaz Jul 06 '20 at 06:30
  • @SyamKumar Do you have this overload available? https://learn.microsoft.com/en-us/dotnet/api/system.text.regularexpressions.regex.split?view=netcore-3.1#System_Text_RegularExpressions_Regex_Split_System_String_System_Int32_System_Int32_ You could use the startAt int to specify the length and then search for the first \s. – gijswijs Jul 06 '20 at 06:46
  • It seems you don't have that overload available to you in Azure Data Factory. So I don't think it can be done, but maybe @jdaz suggestions is worth a try. – gijswijs Jul 06 '20 at 06:50
  • @syamkumar Updated my answer with a non-regex option using Expression functions. – gijswijs Jul 06 '20 at 07:12
  • @gijswijs I have checked with the substring option. But bad luck for me – Syam Kumar Jul 06 '20 at 07:29
  • Sorry to hear that. I still think the solutions should be something with substring and locate, though. – gijswijs Jul 06 '20 at 07:44