I am trying to convert part of each line in a text file to lowercase.

Example input:

example@EXAMPLE.com;example
example@EXAMPLE.com:exmaple

Example output:

example@example.com;example
example@example.com:exmaple

Pseudo code:

if line has "@" and (":" or ";")
    replace the text between "@" and the ":" or ";" with its lowercase form

But I don't even know which tools to use. Any help is welcome.

John W

2 Answers

Use sed to solve this problem:

sed -e 's/\(.*@\)\([A-Za-z.]\+\)\([;:].*\)/\1\L\2\E\3/' input_file.txt

Regex explanation:

\(.*@\) - matches "example@"

\([A-Za-z.]\+\) - matches "EXAMPLE.com"

\([;:].*\) - matches ";example" or ":exmaple"

\L lowercases the replacement text that follows it, up to the next \E or \U (a GNU sed extension).
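To see what \L and \E do in the replacement, here is a minimal sketch, assuming GNU sed and a made-up sample line whose password contains capitals. Without \E, lowercasing carries on to the end of the replacement and also hits the third group:

```shell
#!/usr/bin/env bash
# Assumes GNU sed: \+, \L and \E are GNU extensions.
# The sample line is hypothetical, chosen so the password has capitals.
line='user@EXAMPLE.com;PassWord'

# Without \E, \L stays active and lowercases group 3 (the password) too.
no_end=$(printf '%s\n' "$line" |
  sed -e 's/\(.*@\)\([A-Za-z.]\+\)\([;:].*\)/\1\L\2\3/')

# With \E, lowercasing stops after group 2 (the domain).
with_end=$(printf '%s\n' "$line" |
  sed -e 's/\(.*@\)\([A-Za-z.]\+\)\([;:].*\)/\1\L\2\E\3/')

printf '%s\n%s\n' "$no_end" "$with_end"
```

The first line printed has the password lowercased, the second keeps it intact, so \E matters whenever the part after the delimiter is case-sensitive.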

To edit the file in place, use sed's -i flag.

For example:

sed -i -e 's/\(.*@\)\([A-Za-z.]\+\)\([;:].*\)/\1\L\2\E\3/' input_file.txt
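Since -i overwrites the file, it may be worth keeping a backup. A small sketch, assuming GNU sed, where a suffix attached to -i makes sed save the original under that extension (the temporary directory and file contents are made up for the demo):

```shell
#!/usr/bin/env bash
# Assumes GNU sed; the file name and contents are made up for the demo.
tmpdir=$(mktemp -d)
file="$tmpdir/input_file.txt"
printf '%s\n' 'example@EXAMPLE.com;example' 'example@EXAMPLE.com:exmaple' > "$file"

# -i.bak edits the file in place and keeps the original as input_file.txt.bak
sed -i.bak -e 's/\(.*@\)\([A-Za-z.]\+\)\([;:].*\)/\1\L\2\E\3/' "$file"

cat "$file"        # edited content
cat "$file.bak"    # untouched original
# Remove "$tmpdir" by hand once you are done with the demo.
```

If the result looks wrong, the .bak file can simply be moved back over the edited one.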
sprabhakaran

If you have a lot of data, awk will be faster than a shell loop. The sed solutions are fine, but this works too:

$: awk '-F[;:]' '{ printf "%s;%s\n", tolower($1), $2 }' x
example@example.com;exaMple
example@example.com;eXmaple
example@example.com;exAmple
example@example.com;exmaplE
example_example.com;Example
example_example.com;eXmaple
example@example.com,example;

The -F option sets the field separator to the bracket expression [;:], so each line is split on either ; or :, and tolower($1) lowercases the first field. Note that I arbitrarily replaced the delimiter with a standardized ; in the output. If that doesn't work for you, this might not be the best solution; stick with sed.
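If the original delimiter has to survive, one possible variation is to locate it with awk's match() instead of splitting on it. This is only a sketch: lower_first_field is a hypothetical helper name, and unlike the one-liner above it leaves lines without a ; or : untouched.

```shell
#!/usr/bin/env bash
# lower_first_field is a hypothetical helper name used for illustration.
# match() sets RSTART to the position of the first ; or : on the line.
lower_first_field() {
  awk '{
    if (match($0, /[;:]/))
      # Lowercase everything before the delimiter, keep the rest verbatim.
      print tolower(substr($0, 1, RSTART - 1)) substr($0, RSTART)
    else
      print    # no delimiter: leave the line untouched
  }'
}

printf '%s\n' 'Example@EXAMPLE.cOm;exaMple' 'exampLe@EXAMPLE.coM:eXmaple' |
  lower_first_field
```

Like the original command, this lowercases the whole first field, including the part before the @.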

sprabhakaran beat me to it with a practically identical sed solution while I was initially typing, lol. :)

sed can do it:

$: cat x
Example@EXAMPLE.cOm;exaMple
exampLe@EXAMPLE.coM:eXmaple
example@EXAMPLE.com;exAmple
example@EXAMPLE.com:exmaplE
example_EXAMPLE.com;Example
example_EXAMPLE.com:eXmaple
example@EXAMPLE.com,examPle

$: sed -E '/@.+[;:]/s/^(.*)@(.*)([;:])(.*)/\1@\L\2\E\3\4/' x
Example@example.com;exaMple
exampLe@example.com:eXmaple
example@example.com;exAmple
example@example.com:exmaplE
example_EXAMPLE.com;Example
example_EXAMPLE.com:eXmaple
example@EXAMPLE.com,examPle

\L says to begin lowercasing until \E (end) or \U (begin uppercasing). These escapes are GNU sed extensions.

This skips lines that don't have both an @ and a later delimiter matching [;:] (either ; or :).
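One caveat with greedy (.*) groups: if the password itself contains a ; or :, the last one on the line is treated as the delimiter and part of the password gets lowercased. A hedged sketch of an alternative that stops at the first delimiter, assuming GNU sed and a made-up sample line:

```shell
#!/usr/bin/env bash
# Assumes GNU sed (-E, \L, \E). The sample password is made up.
line='user@EXAMPLE.com;Pass:Word'

# Greedy groups: the last : is taken as the delimiter, so ";Pass" is lowercased.
greedy=$(printf '%s\n' "$line" |
  sed -E '/@.+[;:]/s/^(.*)@(.*)([;:])(.*)/\1@\L\2\E\3\4/')

# Negated classes stop at the first @ and the first ; or : after it.
first=$(printf '%s\n' "$line" |
  sed -E 's/^([^@]*)@([^;:]*)([;:])/\1@\L\2\E\3/')

printf '%s\n%s\n' "$greedy" "$first"
```

The second form needs no address in front of the s command, because a line without an @ followed by a delimiter simply fails to match and is printed unchanged.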

For small datasets, native bash might be easier.

It might be a lot simpler, however, to just downcase the whole thing.

$: declare -l line
$: while read line
> do echo "$line"
> done < x
example@example.com;example
example@example.com:exmaple
example@example.com;example
example@example.com:exmaple
example_example.com;example
example_example.com:exmaple
example@example.com,example

declare -l makes a variable lowercase anything assigned to it.
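For comparison, a short sketch of the attribute next to a one-off conversion with the ${var,,} expansion. Both need bash 4 or newer, and the variable names are made up:

```shell
#!/usr/bin/env bash
# Requires bash 4 or newer for both declare -l and ${var,,}.
declare -l lowered            # the attribute applies on every assignment
lowered='Example@EXAMPLE.cOm'

mixed='Example@EXAMPLE.cOm'
once=${mixed,,}               # one-off conversion; $mixed itself is unchanged

printf '%s\n%s\n%s\n' "$lowered" "$once" "$mixed"
```

The attribute is handy inside a read loop, while the expansion fits when only one use of the value should be lowercased.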

Since passwords are case-sensitive, lowercasing the whole line won't do; parse the parts separately.

$: declare -l email
$: while IFS="$IFS:;" read -r email pass
> do echo "$email [$pass]"
> done < x
example@example.com [exaMple]
example@example.com [eXmaple]
example@example.com [exAmple]
example@example.com [exmaplE]
example_example.com [Example]
example_example.com [eXmaple]
example@example.com,example []

As long as the record is properly formatted it works great. I assume you can check for errors or trust your data.
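If the data can't be trusted, the check could be done in bash itself. The following is a sketch only: normalize_record is a hypothetical helper, the record layout (local@domain, then ; or :, then password) is assumed from the question, and unlike the loop above it lowercases just the domain and keeps the original delimiter.

```shell
#!/usr/bin/env bash
# normalize_record is a hypothetical helper; the record format
# (local@domain, then ; or :, then password) is assumed from the question.
normalize_record() {
  local line=$1
  local re='^([^@]+)@([^;:]+)([;:])(.*)$'
  if [[ $line =~ $re ]]; then
    # Lowercase only the domain; keep the delimiter and password as-is.
    printf '%s@%s%s%s\n' "${BASH_REMATCH[1]}" "${BASH_REMATCH[2],,}" \
                         "${BASH_REMATCH[3]}" "${BASH_REMATCH[4]}"
  else
    printf 'skipping malformed record: %s\n' "$line" >&2
    return 1
  fi
}

while IFS= read -r line; do
  normalize_record "$line" || continue   # malformed lines are reported on stderr
done <<'EOF'
exampLe@EXAMPLE.coM:eXmaple
example@EXAMPLE.com,examPle
EOF
```

Here the first record is printed with a lowercased domain and the second, which has a comma instead of a delimiter, is reported and skipped.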

Paul Hodges