I have a file with a ton of lines (almost 1M), as shown below.

736206  " 8214152  "    ""  
736207  "7357074"   ""  
736202  "7904815"   "TEST"  
736203  "8117263"   "TEST"  
736204  "8117263"   "TEST"  
736205  "9074391"   ""  
736221  "8308161"   ""  
736214  "7707114"   ""  
736229  "8215534"   ""  
736242  "9572006"   ""  
736255  "8418162"   ""  
736222  "7347835"   ""  
736230  "9044748"   "TROLL,A"   1999-01-01 00:00:00

I need to put each element into a String[] or a List, without blanks, spaces, tabs, etc., like this:

736230  
9044748
TROLL,A
1999-01-01 00:00:00

I am not good at regex, but I tried a few patterns. All of them were an epic fail:

"\"([^\"]*)\"" ---
"\"([a-z\\s]+)\"" ---
^[^\"]*\"|\"[^\"]*$ ---

Nothing seems to work.

mehdmehd

2 Answers

You may want to read the file line by line and apply a pattern like this one:

[a-z0-9A-Z,-:]+([ ]{1}|)[a-z0-9A-Z,-:]+

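
A minimal sketch of how the pattern above could be applied to one line at a time with java.util.regex. The class name LineTokenizer and the method tokens are hypothetical names chosen for illustration; the pattern itself is copied unchanged from this answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LineTokenizer {

    // Pattern from the answer, unchanged: one or more allowed characters,
    // an optional single space, then one or more allowed characters.
    private static final Pattern TOKEN =
            Pattern.compile("[a-z0-9A-Z,-:]+([ ]{1}|)[a-z0-9A-Z,-:]+");

    // Collects every match of TOKEN found in a single line.
    public static List<String> tokens(String line) {
        List<String> out = new ArrayList<>();
        Matcher m = TOKEN.matcher(line);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        String line = "736230  \"9044748\"   \"TROLL,A\"   1999-01-01 00:00:00";
        // prints [736230, 9044748, TROLL,A, 1999-01-01 00:00:00]
        System.out.println(tokens(line));
    }
}
```

Two properties of this pattern are worth keeping in mind. Inside the character class, ,-: is a range from the comma to the colon, which happens to cover the hyphen, the dot, the slash and the digits. Because both halves of the pattern need at least one character, a field made of a single character is never matched, and two unquoted fields separated by exactly one space come back as one token.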

Freddy

Given your input data:

    String data = "736206  \" 8214152  \"    \"\"  \n"
            + "736207  \"7357074\"   \"\"  \n"
            + "736202  \"7904815\"   \"TEST\"  \n"
            + "736203  \"8117263\"   \"TEST\"  \n"
            + "736204  \"8117263\"   \"TEST\"  \n"
            + "736205  \"9074391\"   \"\"  \n"
            + "736221  \"8308161\"   \"\"  \n"
            + "736214  \"7707114\"   \"\"  \n"
            + "736229  \"8215534\"   \"\"  \n"
            + "736242  \"9572006\"   \"\"  \n"
            + "736255  \"8418162\"   \"\"  \n"
            + "736222  \"7347835\"   \"\"  \n"
            + "736230  \"9044748\"   \"TROLL,A\"   1999-01-01 00:00:00";

Let us remove all the double quotes from data like so:

    data = data.replace("\"", "");

If you print data to console, data will be:

736206   8214152        
736207  7357074     
736202  7904815   TEST  
736203  8117263   TEST  
736204  8117263   TEST  
736205  9074391     
736221  8308161     
736214  7707114     
736229  8215534     
736242  9572006     
736255  8418162     
736222  7347835     
736230  9044748   TROLL,A   1999-01-01 00:00:00

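Now you can see that the individual pieces of information you are trying to isolate are separated by two or more whitespace characters. We can use this cue and a regex to convert the data into an array of strings like so:

    String[] split = data.split("(\\s){2,}");

(\\s){2,} matches every run of two or more consecutive whitespace characters (spaces, tabs, or line breaks), and split cuts the string at each run. The trailing spaces and the line break at the end of each row form such a run too, which is why the rows are separated from one another as well.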

The final output:

736206
8214152
736207
7357074
736202
7904815
TEST
736203
8117263
TEST
736204
8117263
TEST
736205
9074391
736221
8308161
736214
7707114
736229
8215534
736242
9572006
736255
8418162
736222
7347835
736230
9044748
TROLL,A
1999-01-01 00:00:00

With these two basic operations, you can solve the problem without resorting to a complicated regex.
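
The two steps above can be put together as a small runnable sketch that handles one row at a time. The class name QuoteStripSplit and the method fields are hypothetical, and the trim() call and the empty-row check are additions not present in the answer; they only guard against leading whitespace and blank rows.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class QuoteStripSplit {

    // Removes every double quote, trims both ends, then splits on runs of
    // two or more whitespace characters. Returns an empty list for a row
    // that contains nothing but quotes and whitespace.
    public static List<String> fields(String line) {
        String cleaned = line.replace("\"", "").trim();
        if (cleaned.isEmpty()) {
            return new ArrayList<>();
        }
        return Arrays.asList(cleaned.split("\\s{2,}"));
    }

    public static void main(String[] args) {
        String data = "736202  \"7904815\"   \"TEST\"  \n"
                + "736230  \"9044748\"   \"TROLL,A\"   1999-01-01 00:00:00";
        // Handle each row on its own so fields from different rows stay apart.
        for (String line : data.split("\n")) {
            System.out.println(fields(line));
        }
    }
}
```

For a file of about 1M lines, the same method could be called on each line returned by BufferedReader.readLine() instead of loading the whole file into one String.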

Raqib
  • Ho! Thanks mate for your reply. It seems to work well, but sometimes the double-space pattern (\\s){2,} doesn't split correctly even though the gap looks wider than one space. I then get results like [xxx xxx] instead of [xxx, xxx]. Really good idea mate, thank you. I need to push further to find a solution for the case with a single space. – mehdmehd Apr 11 '18 at 15:06
  • Can you provide more examples of input data where it does not work? – Raqib Apr 11 '18 at 15:10
  • I hardly see the resemblance to the so-called "duplicate question"! – SAMUEL MARCHANT Apr 11 '18 at 23:18
  • @raqib Some fields are separated by only one whitespace character. That case was not in my previous example. Another example: 736230_"9044748"_ _ _"TROLL_A"_ _ _1999-01-01 00:00:00 Here the single space occurs between (736230) and ("9044748"), and between ("TROLL) and (A"). With your solution, we then get: - 736230 9044748 - TROLL - A - 1999-01-01 00:00:00 – mehdmehd Apr 12 '18 at 07:31
  • ```data = data.replace("\"", ""); data = data.replaceAll("[0-9]+_[0-9]", " "); data = data.replace("_", " "); String[] split = data.split("(\\s{2,})");``` – Raqib Apr 12 '18 at 13:14
  • @mehdmehd: I think you should come up with a strategy for working with the data you have. Since you know that the data is inconsistent and the rows are not formatted uniformly, you first need to clean the data so that every row follows a standard format. Once you have achieved that, the rest should be pretty straightforward. – Raqib Apr 12 '18 at 13:17
  • You're right, raqib! Thanks for your help. I still got great advice on how to split my string once I have cleaned my data! – mehdmehd Apr 13 '18 at 08:24
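
For the single-space case raised in the comments above, one possible sketch is to tokenize each row with a quote-aware pattern instead of splitting on whitespace. It assumes that the underscores in the comment's example stand for single spaces, that quotes are balanced and never escaped, and that an unquoted timestamp always has the form yyyy-MM-dd HH:mm:ss. The class name QuoteAwareTokenizer and the method tokens are hypothetical, and this is not the approach proposed in either answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class QuoteAwareTokenizer {

    // Alternatives, tried in this order at each position:
    //   1. a double-quoted string (group 1 holds the text between the quotes)
    //   2. an unquoted timestamp such as 1999-01-01 00:00:00 (assumed format)
    //   3. any other run of non-whitespace characters
    private static final Pattern TOKEN = Pattern.compile(
            "\"([^\"]*)\"|(\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2})|(\\S+)");

    // Returns the non-empty fields of one row, with quotes and padding removed.
    public static List<String> tokens(String line) {
        List<String> out = new ArrayList<>();
        Matcher m = TOKEN.matcher(line);
        while (m.find()) {
            String value;
            if (m.group(1) != null) {
                value = m.group(1).trim(); // quoted field, padding stripped
            } else if (m.group(2) != null) {
                value = m.group(2);        // timestamp, inner space kept
            } else {
                value = m.group(3);        // plain unquoted field
            }
            if (!value.isEmpty()) {        // skip empty "" fields
                out.add(value);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String line = "736230 \"9044748\"   \"TROLL A\"   1999-01-01 00:00:00";
        // prints [736230, 9044748, TROLL A, 1999-01-01 00:00:00]
        System.out.println(tokens(line));
    }
}
```

Because the quotes, rather than the amount of whitespace, mark the field boundaries here, a single space between two fields or inside a quoted value no longer matters.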