0

I have a text file and I want to completely remove every line that contains certain characters. For example, in a text file like this, I want the lines that contain Chinese characters to be removed:

A.我不要这些汉字
Ok I see
有人会懂我写的吗?
Why not then?
我看够呛。
This is just an example

$myfile = "somtext.txt";
$handle = fopen($myfile, "r");
$book = fread($handle, filesize($myfile));
fclose($handle);

$book = preg_replace("/\p{Han}+/u","", $book);

echo nl2br($book);

With this code the Chinese characters are deleted, but the punctuation and any alphanumeric characters on those lines are left behind. Moreover, the lines themselves are still there. The result looks like this:

A.
Ok I see
?
Why not then?
。
This is just an example

But I need it to look like this:

Ok I see
Why not then?
This is just an example

EDIT: I want to do this before converting it to an array.

Hasen

4 Answers

2

You wrote that you want to build an array of lines once the unwanted parts of the file have been removed. Instead, you can build the array directly from the acceptable lines as they are read. That way you never have to hold the unwanted lines in memory.

To do that, write a generator that yields only the acceptable lines:

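function getLine($handle, $buffer = 2048, $sep = "\n") {
    // Read one line at a time; stream_get_line() strips the separator.
    while ( false !== $line = stream_get_line($handle, $buffer, $sep) ) {
        // Keep only non-empty lines that contain no Han character at all.
        if ( preg_match('~^\P{Han}+$~u', $line) )
            yield $line;
    }
}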

$myfile = "somtext.txt";

if ( false === $handle = fopen($myfile, "r") )
    throw new Exception("Unable to open file '$myfile'\n");

$result = iterator_to_array(getLine($handle));

fclose($handle);

print_r($result);
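
If you do not need the array at all, the same generator can be consumed lazily with foreach. The following is only a sketch: the php://memory stream and its sample contents are assumptions that stand in for the real file, and getLine() is repeated so the snippet runs on its own.

```php
<?php
// Same generator as above, repeated so this sketch is self-contained.
function getLine($handle, $buffer = 2048, $sep = "\n") {
    while ( false !== $line = stream_get_line($handle, $buffer, $sep) ) {
        if ( preg_match('~^\P{Han}+$~u', $line) )
            yield $line;
    }
}

// Assumption: an in-memory stream stands in for "somtext.txt".
$handle = fopen('php://memory', 'w+');
fwrite($handle, "A.我不要这些汉字\nOk I see\n我看够呛。\nWhy not then?\n");
rewind($handle);

// Each accepted line is handled as soon as it is read; no array is built.
foreach (getLine($handle) as $line) {
    echo $line, "\n";
}

fclose($handle);
```

Note that a line consisting only of punctuation such as 。 would be kept, because that character does not belong to the Han script.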
Casimir et Hippolyte
1

(Note: this requires the Unicode flag.)

You could combine the properties for all the CJK_xxx blocks.
The union of these blocks should cover most CJK characters.
There is no separate script for Chinese, Japanese, Korean and Vietnamese.
These characters are shared between the languages, and I think Han alone might not be enough.

Also, the thing about blocks is that some unused codepoints within them
are reserved for future expansion, but the block reference doesn't change.
It is a way to future-proof Unicode updates where engine writers may not
update their scripts (which contain no unused codepoints) in a timely fashion.

This is just a class to match a single character.

See http://www.unicode.org/faq/han_cjk.html for information on what CJK stands for.

To find/remove the line, use (?m)^.*?[class].*(?:\r?\n|\z), where [class] is one of the character classes below.

[\p{Block=Kangxi_Radicals}\p{Block=CJK_Compatibility}\p{Block=CJK_Compatibility_Forms}\p{Block=CJK_Compatibility_Ideographs}\p{Block=CJK_Compatibility_Ideographs_Supplement}\p{Block=CJK_Radicals_Supplement}\p{Block=CJK_Strokes}\p{Block=CJK_Symbols_And_Punctuation}\p{Block=CJK_Unified_Ideographs}\p{Block=CJK_Unified_Ideographs_Extension_A}\p{Block=CJK_Unified_Ideographs_Extension_B}\p{Block=CJK_Unified_Ideographs_Extension_C}\p{Block=CJK_Unified_Ideographs_Extension_D}\p{Block=CJK_Unified_Ideographs_Extension_E}\p{Block=Enclosed_CJK_Letters_And_Months}]

Or just use the code point ranges, which is the form to use in PHP, since PCRE does not support \p{Block=...}.
These ranges are for Unicode 9.

[\x{2E80}-\x{2FDF}\x{3000}-\x{303F}\x{31C0}-\x{31EF}\x{3200}-\x{4DBF}\x{4E00}-\x{9FFF}\x{F900}-\x{FAFF}\x{FE30}-\x{FE4F}\x{20000}-\x{2A6DF}\x{2A700}-\x{2CEAF}\x{2F800}-\x{2FA1F}]
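
As a rough illustration, the line-removal pattern and the code point class above could be combined in PHP as follows. This is a sketch, not a tested drop-in: the helper name removeLinesMatching() and the sample text are made up for the example, and the ~ delimiter with the m and u modifiers is just one way to express (?m) plus the Unicode flag.

```php
<?php
// Code point ranges copied from the class above (CJK blocks as of Unicode 9).
$cjk = '[\x{2E80}-\x{2FDF}\x{3000}-\x{303F}\x{31C0}-\x{31EF}\x{3200}-\x{4DBF}'
     . '\x{4E00}-\x{9FFF}\x{F900}-\x{FAFF}\x{FE30}-\x{FE4F}'
     . '\x{20000}-\x{2A6DF}\x{2A700}-\x{2CEAF}\x{2F800}-\x{2FA1F}]';

// Hypothetical helper: delete every line containing a character from $class.
function removeLinesMatching(string $text, string $class): string
{
    // m: ^ matches at each line start; u: treat pattern and subject as UTF-8.
    $pattern = '~^.*?' . $class . '.*(?:\r?\n|\z)~mu';

    // preg_replace() returns null on error (e.g. invalid UTF-8);
    // fall back to the unchanged text in that case.
    return preg_replace($pattern, '', $text) ?? $text;
}

// Sample text (an assumption; in practice this would be the file contents).
$book = "A.我不要这些汉字\nOk I see\n我看够呛。\nWhy not then?\n";
echo removeLinesMatching($book, $cjk);
```

Because the class includes the CJK Symbols and Punctuation block, a line that contains only punctuation such as 。 is removed as well. This works on the whole string, so it can be done before the text is split into an array.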

0

Here is a regex that will do what you're looking for:

.+?(?=[^\x00-\x7F]).*(?=\n)\n

If you want an example: https://regex101.com/r/U6ngPi/2

EDIT (Explanation):

Starting on a line, lazily match characters until the next character is a non-ASCII one (.+?(?=[^\x00-\x7F]))

Match the rest of the line (.*)

Check that a newline follows, using a lookahead as before, and then consume the newline character ((?=\n)\n)
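
For reference, here is a minimal sketch of how this pattern might be applied in PHP with preg_replace. The function name removeNonAsciiLines() and the sample text are assumptions made for the example; the u modifier is added so the subject is read as UTF-8.

```php
<?php
// Hypothetical helper: delete every line that matches the pattern above.
function removeNonAsciiLines(string $text): string
{
    // Some characters, then a non-ASCII one, then the rest of the line
    // and its trailing newline.
    $pattern = '/.+?(?=[^\x00-\x7F]).*(?=\n)\n/u';

    // preg_replace() returns null on error; keep the text unchanged then.
    return preg_replace($pattern, '', $text) ?? $text;
}

// Sample text (an assumption; in practice this would be the file contents).
$book = "A.我不要这些汉字\nOk I see\n我看够呛。\nWhy not then?\n";
echo removeNonAsciiLines($book);
```

Three caveats follow from the pattern as written: it removes any line with non-ASCII text (accented Latin letters included), it needs at least one character before the non-ASCII one, and a final line without a trailing newline is never matched.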

andrew
0

Try this as the pattern: "/(.*)\p{Han}+(.*)\n/uD"

It matches the other (non-Chinese) characters on the line as well as the newline at the end, so replacing each match with an empty string removes the whole line.

Rob Anthony