Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support for multiline search, and for searching for 0x0A bytes. #400

Open
mc-butler opened this issue Jun 10, 2009 · 27 comments
Open

Support for multiline search, and for searching for 0x0A bytes. #400

mc-butler opened this issue Jun 10, 2009 · 27 comments
Assignees
Labels
area: search Search subsystem prio: medium Has the potential to affect progress

Comments

@mc-butler
Copy link

Important

This issue was migrated from Trac:

Origin https://midnight-commander.org/ticket/400
Reporter Hubbitus (@Hubbitus)

So, as I known, now search made line by line, and no chance match/replace pattern like "Some text\n+Another text".

This possibilities is highly appreciated.

Note

Original attachments:

@mc-butler
Copy link
Author

Changed by slavazanko (@slavaz) on Jun 11, 2009 at 9:30 UTC (comment 1)

  • Milestone changed from 4.7 to future releases

@mc-butler
Copy link
Author

Changed by sxmboer (sxmboer@….net) on Nov 18, 2009 at 16:50 UTC

  • Severity set to no branch

Replying to Hubbitus (#400):

So, as I known, now search made line by line, and no chance match/replace pattern like "Some text\n+Another text".

This possibilities is highly appreciated.

This is already possible by typing CTRL-Q, Enter for the newline
This works for search and search and replace.

@mc-butler
Copy link
Author

Changed by Hubbitus (@Hubbitus) on Nov 18, 2009 at 22:23 UTC (comment 3)

By CTRL+Q pressing "Insert Literal" dialog appeared.
mc-4.6.99.3-0.9.52.gd40065d

@mc-butler
Copy link
Author

Changed by mooffie (@mooffie) on Jun 11, 2015 at 14:24 UTC (comment 4)

  • Branch state set to no branch

No, one can't search over multiple lines in mcedit.

The search library searches line by line. See mc_search__run_regex() in lib/search/regex.c. There's a loop there that accumulates a line:

   while (TRUE) {
      ...
      if (current_chr == '\n' || ...)
        break;
   }

(You can find the last \n on a line, but this has no real benefit. BTW, you don't have to use CTRL-Q: if you choose "Regular expression" you can simply type the two characters "\n".)

mc2 comes with a snippet that lets one search over multiple lines.

@mc-butler
Copy link
Author

Changed by mooffie (@mooffie) on Sep 26, 2016 at 15:48 UTC (comment 5)

We can solve this by adding to the search dialog a "[x] Search over lines" checkbox what would turn off the "current_chr == '\n'" check I quoted in comment 4. This would make the text accumulate in one giant string instead of line-by-line (but we should set some limit in case one searches in a 20 terabyte file...).

If we detect a newline in the regexp, we'll search-over-lines automatically. But having a checkbox is good for the case one wants to search for things like "struct \{.*?\}", where it's a dot that stands for a newline.

@mc-butler
Copy link
Author

Changed by andrew_b (@aborodin) on Sep 26, 2016 at 17:15 UTC (comment 6)

  • Component changed from mcedit to mc-search

@mc-butler
Copy link
Author

Changed by andrew_b (@aborodin) on Sep 26, 2016 at 17:26 UTC (comment 7)

Ticket #3454 has been marked as a duplicate of this ticket.

@mc-butler
Copy link
Author

Changed by andrew_b (@aborodin) on Sep 26, 2016 at 17:28 UTC (comment 8)

  • Blocked by set to #3694

@mc-butler
Copy link
Author

Changed by mooffie (@mooffie) on Dec 7, 2016 at 21:05 UTC (comment 9)

  • Summary changed from Please add multiline search in mcedit to Support for multiline search, and for searching for 0x0A bytes.
  • Type changed from enhancement to defect

Since we can't find 0x0A bytes (in hex search), I'm changing this to 'defect'.

@mc-butler
Copy link
Author

Changed by mooffie (@mooffie) on Dec 7, 2016 at 21:15 UTC (comment 10)

Ok, here's a series of three small patches to fix this ticket:

What's in it?

(1) "introduce slurp mode"

This mode disables the "\n" chunking, which, as comment comment:4 explains, is the cause of all the problems. There's very little actual code here.

(2) "hex search: make it possible to search for 0x0A bytes."

This patch uses slurp mode to fix the hex search bug.

(3) "mcedit: make it possible to search across lines."

This patch uses slurp mode to implement a new feature: letting users search across lines. It adds an "Across lines" checkbox to the search and search-replace dialogs:

┌─────────────────────── Search ───────────────────────┐
│ Enter search string:                                 │
│ \{.*?\}                                          [^] │
├──────────────────────────────────────────────────────┤
│ ( ) Normal                [ ] Case sensitive         │
│ (*) Regular expression    [ ] Backwards              │
│ ( ) Hexadecimal           [ ] In selection           │
│ ( ) Wildcard search       [ ] Whole words            │
│                           [x] Across lines           │
│                           [ ] All charsets           │
├──────────────────────────────────────────────────────┤
│           [< OK >] [ Find all ] [ Cancel ]           │
└──────────────────────────────────────────────────────┘

(In comment:5 I also suggested enabling this feature automatically if the pattern has newlines in it, but I no longer think this "extra wisdom" it a good idea; it will just add confusion.)

What's next?

Andrew, if you're fine with this solution, let me know and I'll work on some refinements:

  • Work on the 'todo' mentioned inside patch (1).
  • Write help text for the "Across lines" checkbox. As Help for editor dialogs #1561 points out, there's no help page for the search box, so this will be done in a separate ticket.
  • Add the "Across lines" checkbox to the viewer's search dialog as well.
  • Automatically turn off the "Across lines" checkbox when the user clicks radio other than "Regular expression" (just so the user doesn't forget to turn it off). We may also disable it.

Performance

Initially I thought backwards-search, which is not implemented quite efficiently, would put the kibosh on slurp mode. Surprisingly, however, I didn't notice a substantial difference in performance between slurp and non-slurp mode when doing backwards-search. Neither was the memory strained.

Maintainability

There's very little code added, and the patches essentially modify only two lines --in regex.c. To me this looks like a safe solution.

@mc-butler
Copy link
Author

Changed by mooffie (@mooffie) on Dec 7, 2016 at 21:16 UTC

@mc-butler
Copy link
Author

Changed by mooffie (@mooffie) on Dec 7, 2016 at 21:17 UTC

@mc-butler
Copy link
Author

Changed by mooffie (@mooffie) on Dec 7, 2016 at 21:17 UTC

@mc-butler
Copy link
Author

Changed by zaytsev (@zyv) on Dec 8, 2016 at 6:23 UTC (comment 11)

"Across lines" sounds a bit weird to me, do you know how it is done/called in other editors? I quickly checked gedit and kate, and they don't seem to have such a checkbox, but if the pattern contains \n, they capture multiple lines by default... Is there a downside to this behavior?

@mc-butler
Copy link
Author

Changed by mooffie (@mooffie) on Dec 8, 2016 at 13:13 UTC (comment 11.12)

Replying to zaytsev:

"Across lines" sounds a bit weird to me


I had to pick something that matches the grammatical setting of the surrounding checkboxes: "[search] backwards", "[search] in selection", "[search] whole words", "[search] all charsets", and therefore "[search] across lines". I don't mind changing it once somebody comes up with an alternative.

I quickly checked gedit and kate, and they don't seem to have such a checkbox, but if the pattern contains \n, they capture multiple lines by default...


(These editors evidently aren't searching the text line-by-line like we do, so they don't face the challenge of having to detect when to turn off this behavior.)

if the pattern contains \n, they capture multiple lines by default... Is there a downside to this behavior?


First, let's see if we agree that we have to have a checkbox at all. Consider "struct \{.*?\}". We can't know that the user intends for the "." to match any char. So we have to let her tell us. A checkbox seems a straightforward way. (BTW, the regex world has a somewhat similar virtual checkbox: the //s flag (I say "somewhat" because MC unconditionally turns G_REGEX_DOTALL on).)

Once we agree that a checkbox is necessary, the bonus issue of automatically detecting \n can be blissfully handled independently, in a separate ticket.

if the pattern contains \n, they capture multiple lines by default... Is there a downside to this behavior?


I'm not against the idea. But then there's the question of how to correctly detect \n, and whether we'll surprise the user (i.e., "Begin.*\n.*End" won't match just two lines). We'll have discussions that have the potential to hold up the ticket for months and years.

So I think there are two possibilities:

(A) We can delegate the \n detection to a separate ticket.
(B) We can split this ticket into two: patches (1) and (2) will go back into ticket #3454, and hopefully get committed. Patch (3) will stay here and be discussed further.

@mc-butler
Copy link
Author

Changed by ossi (@ossilator) on Dec 26, 2016 at 13:05 UTC (comment 13)

"spanning lines" seems like a reasonable label.

defaulting to this behavior also seems reasonable. the user just needs to use "[^\n]" instead of ".".

@mc-butler
Copy link
Author

Changed by andrew_b (@aborodin) on Jul 28, 2022 at 4:39 UTC (comment 14)

Ticket #4395 has been marked as a duplicate of this ticket.

@mc-butler
Copy link
Author

Changed by sxmboer2 (sxmboer@….com) on Jul 29, 2022 at 2:31 UTC (comment 15)

I submitted a fix as ticket #4395, but I probably should have added a comment here and uploaded the bugfix by attaching the diff file here.
The diff file I am attaching fixes various search issue related to multi-line searches and replacements; both in normal mode as well as in regex search mode.
I changed the boundary conditions because testing in search and replaces showed that the search consistently found the very end of the file as a hit (when searching for "\n\n" even when only one newline exists at the end of the file buffer.

@mc-butler
Copy link
Author

Changed by sxmboer2 (sxmboer@….com) on Jul 29, 2022 at 2:33 UTC

@mc-butler
Copy link
Author

Changed by andrew_b (@aborodin) on Jul 29, 2022 at 16:57 UTC (comment 15.16)

Replying to sxmboer2:

The diff file I am attaching fixes various search issue related to multi-line searches and replacements; both in normal mode as well as in regex search mode.

A tests in tests/lib/search are highly desirable for this patch.

@mc-butler
Copy link
Author

Changed by sxmboer (sxmboer@….net) on Aug 7, 2022 at 2:14 UTC (comment 17)

Test suite was a good idea, but it took me a significant amount of time to figure out how to use the unit test system. The added source code includes tests that are somewhat specific to this ticket.
It should be easy to add tests for the other search modes, whenever they are relevant.
Adding patch test-search.diff

@mc-butler
Copy link
Author

Changed by sxmboer (sxmboer@….net) on Aug 7, 2022 at 2:20 UTC

Adding a test suite for lib/search

@mc-butler
Copy link
Author

Changed by andrew_b (@aborodin) on Aug 7, 2022 at 15:53 UTC (comment 18)

  • Milestone changed from Future Releases to 4.8.29
  • Status changed from new to accepted
  • Owner set to andrew_b
  • Branch state changed from no branch to on review

Cool. Thanks!

Branch: 400_multiline_search
[89bb5b60104178097b98e7c7fd2b28bfabc4ebbb]

@mc-butler
Copy link
Author

Changed by andrew_b (@aborodin) on Aug 14, 2022 at 15:01 UTC (comment 19)

  • Votes set to andrew_b
  • Branch state changed from on review to approved

@mc-butler
Copy link
Author

Changed by andrew_b (@aborodin) on Aug 14, 2022 at 15:02 UTC (comment 20)

  • Branch state changed from approved to merged
  • Votes changed from andrew_b to committed-master
  • Resolution set to fixed
  • Status changed from accepted to testing

Merged to master: [08cca8a].

@mc-butler
Copy link
Author

Changed by andrew_b (@aborodin) on Aug 14, 2022 at 15:04 UTC (comment 21)

  • Status changed from testing to closed

@mc-butler
Copy link
Author

Changed by andrew_b (@aborodin) on Jan 15, 2023 at 13:15 UTC (comment 22)

  • Branch state changed from merged to no branch
  • Status changed from closed to reopened
  • Resolution fixed deleted
  • Milestone changed from 4.8.29 to Future Releases
  • Votes committed-master deleted

Multi-line search has some bugs. See #4429 for details.

@mc-butler mc-butler reopened this Feb 27, 2025
@mc-butler mc-butler marked this as a duplicate of #3454 Feb 28, 2025
@mc-butler mc-butler marked this as a duplicate of #4395 Feb 28, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area: search Search subsystem prio: medium Has the potential to affect progress
Development

No branches or pull requests

2 participants